64 core EPYC Rome (Zen2) Architecture Overview?


yuri69

Senior member
Jul 16, 2013
677
1,215
136
Again, this is not black magic or anything new. Ampere's just released ARM Server Processor sports a very similar architecture, except it's monolithic (https://amperecomputing.com/wp-content/uploads/2018/02/ampere-product-brief.pdf). AMD certainly does not lack expertise in this.
Yeah, cache-coherent interconnects for 32+ high-performance cores are routinely and cheaply done; for example, ThunderX2 and Falkor use a ring. No info on X-Gene 3/Ampere. The latency thing creeps back, though (sometimes it gets very scary).

Btw, since people are juggling numerous different dies, chiplets, etc., what about a monolithic 64c die? That would be a truly unexpected and performant SKU.

The 7nm-based Fujitsu A64FX packs 48c, each core with two huge 512-bit SIMD units, plus 4 I/O cores. No L3, but still, why not? Check out their core/CMG/chip/interconnect architecture.
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
Canard has historically been right.
An 8-core CCX for 64c it is.
I wouldn't bet on a beefy IPC uplift then. To keep die size down, along with the socket compatibility we also know will be there, there's only a very slim chance of a wider arch, or TDP would go through the roof, even at 225 W. 64c running a 256-bit-wide FPU at 180 W is a stretch but not impossible. I think that kind of stuff is reserved for Zen 3; it takes time to design, it's a new core, and we don't get that after only 2 years. I think it's more like a 5% IPC uplift in a tuned core, where the uncore gets the major overhaul.
Anyway, as a gamer I want an 8-core CCX for better latency. And gladly in a 25 W TDP form factor for mobile :)
 
  • Like
Reactions: yuri69

french toast

Senior member
Feb 22, 2017
988
825
136
Canard has historically been right.
An 8-core CCX for 64c it is.
I wouldn't bet on a beefy IPC uplift then. To keep die size down, along with the socket compatibility we also know will be there, there's only a very slim chance of a wider arch, or TDP would go through the roof, even at 225 W. 64c running a 256-bit-wide FPU at 180 W is a stretch but not impossible. I think that kind of stuff is reserved for Zen 3; it takes time to design, it's a new core, and we don't get that after only 2 years. I think it's more like a 5% IPC uplift in a tuned core, where the uncore gets the major overhaul.
Anyway, as a gamer I want an 8-core CCX for better latency. And gladly in a 25 W TDP form factor for mobile :)
Have you got a link to the Canard story?
 

Vattila

Senior member
Oct 22, 2004
820
1,456
136
I think if you go with a central chipset to connect to all the dies, you are probably going to forgo interconnects between the individual cpu dies and handle it all through the central chipset.

Well, you can move the network-on-chip inside the System Controller die. But this adds routers, and it increases the minimum distance between CPU chiplets to three hops.

[Attached diagram: proposed topology with the network-on-chip inside the System Controller die]
 
Last edited:
  • Like
Reactions: Zapetu

Abwx

Lifer
Apr 2, 2011
11,884
4,873
136
Lisa said in July that Rome was already sampling; I guess we'll have some welcome info in the next few weeks to put all the wild speculation to rest.

In the meantime I agree with those who pointed out that AMD has too few financial resources to make a radical departure from the current arch. At the 5% market share targeted for the end of this year, I can't see how they could release a product 100% dedicated to servers; if Zeppelin hadn't been reusable for desktop, it would have been a total disaster for AMD, maybe even bankruptcy.

With that in mind I think the plan is to release a 16C chip with 8C/CCX that can be used for desktops as well, in 16/14/12/10C variants, while 8C and below will be served by an APU. That is an indication that an 8C die as the basis for servers doesn't make any sense. Or do people think there will be no 8C APU at 7nm?
 

kokhua

Member
Sep 27, 2018
86
47
91
The 64-core part will be eight 8-core CCXs, with 4 of them being leech dies, similar to how the 32-core Threadripper operates now. For most server loads that shouldn't be an issue. There will be a 10-15% uplift in IPC, drawing roughly the same power as now. It keeps SP3 socket compatibility and everything.

I've considered this too. It's certainly not impossible. But why not just stick with the proven 4-die approach of NAPLES?
 
  • Like
Reactions: ryan20fun

kokhua

Member
Sep 27, 2018
86
47
91
Yeah, cache-coherent interconnects for 32+ high-performance cores are routinely and cheaply done; for example, ThunderX2 and Falkor use a ring. No info on X-Gene 3/Ampere. The latency thing creeps back, though (sometimes it gets very scary).

Btw, since people are juggling numerous different dies, chiplets, etc., what about a monolithic 64c die? That would be a truly unexpected and performant SKU.

The 7nm-based Fujitsu A64FX packs 48c, each core with two huge 512-bit SIMD units, plus 4 I/O cores. No L3, but still, why not? Check out their core/CMG/chip/interconnect architecture.

Thanks for the link to the A64FX. Cool, Fujitsu!

BTW, no one noticed this: my diagram doesn't actually specify the processor architecture. It could be ARM. Or Intel Xeon.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
I've considered this too. It's certainly not impossible. But why not just stick with the proven 4-die approach of NAPLES?

Cost. You have to consider when this was being laid out. An 8-core CCX is far more versatile, not to mention it needs a lot fewer IF connections than a 16-core CCX would. IF is a bit of a power hog as well, so you'd have to lower frequencies to get it into a reasonable TDP. Making a separate 16-core CCX would add more R&D cost, and more mask costs, for a limited number of parts compared to general use.

Staying with just 2 parts, the 8-core CCXs and the APUs, you reduce your R&D, mask costs and time to market. The latter being very important as well.
 

kokhua

Member
Sep 27, 2018
86
47
91
Cost. You have to consider when this was being laid out. An 8-core CCX is far more versatile, not to mention it needs a lot fewer IF connections than a 16-core CCX would. IF is a bit of a power hog as well, so you'd have to lower frequencies to get it into a reasonable TDP. Making a separate 16-core CCX would add more R&D cost, and more mask costs, for a limited number of parts compared to general use.

Staying with just 2 parts, the 8-core CCXs and the APUs, you reduce your R&D, mask costs and time to market. The latter being very important as well.

Is there any reason why you don't think they will make a 16C die with dual 8-core CCXs, similar to the dual 4-core CCXs in NAPLES? That would seem most logical to me if they went with 4x16C.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
Cost. You have to consider when this was being laid out. An 8-core CCX is far more versatile, not to mention it needs a lot fewer IF connections than a 16-core CCX would. IF is a bit of a power hog as well, so you'd have to lower frequencies to get it into a reasonable TDP. Making a separate 16-core CCX would add more R&D cost, and more mask costs, for a limited number of parts compared to general use.

Staying with just 2 parts, the 8-core CCXs and the APUs, you reduce your R&D, mask costs and time to market. The latter being very important as well.

A dual-CCX die with 8 cores each (16 cores per die) would solve other issues, being a direct drop-in upgrade option for Epyc/TR. Though you need either a mesh or a ring bus for an 8-core CCX, as opposed to the current direct-connect design, which makes me wonder why they didn't start with that. That, and I still seriously doubt they are going for 16 cores on regular Ryzen 2 desktops, so I think this design ends up with separate dies for Ryzen desktop and TR/Epyc, plus one more for the APU of course.

But I think the AMD CPU business is now robust enough for 3 designs.
 
Last edited:

jpiniero

Lifer
Oct 1, 2010
16,799
7,249
136
What I was imagining with the IO die approach is something like this:

- An 8 core CPU die, with two memory controllers
- A mainstream IO chiplet
- A server IO chiplet
- Navi 10 GPU, with optional GDDR6 controller

Server: 4 DC/8 SC CPU dies plus the server IO chiplet, maybe later offer a BGA model with 16 channels
TR: 4 SC/4 SC + 4 0C CPU dies plus the server IO chiplet
CPU: 1 DC CPU die plus the mainstream IO chiplet, with a 2 die option later
APU: 1 DC CPU die (cut down for the most part) plus the Navi 10 GPU and the mainstream IO chiplet

The Navi 10 GPU with the GDDR6 enabled would then be reused for the low end discrete GPU.
 

beginner99

Diamond Member
Jun 2, 2009
5,318
1,763
136
With that in mind I think the plan is to release a 16C chip with 8C/CCX that can be used for desktops as well, in 16/14/12/10C variants, while 8C and below will be served by an APU. That is an indication that an 8C die as the basis for servers doesn't make any sense. Or do people think there will be no 8C APU at 7nm?

The problem is that an 8-core CCX is far more complex than a 4-core one, as it will need a ring bus or similar. Much more likely they will stick with the 4-core CCX and put 4 of these on a die.

And yes, I'm pretty sure consumer Zen 2 will be 8-core again. AMD too is in the money-making business, and raising core counts again so soon doesn't make much sense, especially because you can't feed 16 cores from dual-channel DDR4. Same for the APU, where bandwidth is already a huge problem and APUs are for mainstream laptops. Best to be cheap (small) and low-power (just enough cores).

No, core count will not change again so soon, and while it's now more likely, I still doubt an 8-core CCX, because then the APU would need its own 4-core CCX. I mean, the hype around AMD is always huge, and then the released product ends up being a let-down due to the insane hype. On the other hand, we now know the next APU will still be on 12nm, and a 7nm APU will be H1 2020 at the earliest, so by then an 8-core APU could make sense and an 8-core CCX is possible. It's one of the low-hanging fruits for a desktop performance uplift.
 
  • Like
Reactions: Vattila

yuri69

Senior member
Jul 16, 2013
677
1,215
136
The problem is that an 8-core CCX is far more complex than a 4-core one, as it will need a ring bus or similar. Much more likely they will stick with the 4-core CCX and put 4 of these on a die.
It seems to be a matter of an energy-efficiency vs. latency tradeoff.

IIRC, back in the day the UltraSPARC T2 used a crossbar for its 8 cores. Also, the A64FX linked in a previous post has its own CCX equivalent, called a CMG, which packs 12 cores (!) connected via a crossbar to its L2.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
Is there any reason why you don't think they will make a 16C die with dual 8-core CCXs, similar to the dual 4-core CCXs in NAPLES? That would seem most logical to me if they went with 4x16C.

A dual-CCX die with 8 cores each (16 cores per die) would solve other issues, being a direct drop-in upgrade option for Epyc/TR. Though you need either a mesh or a ring bus for an 8-core CCX, as opposed to the current direct-connect design, which makes me wonder why they didn't start with that. That, and I still seriously doubt they are going for 16 cores on regular Ryzen 2 desktops, so I think this design ends up with separate dies for Ryzen desktop and TR/Epyc, plus one more for the APU of course.

But I think the AMD CPU business is now robust enough for 3 designs.

Anything is possible, I suppose. And you could be right. I just don't see them changing the topology, going to a new node, and making architecture changes (gathering the low-hanging fruit) all at the same time. Each one of those has a chance of breaking their cadence. And being on time, meeting their roadmap, is vital to re-establishing trust in the server segment.
 

french toast

Senior member
Feb 22, 2017
988
825
136
Anything is possible, I suppose. And you could be right. I just don't see them changing the topology, going to a new node, and making architecture changes (gathering the low-hanging fruit) all at the same time. Each one of those has a chance of breaking their cadence. And being on time, meeting their roadmap, is vital to re-establishing trust in the server segment.
Agreed. I don't think AMD is going to be too ambitious with Rome; the last thing AMD needs is a delayed mess.

I think they will play it safe.
 

lixlax

Senior member
Nov 6, 2014
204
196
116
I agree with the last few posts. The chiplet design seems like too much too soon, and we have to remember that AMD is still much smaller and more money-constrained than Intel (too constrained to risk that much).

If Rome is indeed a 64-core part, then it most likely contains 4 dies, each having 4 four-core CCXs.
But AM4 will be harder to guess:
1) It could be the same 16c die as Rome.
2) It could be a similar 8c die to what we have now, but with improved IPC, frequency and latencies.
3) It could be a separate 12-core die with either 2x 6c CCXs or 3x 4c CCXs.

The reason I see option 3 as possible is that in the roadmaps from ~2 years ago they had core counts up to 48 for 7nm Epyc, which means they had to have been working on a 12c die. After seeing how competitive first-gen Epyc was, and the issues Intel is having, they probably smelt blood and threw resources at building a 16c die. That's why I think there could be a separate 12c die.
 
  • Like
Reactions: Vattila

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
I would assume topology is heavily dependent on trace length and therefore die size? So going to 7nm and keeping a relatively slim core would keep trace lengths down and therefore enable a denser topology, aka an 8-core CCX?
 

lixlax

Senior member
Nov 6, 2014
204
196
116
I would assume topology is heavily dependent on trace length and therefore die size? So going to 7nm and keeping a relatively slim core would keep trace lengths down and therefore enable a denser topology, aka an 8-core CCX?
It seems to be the general consensus here that with 6c and especially 8c CCXs the number of traces/interconnects is going to be too high/complex. But on the other hand, if it's 4 CCXs they'll need 6 Infinity Fabric links to connect them fully; at the moment they have 1. I think that'll be easier, but with a higher latency penalty (which would be terrible for AM4/consumer, since the gaming argument is king there).
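For what it's worth, the 1-vs-6 figure follows directly from the full-mesh formula n(n-1)/2. A quick sketch of that scaling (my own arithmetic, just to make the trade-off explicit):

```python
# Links needed to directly connect every pair of n CCXs (full mesh): n*(n-1)/2.
# Today's Zeppelin die has 2 CCXs and hence a single inter-CCX link.
def full_mesh_links(n: int) -> int:
    return n * (n - 1) // 2

for ccxs in (2, 4, 8):
    print(f"{ccxs} CCXs fully connected -> {full_mesh_links(ccxs)} link(s)")
# 2 CCXs fully connected -> 1 link(s)
# 4 CCXs fully connected -> 6 link(s)
# 8 CCXs fully connected -> 28 link(s)
```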
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,072
3,897
136
If they have 8 cores in a CCX I would expect a ring; the L3 is already sliced per core, so it seems logical.
 
Mar 11, 2004
23,444
5,852
146
Something to consider: from what I've gathered, early Zen 2 will top out at 48 cores. 64-core EPYC will happen, but I'm not sure it happens as quickly as people think. If it's further in the future (or Zen 3), I think that will drastically change the speculation.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
You know, their existing product stack works with 3 dies.
1) 4 x 4-core CCX, 2 x DDR channels, I/O and glue logic
2) 3 x 4-core CCX, 1 x ~4 CU iGPU, 2 x DDR channels, I/O and glue logic
3) 2 x 4-core CCX, 1 x 12 CU iGPU, 2 x DDR channels, I/O, no glue logic

Epyc and TR can be made with Dies 1 and 2. Desktop can be made with Dies 1, 2 and 3. Mobile can be made with Die 3. DDR bandwidth is expected to grow to DDR4-3200. That's enough to feed 12 cores/24 threads and mostly keep up with 16/32 (remember, that's 32 MB of L3 cache too).

And if you eliminate #2, you still have a complete stack, but you lose a minor competitive point with Intel vis-à-vis the iGPU.
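As a rough sanity check of that bandwidth claim (peak theoretical numbers only, my own arithmetic, not from the post): dual-channel DDR4-3200 tops out at 51.2 GB/s, i.e. about 4.3 GB/s per core at 12 cores and 3.2 GB/s at 16, versus roughly 5.3 GB/s per core for an 8-core die on dual-channel DDR4-2666 today.

```python
# Peak theoretical bandwidth per DDR4 channel = transfer rate (MT/s) * 8 bytes.
def dual_channel_bw_gbs(mt_per_s: int) -> float:
    return 2 * mt_per_s * 8 / 1000  # GB/s for a dual-channel setup

for speed, cores in ((2666, 8), (3200, 12), (3200, 16)):
    bw = dual_channel_bw_gbs(speed)
    print(f"DDR4-{speed}, {cores} cores: {bw:.1f} GB/s total, {bw / cores:.1f} GB/s per core")
# DDR4-2666, 8 cores: 42.7 GB/s total, 5.3 GB/s per core
# DDR4-3200, 12 cores: 51.2 GB/s total, 4.3 GB/s per core
# DDR4-3200, 16 cores: 51.2 GB/s total, 3.2 GB/s per core
```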
 

kokhua

Member
Sep 27, 2018
86
47
91
You know, their existing product stack works with 3 dies.
1) 4 x 4-core CCX, 2 x DDR channels, I/O and glue logic
2) 3 x 4-core CCX, 1 x ~4 CU iGPU, 2 x DDR channels, I/O and glue logic
3) 2 x 4-core CCX, 1 x 12 CU iGPU, 2 x DDR channels, I/O, no glue logic

Epyc and TR can be made with Dies 1 and 2. Desktop can be made with Dies 1, 2 and 3. Mobile can be made with Die 3. DDR bandwidth is expected to grow to DDR4-3200. That's enough to feed 12 cores/24 threads and mostly keep up with 16/32 (remember, that's 32 MB of L3 cache too).

And if you eliminate #2, you still have a complete stack, but you lose a minor competitive point with Intel vis-à-vis the iGPU.

Any idea what the die size might be for #3? At 12/14nm or 7nm.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
For 12/14nm, I don't think any of those products are viable with AMD's approach. They seem not to want to go much larger than 200mm^2.

For Die 1, I suspect that it would be 70% larger than the existing 2700 die if implemented at 12nm.
For Die 2, I suspect that it would be under 60% larger due to having the iGPU section be smaller than a CCX.
For Die 3, I suspect that it would be about 50% larger than Raven Ridge at 12nm.

However, at 7nm, as we're led to believe by various sources, it's roughly 25-30% of the size at 12nm for the same mask complexity. I.e., if Raven Ridge were just straight reproduced at 7nm, it would be around 60mm^2. Following that rough approximation, you get the following estimates:
1) ~120mm^2
2) ~105mm^2
3) ~95mm^2-~105mm^2 depending on L3 cache size in the CCX units. I propose that they would need to go to 8MB to alleviate memory bus contention for the iGPU.

These are all educated guesses.
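The ~60mm^2 Raven Ridge figure is just the quoted 25-30% factor applied to its ~210mm^2 14nm die. A small sketch of that arithmetic (my own, using the factor stated above, so only as good as that assumption):

```python
# Estimated 7nm area = 12/14nm area * scale factor (assumed 0.25-0.30 above).
RAVEN_RIDGE_14NM_MM2 = 210  # widely reported Raven Ridge die size

def area_at_7nm(area_12nm_mm2: float, scale: float) -> float:
    return area_12nm_mm2 * scale

for scale in (0.25, 0.28, 0.30):
    print(f"Raven Ridge shrunk at {scale:.0%} -> ~{area_at_7nm(RAVEN_RIDGE_14NM_MM2, scale):.0f} mm^2")
# Raven Ridge shrunk at 25% -> ~52 mm^2
# Raven Ridge shrunk at 28% -> ~59 mm^2
# Raven Ridge shrunk at 30% -> ~63 mm^2
```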
 

kokhua

Member
Sep 27, 2018
86
47
91
For 12/14nm, I don't think any of those products are viable with AMD's approach. They seem not to want to go much larger than 200mm^2.

For Die 1, I suspect that it would be 70% larger than the existing 2700 die if implemented at 12nm.
For Die 2, I suspect that it would be under 60% larger due to having the iGPU section be smaller than a CCX.
For Die 3, I suspect that it would be about 50% larger than Raven Ridge at 12nm.

However, at 7nm, as we're led to believe by various sources, it's roughly 25-30% of the size at 12nm for the same mask complexity. I.e., if Raven Ridge were just straight reproduced at 7nm, it would be around 60mm^2. Following that rough approximation, you get the following estimates:
1) ~120mm^2
2) ~105mm^2
3) ~95mm^2-~105mm^2 depending on L3 cache size in the CCX units. I propose that they would need to go to 8MB to alleviate memory bus contention for the iGPU.

These are all educated guesses.

Here's my estimate for Die #3:

* Raven Ridge die size at 14nm = 210mm^2
* Assuming wafer price = $8K and yield = 80%, Raven Ridge die costs ~$34.50 to make
* Doubling the core count to 8, increasing L3 cache to 32MB (4MB/core), and adding 1 more CU increases die size by about 1/3 to ~280mm^2
* Divide by density scaling factor of 2.3 (GloFo 14nm to TSMC 7nm) = 280/2.3 = ~122mm^2
* Assuming wafer price = $10K and yield = 70%, manufacturing cost for this die = ~$27.50
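For anyone who wants to reproduce those figures: a quick sketch using the standard dies-per-wafer approximation (the wafer prices and yields are the assumptions stated above, not known numbers) lands within a dollar of both estimates.

```python
# Die cost = wafer price / (dies per wafer * yield), with the classic
# dies-per-wafer approximation: wafer area / die area minus an edge-loss term.
import math

WAFER_DIAMETER_MM = 300

def dies_per_wafer(die_area_mm2: float) -> float:
    d = WAFER_DIAMETER_MM
    return (math.pi * (d / 2) ** 2) / die_area_mm2 - (math.pi * d) / math.sqrt(2 * die_area_mm2)

def cost_per_good_die(die_area_mm2: float, wafer_price: float, yield_rate: float) -> float:
    return wafer_price / (dies_per_wafer(die_area_mm2) * yield_rate)

print(f"14nm, 210 mm^2, $8K wafer, 80% yield: ${cost_per_good_die(210, 8000, 0.80):.2f}")
print(f"7nm, 122 mm^2, $10K wafer, 70% yield: ${cost_per_good_die(122, 10000, 0.70):.2f}")
# 14nm, 210 mm^2, $8K wafer, 80% yield: $34.41
# 7nm, 122 mm^2, $10K wafer, 70% yield: $27.52
```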
 
Last edited:

Vattila

Senior member
Oct 22, 2004
820
1,456
136
See below for a slight evolution of my earlier topology diagram. Assuming the 9-chiplet rumour turns out to be true, this is my best guess for how the topology may look. "Rome" may have an even more exotic network-on-chip, but this is what I guess a basic solution might look like. The black interconnections show the bare minimum connectivity (3 ports per CPU chiplet), and the gray connections improve connectivity further (5 ports).

I have now added (gray) interconnections between the chiplets at the corners. I've drawn them out into a square to mirror the gray inner square. With this addition, the design is fully regular, with all 5 ports connected for all the CPU chiplets, with each chiplet being directly connected to its closest 5 neighbours. Since the design is regular, there is a great number of ways to partition this 64-core CPU effectively.

For example, the inner and outer gray squares each form a 32-core partition with 4 well-connected CPU chiplets, where each chiplet is directly connected to its closest neighbours, with an extra hop to the chiplet at the opposite side. Alternatively, each quadrant forms a fully-connected partition with 3 CPU chiplets, allowing the CPU to be partitioned into two 24-core partitions, plus e.g. two 8-core partitions using the remaining cores. Further alternatives are four fully-connected 16-core partitions (two chiplets), eight 8-core partitions (one chiplet), sixteen 4-core partitions (one CCX), etc. With all these options, the chip should be a dream for virtualisation.

Regarding future extension, it seems obvious where the extra CCXs would go. With the potential for 4 fully-connected CCXs per chiplet, the design would top out at 128 cores per CPU. To feed that beast you would want a doubling in bandwidth, which hopefully will be possible with DDR5 and/or HBM (e.g. imagine 4 GB of HBM mounted on top of each CPU chiplet).

[Attached diagram: revised 9-chiplet topology]
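To make the hop-count and partitioning claims concrete, here is a small sketch encoding one plausible reading of the diagram: 8 CPU chiplets on a ring, each wired to its two nearest and two next-nearest neighbours plus the System Controller (five ports). The adjacency is my own guess at the layout, not anything confirmed; a BFS over it reproduces the "direct neighbour vs. one extra hop to the opposite side" behaviour and the fully-connected three-chiplet quadrants described above.

```python
# Hypothetical adjacency for the 9-chiplet layout described above: chiplets 0-7
# plus a central System Controller ("SC"). Each chiplet links to its ring
# neighbours at distance 1 and 2, and to the SC -- five ports per chiplet.
from collections import deque

adj = {node: set() for node in list(range(8)) + ["SC"]}
for i in range(8):
    for step in (1, 2):
        j = (i + step) % 8
        adj[i].add(j)
        adj[j].add(i)
    adj[i].add("SC")
    adj["SC"].add(i)

def hops(src, dst):
    """Shortest hop count between two nodes (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(hops(0, 2))  # 1 hop: neighbours within the same 4-chiplet "square"
print(hops(0, 4))  # 2 hops: the chiplet on the opposite side needs an extra hop
print(hops(0, 1))  # 1 hop: chiplets 0, 1 and 2 form a fully-connected 24-core triple
```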
 
Last edited: