64 core EPYC Rome (Zen2) Architecture Overview?


yuri69

Senior member
Jul 16, 2013
677
1,215
136
Again, this is not black magic or anything new. Ampere's just released ARM Server Processor sports a very similar architecture, except it's monolithic (https://amperecomputing.com/wp-content/uploads/2018/02/ampere-product-brief.pdf). AMD certainly does not lack expertise in this.
Yeah, cache-coherent interconnects for 32+ high-performance cores are routinely and cheaply done; for example, ThunderX2 and Falkor use a ring. No info on X-Gene 3/Ampere. The latency thing creeps back, though (sometimes it gets very scary).

Btw, since people are juggling numerous different dies, chiplets, etc., what about a monolithic 64c die? That would be a truly unexpected and performant SKU.

The 7nm-based Fujitsu A64FX packs 48c, each core with two huge 512-bit SIMD units, plus 4 I/O cores. No L3, but still, why not? Check out their core/CMG/chip/interconnect architecture.
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
Canard has historically been right.
An 8-core CCX for 64c it is.
I wouldn't bet on a beefy IPC uplift then. To keep die size down, along with the socket compatibility we also know will be there, there's only a very slim chance of a wider arch, or TDP would go through the roof, even at 225 W. 64c running a 256-bit-wide FPU at 180 W is a stretch but not impossible. I think that kind of stuff is reserved for Zen 3; it takes time to design, it's a new core, and we don't get that after only 2 years. I think it's more like a 5% IPC uplift in a tuned core, where the uncore gets the major overhaul.
Anyway, as a gamer I want an 8-core CCX for better latency. And gladly in a 25 W TDP form factor for mobile :)
 
  • Like
Reactions: yuri69

french toast

Senior member
Feb 22, 2017
988
825
136
Canard has historically been right.
An 8-core CCX for 64c it is.
I wouldn't bet on a beefy IPC uplift then. To keep die size down, along with the socket compatibility we also know will be there, there's only a very slim chance of a wider arch, or TDP would go through the roof, even at 225 W. 64c running a 256-bit-wide FPU at 180 W is a stretch but not impossible. I think that kind of stuff is reserved for Zen 3; it takes time to design, it's a new core, and we don't get that after only 2 years. I think it's more like a 5% IPC uplift in a tuned core, where the uncore gets the major overhaul.
Anyway, as a gamer I want an 8-core CCX for better latency. And gladly in a 25 W TDP form factor for mobile :)
Have you got a link to the Canard story?
 

Vattila

Senior member
Oct 22, 2004
820
1,456
136
I think if you go with a central chipset to connect to all the dies, you are probably going to forgo interconnects between the individual cpu dies and handle it all through the central chipset.

Well, you can move the network-on-chip inside the System Controller die. But this adds routers, and it increases the minimum distance between CPU chiplets to three hops.

[Attached diagram: proposed topology with the network-on-chip inside the System Controller die]
 
Last edited:
  • Like
Reactions: Zapetu

Abwx

Lifer
Apr 2, 2011
11,884
4,873
136
Lisa said in July that Rome was already sampling; I guess we'll have some welcome info in the next few weeks to put all the wild speculation to rest.

In the meantime I agree with those who pointed out that AMD has too few financial resources to make a radical departure from the current arch. At the 5% market share targeted for the end of this year, I can't see how they could release a product 100% dedicated to servers; if Zeppelin hadn't been reusable for desktop, it would have been a total disaster for AMD, maybe even bankruptcy.

With that in mind I think the plan is to release a 16C chip with 8C/CCX that can be used for desktops as well, in 16/14/12/10C variants, while 8C and below will be served by an APU. That is an indication that an 8C die as the basis for servers doesn't make any sense. Or do people think there will be no 8C APU at 7nm?
 

kokhua

Member
Sep 27, 2018
86
47
91
The 64-core part will be eight 8-core CCXs, with 4 of them being leech dies, similar to how the 32-core Threadripper operates now. For most server loads that shouldn't be an issue. There will be a 10-15% uplift in IPC, drawing roughly the same power as now. It keeps SP3 socket compatibility and everything.

I've considered this too. It's certainly not impossible. But why not just stick with the proven 4-die approach of NAPLES?
 
  • Like
Reactions: ryan20fun

kokhua

Member
Sep 27, 2018
86
47
91
Yeah, cache-coherent interconnects for 32+ high-performance cores are routinely and cheaply done; for example, ThunderX2 and Falkor use a ring. No info on X-Gene 3/Ampere. The latency thing creeps back, though (sometimes it gets very scary).

Btw, since people are juggling numerous different dies, chiplets, etc., what about a monolithic 64c die? That would be a truly unexpected and performant SKU.

The 7nm-based Fujitsu A64FX packs 48c, each core with two huge 512-bit SIMD units, plus 4 I/O cores. No L3, but still, why not? Check out their core/CMG/chip/interconnect architecture.

Thanks for the link to the A64FX. Cool, Fujitsu!

BTW, no one noticed this: my diagram doesn't actually specify the processor architecture. It could be ARM. Or Intel Xeon.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
I've considered this too. It's certainly not impossible. But why not just stick with the proven 4-die approach of NAPLES?

Cost. You have to consider when this was being laid out. An 8-core CCX is far more versatile, not to mention it needs a lot fewer IF connections than a 16-core CCX would. IF is a bit of a power hog as well, so you'd have to lower frequencies to get it into a reasonable TDP. Making a separate 16-core CCX would add more R&D cost, and more mask costs, for a limited number of parts compared to general use.

Staying with just 2 parts, the 8-core CCXs and the APUs, you reduce your R&D, mask costs and time to market. The latter being very important as well.
 

kokhua

Member
Sep 27, 2018
86
47
91
Cost. You have to consider when this was being laid out. An 8-core CCX is far more versatile, not to mention it needs a lot fewer IF connections than a 16-core CCX would. IF is a bit of a power hog as well, so you'd have to lower frequencies to get it into a reasonable TDP. Making a separate 16-core CCX would add more R&D cost, and more mask costs, for a limited number of parts compared to general use.

Staying with just 2 parts, the 8-core CCXs and the APUs, you reduce your R&D, mask costs and time to market. The latter being very important as well.

Is there any reason why you don't think they will make a 16C die with dual 8-core CCXs, similar to the dual 4-core CCXs in NAPLES? That would seem most logical to me if they went with 4x16C.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
Cost. You have to consider when this was being laid out. An 8-core CCX is far more versatile, not to mention it needs a lot fewer IF connections than a 16-core CCX would. IF is a bit of a power hog as well, so you'd have to lower frequencies to get it into a reasonable TDP. Making a separate 16-core CCX would add more R&D cost, and more mask costs, for a limited number of parts compared to general use.

Staying with just 2 parts, the 8-core CCXs and the APUs, you reduce your R&D, mask costs and time to market. The latter being very important as well.

A dual-CCX die with 8 cores each (16 cores per die) would solve other issues, being a direct drop-in upgrade option for Epyc/TR. Though you need either a mesh or a ring bus for an 8-core CCX, as opposed to the current direct-connect design, which makes me wonder why they didn't start with that. That, and I still seriously doubt they are going for 16 cores on regular Ryzen 2 desktops, so I think this design ends up with separate dies for Ryzen desktop and TR/Epyc, plus one more for the APU of course.

But I think the AMD CPU business is now robust enough for 3 designs.
 
Last edited:

jpiniero

Lifer
Oct 1, 2010
16,799
7,249
136
What I was imagining with the IO die approach is something like this:

- An 8 core CPU die, with two memory controllers
- A mainstream IO chiplet
- A server IO chiplet
- Navi 10 GPU, with optional GDDR6 controller

Server: 4 DC/8 SC CPU dies plus the server IO chiplet, maybe later offer a BGA model with 16 channels
TR: 4 SC/4 SC + 4 0C CPU dies plus the server IO chiplet
CPU: 1 DC CPU die plus the mainstream IO chiplet, with a 2 die option later
APU: 1 DC CPU die (cut down for the most part) plus the Navi 10 GPU and the mainstream IO chiplet

The Navi 10 GPU with the GDDR6 enabled would then be reused for the low end discrete GPU.
 

beginner99

Diamond Member
Jun 2, 2009
5,318
1,763
136
With that in mind I think the plan is to release a 16C chip with 8C/CCX that can be used for desktops as well, in 16/14/12/10C variants, while 8C and below will be served by an APU. That is an indication that an 8C die as the basis for servers doesn't make any sense. Or do people think there will be no 8C APU at 7nm?

The problem is that an 8-core CCX is far more complex than a 4-core one, as it will need a ring bus or similar. Much more likely they will stick with the 4-core CCX and put 4 of these on a die.

And yes, I'm pretty sure consumer Zen 2 will be 8-core again. AMD too is in the money-making business, and raising core counts again so soon doesn't make much sense, especially because you can't feed 16 cores from dual-channel DDR4. Same for the APU, where bandwidth is already a huge problem and APUs are for mainstream laptops. Best to be cheap (small) and low-power (just enough cores).

No, core count will not change again so soon, and while it's now more likely, I still doubt an 8-core CCX, because then the APU would need its own 4-core CCX. I mean, the hype around AMD is always huge, and then the released product ends up being a let-down due to the insane hype. On the other hand, we now know the next APU will still be on 12nm, and a 7nm APU will be H1 2020 at the earliest, so by then an 8-core APU could make sense and an 8-core CCX is possible. It's one of the low-hanging fruits for a desktop performance uplift.
 
  • Like
Reactions: Vattila

yuri69

Senior member
Jul 16, 2013
677
1,215
136
The problem is that an 8-core CCX is far more complex than a 4-core one, as it will need a ring bus or similar. Much more likely they will stick with the 4-core CCX and put 4 of these on a die.
It seems to be a matter of an energy-efficiency vs. latency tradeoff.

IIRC, back in the day the UltraSPARC T2 used a crossbar for its 8 cores. Also, the A64FX linked in a previous post has its own CCX equivalent, called a CMG, which packs 12 cores (!) connected via a crossbar to its L2.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
Is there any reason why you don't think they will make a 16C die with dual 8-core CCXs, similar to the dual 4-core CCXs in NAPLES? That would seem most logical to me if they went with 4x16C.

A dual-CCX die with 8 cores each (16 cores per die) would solve other issues, being a direct drop-in upgrade option for Epyc/TR. Though you need either a mesh or a ring bus for an 8-core CCX, as opposed to the current direct-connect design, which makes me wonder why they didn't start with that. That, and I still seriously doubt they are going for 16 cores on regular Ryzen 2 desktops, so I think this design ends up with separate dies for Ryzen desktop and TR/Epyc, plus one more for the APU of course.

But I think the AMD CPU business is now robust enough for 3 designs.

Anything is possible, I suppose. And you could be right. I just don't see them changing the topology, going to a new node, and making architecture changes (gathering the low-hanging fruit) all at the same time. Each one of those has a chance of breaking their cadence. And being on time, meeting their roadmap, is vital to re-establishing trust in the server segment.
 

french toast

Senior member
Feb 22, 2017
988
825
136
Anything is possible, I suppose. And you could be right. I just don't see them changing the topology, going to a new node, and making architecture changes (gathering the low-hanging fruit) all at the same time. Each one of those has a chance of breaking their cadence. And being on time, meeting their roadmap, is vital to re-establishing trust in the server segment.
Agreed. I don't think AMD is going to be too ambitious with Rome; the last thing AMD needs is a delayed mess.

I think they will play it safe.
 

lixlax

Senior member
Nov 6, 2014
204
196
116
I agree with the last few posts. The chiplet design seems like too much too soon, and we have to remember that AMD is still much smaller and more money-constrained than Intel (too constrained to risk that much).

If Rome is indeed a 64-core part, then it most likely contains 4 dies, each having 4 four-core CCXs.
But AM4 will be harder to guess:
1) It could be the same 16c die as Rome.
2) It could be a similar 8c die to what we have now, but with improved IPC, frequency and latencies.
3) It could be a separate 12-core die with either 2x 6c CCXs or 3x 4c CCXs.

The reason I see option 3 as possible is that in the roadmaps from ~2 years ago they had core counts up to 48 for 7nm Epyc, which means they had to have been working on a 12c die. After seeing how competitive first-gen Epyc was, and the issues Intel is having, they probably smelt blood and threw resources at building a 16c die. That's why I think there could be a separate 12c die.
 
  • Like
Reactions: Vattila

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
I would assume topology is heavily dependent on trace length and therefore die size? So going to 7nm and keeping a relatively slim core would keep trace lengths down and therefore enable a denser topology, aka an 8-core CCX?
 

lixlax

Senior member
Nov 6, 2014
204
196
116
I would assume topology is heavily dependent on trace length and therefore die size? So going to 7nm and keeping a relatively slim core would keep trace lengths down and therefore enable a denser topology, aka an 8-core CCX?
It seems to be the general consensus here that with 6c and especially 8c CCXs the number of traces/interconnects is going to be too high/complex. But on the other hand, if it's 4 CCXs they'll need 6 Infinity Fabric links to connect them fully; at the moment they have 1. I think that'll be easier, but with a higher latency penalty (which would be terrible for AM4/consumer, since the gaming argument is king there).
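For what it's worth, the 1-vs-6 figure follows directly from the full-mesh formula n(n-1)/2. A quick sketch of that scaling (my own arithmetic, just to make the trade-off explicit):

```python
# Links needed to directly connect every pair of n CCXs (full mesh): n*(n-1)/2.
# Today's Zeppelin die has 2 CCXs and hence a single inter-CCX link.
def full_mesh_links(n: int) -> int:
    return n * (n - 1) // 2

for ccxs in (2, 4, 8):
    print(f"{ccxs} CCXs fully connected -> {full_mesh_links(ccxs)} link(s)")
# 2 CCXs fully connected -> 1 link(s)
# 4 CCXs fully connected -> 6 link(s)
# 8 CCXs fully connected -> 28 link(s)
```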
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,072
3,897
136
If they have 8 cores in a CCX I would expect a ring; the L3 is already sliced per core, so it seems logical.
 
Mar 11, 2004
23,444
5,852
146
Something to consider: from what I've gathered, early Zen 2 will top out at 48 cores. 64-core EPYC will happen, but I'm not sure it happens as quickly as people think. If it's further in the future (or Zen 3), I think that will drastically change the speculation.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
You know, their existing product stack works with 3 dies.
1) 4 x 4-core CCX, 2 x DDR channels, I/O and glue logic
2) 3 x 4-core CCX, 1 x ~4 CU iGPU, 2 x DDR channels, I/O and glue logic
3) 2 x 4-core CCX, 1 x 12 CU iGPU, 2 x DDR channels, I/O, no glue logic

Epyc and TR can be made with Dies 1 and 2. Desktop can be made with Dies 1, 2 and 3. Mobile can be made with Die 3. DDR bandwidth is expected to grow to DDR4-3200. That's enough to feed 12 cores/24 threads and mostly keep up with 16/32 (remember, that's 32 MB of L3 cache too).

And if you eliminate #2, you still have a complete stack, but you lose a minor competitive point with Intel vis-à-vis the iGPU.
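As a rough sanity check of that bandwidth claim (peak theoretical numbers only, my own arithmetic, not from the post): dual-channel DDR4-3200 tops out at 51.2 GB/s, i.e. about 4.3 GB/s per core at 12 cores and 3.2 GB/s at 16, versus roughly 5.3 GB/s per core for an 8-core die on dual-channel DDR4-2666 today.

```python
# Peak theoretical bandwidth per DDR4 channel = transfer rate (MT/s) * 8 bytes.
def dual_channel_bw_gbs(mt_per_s: int) -> float:
    return 2 * mt_per_s * 8 / 1000  # GB/s for a dual-channel setup

for speed, cores in ((2666, 8), (3200, 12), (3200, 16)):
    bw = dual_channel_bw_gbs(speed)
    print(f"DDR4-{speed}, {cores} cores: {bw:.1f} GB/s total, {bw / cores:.1f} GB/s per core")
# DDR4-2666, 8 cores: 42.7 GB/s total, 5.3 GB/s per core
# DDR4-3200, 12 cores: 51.2 GB/s total, 4.3 GB/s per core
# DDR4-3200, 16 cores: 51.2 GB/s total, 3.2 GB/s per core
```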
 

kokhua

Member
Sep 27, 2018
86
47
91
You know, their existing product stack works with 3 dies.
1) 4 x 4-core CCX, 2 x DDR channels, I/O and glue logic
2) 3 x 4-core CCX, 1 x ~4 CU iGPU, 2 x DDR channels, I/O and glue logic
3) 2 x 4-core CCX, 1 x 12 CU iGPU, 2 x DDR channels, I/O, no glue logic

Epyc and TR can be made with Dies 1 and 2. Desktop can be made with Dies 1, 2 and 3. Mobile can be made with Die 3. DDR bandwidth is expected to grow to DDR4-3200. That's enough to feed 12 cores/24 threads and mostly keep up with 16/32 (remember, that's 32 MB of L3 cache too).

And if you eliminate #2, you still have a complete stack, but you lose a minor competitive point with Intel vis-à-vis the iGPU.

Any idea what the die size might be for #3? At 12/14nm or 7nm.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
For 12/14nm, I don't think any of those products are viable with AMD's approach. They seem not to want to go much larger than 200mm^2.

For Die 1, I suspect that it would be 70% larger than the existing 2700 die if implemented at 12nm.
For Die 2, I suspect that it would be under 60% larger due to having the iGPU section be smaller than a CCX.
For Die 3, I suspect that it would be about 50% larger than Raven Ridge at 12nm.

However, at 7nm, as we're led to believe by various sources, it's roughly 25-30% of the size at 12nm for the same mask complexity. I.e., if Raven Ridge were just straight reproduced at 7nm, it would be around 60mm^2. Following that rough approximation, you get the following estimates:
1) ~120mm^2
2) ~105mm^2
3) ~95mm^2-~105mm^2 depending on L3 cache size in the CCX units. I propose that they would need to go to 8MB to alleviate memory bus contention for the iGPU.

These are all educated guesses.
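The ~60mm^2 Raven Ridge figure is just the quoted 25-30% factor applied to its ~210mm^2 14nm die. A small sketch of that arithmetic (my own, using the factor stated above, so only as good as that assumption):

```python
# Estimated 7nm area = 12/14nm area * scale factor (assumed 0.25-0.30 above).
RAVEN_RIDGE_14NM_MM2 = 210  # widely reported Raven Ridge die size

def area_at_7nm(area_12nm_mm2: float, scale: float) -> float:
    return area_12nm_mm2 * scale

for scale in (0.25, 0.28, 0.30):
    print(f"Raven Ridge shrunk at {scale:.0%} -> ~{area_at_7nm(RAVEN_RIDGE_14NM_MM2, scale):.0f} mm^2")
# Raven Ridge shrunk at 25% -> ~52 mm^2
# Raven Ridge shrunk at 28% -> ~59 mm^2
# Raven Ridge shrunk at 30% -> ~63 mm^2
```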
 

kokhua

Member
Sep 27, 2018
86
47
91
For 12/14nm, I don't think any of those products are viable with AMD's approach. They seem not to want to go much larger than 200mm^2.

For Die 1, I suspect that it would be 70% larger than the existing 2700 die if implemented at 12nm.
For Die 2, I suspect that it would be under 60% larger due to having the iGPU section be smaller than a CCX.
For Die 3, I suspect that it would be about 50% larger than Raven Ridge at 12nm.

However, at 7nm, as we're led to believe by various sources, it's roughly 25-30% of the size at 12nm for the same mask complexity. I.e., if Raven Ridge were just straight reproduced at 7nm, it would be around 60mm^2. Following that rough approximation, you get the following estimates:
1) ~120mm^2
2) ~105mm^2
3) ~95mm^2-~105mm^2 depending on L3 cache size in the CCX units. I propose that they would need to go to 8MB to alleviate memory bus contention for the iGPU.

These are all educated guesses.

Here's my estimate for Die #3:

* Raven Ridge die size at 14nm = 210mm^2
* Assuming wafer price = $8K and yield = 80%, Raven Ridge die costs ~$34.50 to make
* Doubling the core count to 8, increasing L3 cache to 32MB (4MB/core), and adding 1 more CU increases die size by about 1/3 to ~280mm^2
* Divide by density scaling factor of 2.3 (GloFo 14nm to TSMC 7nm) = 280/2.3 = ~122mm^2
* Assuming wafer price = $10K and yield = 70%, manufacturing cost for this die = ~$27.50
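For anyone who wants to reproduce those figures: a quick sketch using the standard dies-per-wafer approximation (the wafer prices and yields are the assumptions stated above, not known numbers) lands within a dollar of both estimates.

```python
# Die cost = wafer price / (dies per wafer * yield), with the classic
# dies-per-wafer approximation: wafer area / die area minus an edge-loss term.
import math

WAFER_DIAMETER_MM = 300

def dies_per_wafer(die_area_mm2: float) -> float:
    d = WAFER_DIAMETER_MM
    return (math.pi * (d / 2) ** 2) / die_area_mm2 - (math.pi * d) / math.sqrt(2 * die_area_mm2)

def cost_per_good_die(die_area_mm2: float, wafer_price: float, yield_rate: float) -> float:
    return wafer_price / (dies_per_wafer(die_area_mm2) * yield_rate)

print(f"14nm, 210 mm^2, $8K wafer, 80% yield: ${cost_per_good_die(210, 8000, 0.80):.2f}")
print(f"7nm, 122 mm^2, $10K wafer, 70% yield: ${cost_per_good_die(122, 10000, 0.70):.2f}")
# 14nm, 210 mm^2, $8K wafer, 80% yield: $34.41
# 7nm, 122 mm^2, $10K wafer, 70% yield: $27.52
```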
 
Last edited:

Vattila

Senior member
Oct 22, 2004
820
1,456
136
See below for a slight evolution of my earlier topology diagram. Assuming the 9-chiplet rumour turns out to be true, this is my best guess for how the topology may look. "Rome" may have an even more exotic network-on-chip, but this is what I guess a basic solution might look like. The black interconnections show the bare minimum connectivity (3 ports per CPU chiplet), and the gray connections improve connectivity further (5 ports).

I have now added (gray) interconnections between the chiplets at the corners. I've drawn them out into a square to mirror the gray inner square. With this addition, the design is fully regular, with all 5 ports connected for all the CPU chiplets, with each chiplet being directly connected to its closest 5 neighbours. Since the design is regular, there is a great number of ways to partition this 64-core CPU effectively.

For example, the inner and outer gray squares each form a 32-core partition with 4 well-connected CPU chiplets, where each chiplet is directly connected to its closest neighbours, with an extra hop to the chiplet at the opposite side. Alternatively, each quadrant forms a fully-connected partition with 3 CPU chiplets, allowing the CPU to be partitioned into two 24-core partitions, plus e.g. two 8-core partitions using the remaining cores. Further alternatives are four fully-connected 16-core partitions (two chiplets), eight 8-core partitions (one chiplet), sixteen 4-core partitions (one CCX), etc. With all these options, the chip should be a dream for virtualisation.

Regarding future extension, it seems obvious where the extra CCXs would go. With the potential for 4 fully-connected CCXs per chiplet, the design would top out at 128 cores per CPU. To feed that beast you would want a doubling in bandwidth, which hopefully will be possible with DDR5 and/or HBM (e.g. imagine 4 GB of HBM mounted on top of each CPU chiplet).

[Attached diagram: revised 9-chiplet topology]
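To make the hop-count and partitioning claims concrete, here is a small sketch encoding one plausible reading of the diagram: 8 CPU chiplets on a ring, each wired to its two nearest and two next-nearest neighbours plus the System Controller (five ports). The adjacency is my own guess at the layout, not anything confirmed; a BFS over it reproduces the "direct neighbour vs. one extra hop to the opposite side" behaviour and the fully-connected three-chiplet quadrants described above.

```python
# Hypothetical adjacency for the 9-chiplet layout described above: chiplets 0-7
# plus a central System Controller ("SC"). Each chiplet links to its ring
# neighbours at distance 1 and 2, and to the SC -- five ports per chiplet.
from collections import deque

adj = {node: set() for node in list(range(8)) + ["SC"]}
for i in range(8):
    for step in (1, 2):
        j = (i + step) % 8
        adj[i].add(j)
        adj[j].add(i)
    adj[i].add("SC")
    adj["SC"].add(i)

def hops(src, dst):
    """Shortest hop count between two nodes (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(hops(0, 2))  # 1 hop: neighbours within the same 4-chiplet "square"
print(hops(0, 4))  # 2 hops: the chiplet on the opposite side needs an extra hop
print(hops(0, 1))  # 1 hop: chiplets 0, 1 and 2 form a fully-connected 24-core triple
```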
 
Last edited: