64 core EPYC Rome （Zen2）Architecture Overview？

DrMrLordX · Sep 28, 2018

kokhua said:
Oh, I see what you mean.

You realize that EPYC has all 8 memory channels enabled, unlike TR2, right? Just sayin.

Yes, that is my point.

Why are you complaining about a workstation/enthusiast CPU on the TR2 platform being inappropriate for server applications when the CPU is not a server CPU, and the platform is not a server platform? It makes no sense.

kokhua · Sep 28, 2018

DrMrLordX said:
Yes, that is my point.

Why are you complaining about a workstation/enthusiast CPU on the TR2 platform being inappropriate for server applications when the CPU is not a server CPU, and the platform is not a server platform? It makes no sense.

Ah, OK. I think I misunderstood you and you misunderstood me.

If you follow the whole chain of comments, you will realize that I am not complaining about TR2 is not being inappropriate for server applications, it obviously wasn't intended for that in the first place. I am saying that a CRIPPLED DESIGN (referring to the earlier mentioned 8C/16T CPU with 1ch DDR4 and also TR2) is not acceptable for server CPUs.

I thought you were deliberately being snarky with your first comment. Apologies.

DrMrLordX · Sep 28, 2018

Oh, okay.

I agree that they will not move to a design that would effectively limit a CPU like Matisse to a single-channel configuration. In the end, their decision to or not to differentiate between server and client CCX design will be related to cost/benefit anaylsis: what can they afford, and how much do they stand to profit from the additional expense of maintaining two CCX designs?

kokhua · Sep 28, 2018

DrMrLordX said:
Oh, okay.

I agree that they will not move to a design that would effectively limit a CPU like Matisse to a single-channel configuration. In the end, their decision to or not to differentiate between server and client CCX design will be related to cost/benefit anaylsis: what can they afford, and how much do they stand to profit from the additional expense of maintaining two CCX designs?

I don't really have a very strong opinion on this. I'm sure AMD knows what's best for them.

I am mainly interested to find out if anyone here has thought about what ROME might look like given the 9 dies conundrum. Since a few people here have dismissed it as BS, I presume they must have at least given it some serious thought. I'd like very much to know what they think is wrong with the diagram.

DrMrLordX · Sep 28, 2018

The only thing I see "wrong" with the Rome layout is that they have moved the memory controller away from the CCX. That effectively prevents them from using the same dice in a Matisse product, unless they intend to go with a "chiplet" design in the client CPUs as well.

Right now AMD has two dice: the CPU dice you get in everything except their APUs, and then the APU dice. And they had to make a separate APU die just to include Vega. I do not think AMD was seriously entertaining the notion of something along the lines of KabyLake-G for their own products.

Anyway, setting aside the APUs, all AMD products are nothing more than constant repetition of the CPU dice. Want more cores? Then add more dice. It allows them to keep the CPUs relatively simple in terms of packaging. The 2990WX is sort of an outlier since it is basically an EPYC with two of the dice not linked to DIMM slots on the board (yay product differentiation). But it's still just four Zen+ dice, regardless.

If we are to believe in the diagram from the OP, now you have a situation where every CPU based on Zen2 will have a minimum of two dice, assuming AMD wants to stick with the "interchangeable parts" strategy. For example, they can ill afford to produce one Zen2 die for Matisse that is one "chiplet" plus a dumbed-down version of the central die from the diagram (one without l4, no SERDES support, and a memory controller with two channels instead of eight). The cost appeal of Zen from the beginning is, again, repetition of the same die design, over and over again. Rome itself would have nine different dice (8 CCX dice and the central l4/IMC die), none of which they could use in client products.

AMD would need to use common CCX dice while altering the central "control" die based on the application. So for example, we get the heavy I/O and major memory bandwidth of the Rome die, but the Matisse die would be smaller and more pedestrian. Then they would link it (Matisse "central"/SoC die) to a single CCX die via IF, meaning a minimum of two dice for any Zen2 product. That introduces the potential for higher memory latency and other "fun" latency effects by moving all the SoC functions to a separate die, connected by IF. And now we also have the potential for high memory latency, the likes of which we currently only see on the 2990WX when attempting to access main memory from a thread pegged to one of the dice with a crippled DDR4 interface.

jpiniero · Sep 28, 2018

DrMrLordX said:
The only thing I see "wrong" with the Rome layout is that they have moved the memory controller away from the CCX.

I don't think this is happening, at least with Rome. The CPU die still has the memory controller. Separating out the memory controller is inevitable though.

kokhua · Sep 28, 2018

DrMrLordX said:
The only thing I see "wrong" with the Rome layout is that they have moved the memory controller away from the CCX. That effectively prevents them from using the same dice in a Matisse product, unless they intend to go with a "chiplet" design in the client CPUs as well.

Right now AMD has two dice: the CPU dice you get in everything except their APUs, and then the APU dice. And they had to make a separate APU die just to include Vega. I do not think AMD was seriously entertaining the notion of something along the lines of KabyLake-G for their own products.

Anyway, setting aside the APUs, all AMD products are nothing more than constant repetition of the CPU dice. Want more cores? Then add more dice. It allows them to keep the CPUs relatively simple in terms of packaging. The 2990WX is sort of an outlier since it is basically an EPYC with two of the dice not linked to DIMM slots on the board (yay product differentiation). But it's still just four Zen+ dice, regardless.

If we are to believe in the diagram from the OP, now you have a situation where every CPU based on Zen2 will have a minimum of two dice, assuming AMD wants to stick with the "interchangeable parts" strategy. For example, they can ill afford to produce one Zen2 die for Matisse that is one "chiplet" plus a dumbed-down version of the central die from the diagram (one without l4, no SERDES support, and a memory controller with two channels instead of eight). The cost appeal of Zen from the beginning is, again, repetition of the same die design, over and over again. Rome itself would have nine different dice (8 CCX dice and the central l4/IMC die), none of which they could use in client products.

AMD would need to use common CCX dice while altering the central "control" die based on the application. So for example, we get the heavy I/O and major memory bandwidth of the Rome die, but the Matisse die would be smaller and more pedestrian. Then they would link it (Matisse "central"/SoC die) to a single CCX die via IF, meaning a minimum of two dice for any Zen2 product. That introduces the potential for higher memory latency and other "fun" latency effects by moving all the SoC functions to a separate die, connected by IF. And now we also have the potential for high memory latency, the likes of which we currently only see on the 2990WX when attempting to access main memory from a thread pegged to one of the dice with a crippled DDR4 interface.

Your comments are right on. That's why it is such a conundrum. I started with the assumption that ROME is 8 CPU dies + 1 I/O die (as the very credible rumors go), and tried to guess what it might look like. I still cannot figure any other way that would make sense. I explained the problems and how I arrived at this diagram in an earlier comment if you haven't seen it.

With regards to Ryzen (Matisse) being a completely different die. That is what I believe. By now, AMD is ready and able to spend a little more R&D money and take some more risk.

kokhua · Sep 28, 2018

jpiniero said:
I don't think this is happening, at least with Rome. The CPU die still has the memory controller. Separating out the memory controller is inevitable though.

Can you explain why you believe that the memory controller will still be on the CPU die in ROME, and why you think separating it out is inevitable?

coercitiv · Sep 28, 2018

DrMrLordX said:
The only thing I see "wrong" with the Rome layout is that they have moved the memory controller away from the CCX. That effectively prevents them from using the same dice in a Matisse product, unless they intend to go with a "chiplet" design in the client CPUs as well.

As soon as I heard the rumors I wondered the same thing, but then I thought the CCX may still keep the MC, only use it in low chip count SKUs. Something along the lines of the current TR2 SKUs having all MCs disabled and all chips being fed externally through IF.

Normally it would seem like a waste, but considering all consumer products would use that silicon area, "wasting" it for high margin products like high-end server chips would not be a problem.

jpiniero · Sep 28, 2018

kokhua said:
Can you explain why you believe that the memory controller will still be on the CPU die in ROME, and why you think separating it out is inevitable?

The latency hit is too much at this point to have it off die. It's inevitable that it will be separated out since they should be able to deal with that eventually, with something like EMIB or the Active Interposer.

kokhua · Sep 28, 2018

jpiniero said:
The latency hit is too much at this point to have it off die. It's inevitable that it will be separated out since they should be able to deal with that eventually, with something like EMIB or the Active Interposer.

I explained how this architecture might achieve low latency earlier, possibly without needing silicon interposer or EMIB.

kokhua · Sep 28, 2018

coercitiv said:
As soon as I heard the rumors I wondered the same thing, but then I thought the CCX may still keep the MC, only use it in low chip count SKUs. Something along the lines of the current TR2 SKUs having all MCs disabled and all chips being fed externally through IF.

Normally it would seem like a waste, but considering all consumer products would use that silicon area, "wasting" it for high margin products like high-end server chips would not be a problem.

You would have to duplicate more than just the MCs. The PCIe, SATA, USB, South Bridge, Management Processor, etc, will all have to be duplicated as well. The CPU die size will double.

coercitiv · Sep 28, 2018

kokhua said:
You would have to duplicate more than just the MCs. The PCIe, SATA, USB, South Bridge, Management Processor, etc, will all have to be duplicated as well. The CPU die size will double.

And it would only be wasted on the high-end server products. It's either that or make separate dies for server, or risk even more by using the "chiplet" approach in consumer products as well. Every option comes with it's own set of risks / restrictions.

dacostafilipe · Sep 28, 2018

IMO, I think that the main issue with the chiplet design would be latency, but this could be reduced by using an interposer instead of connecting everything via the substrate.

It would mean that those chiplets would not be used for Ryzen as those chiplet's would be stripped of all the needed SoC parts (MC,VDD,IO,...), but it would be great for costs and flexibility (CPU only, CPU + GPU, CPU + FPGA, ...)

Obviously, those chiplets would not be used for Ryzen but I could see AMD stripping the CPU-only Ryzen design and only produce a Ryzen 8C APU for the 7nm.

kokhua · Sep 28, 2018

Does anyone still think that my diagram is BS, and would care to point out my mistakes?

Does anyone have any alternative architecture for ROME that they would like to share?

moinmoin · Sep 28, 2018

As for me I'm just highly skeptical about the whole rumor, not the diagram alone. I previously expected chiplets to come, but as part of Zen 3 designs not for Zen 2 already.

DrMrLordX said:
Anyway, setting aside the APUs, all AMD products are nothing more than constant repetition of the CPU dice. Want more cores? Then add more dice. It allows them to keep the CPUs relatively simple in terms of packaging. The 2990WX is sort of an outlier since it is basically an EPYC with two of the dice not linked to DIMM slots on the board (yay product differentiation). But it's still just four Zen+ dice, regardless.

2990WX is not even an outlier, the uncore on Zeppelin is essentially a Swiss knife where never everything is used. Even in the best case of Epyc one of the Serdes is unused per die for optimal routing length. And the lower the food chain the die goes the more of the uncore is gated. 2990WX was an outlier in that a memory controller is being disabled, but Athlon 200GE joined that approach.

kokhua · Sep 28, 2018

moinmoin said:
As for me I'm just highly skeptical about the whole rumor, not the diagram alone.

For me, when the rumor that ROME would move from 4 dies to 9 dies first surfaced, I was not skeptical but troubled. Not skeptical, because multiple sources with impeccable track records said the same thing. Troubled, because I couldn't make sense of the technical trade-offs that would make moving from 4 dies to 9 dies feasible or worthwhile.

AdoredTV's latest video added a few pieces of info: (i) ROME will be a completely new design from the ground up, (ii) AMD will drop NUMA altogether, and (iii) ROME will support 4P configuration.

Piecing all the rumors together, I was finally able to come up with an architecture that explains why AMD would choose to move from 4 dies to 9 dies. If ROME is really like what I described in the diagram, it would give Intel's Cascade Lake and Cooper Lake a very serious run.

Of course, there are a million ways to do the same thing. So in all likelihood, I will be completely wrong.

Glo. · Sep 28, 2018

My biggest question mark: Are CPUs for Rome and Matisse separate design? That is the only thing worth considering right now.

Oh, and Mr. Chia - I never thought your diagram is BS

.

Topweasel · Sep 28, 2018

kokhua said:
I don't understand what you mean by "comm chiplet". But if you are suggesting a monolithic design for Ryzen is better, I totally agree.

IO. And no exact opposite. Ryzen for example only as 24 PCIe lanes available (socket decision) and has a bunch of interconnect stuff to be able to talk to other dies. None of them are useful on Ryzen. So if you remove everything. Memory controller, PCIE connections, pretty much all uncore features. Put them on a IO or communication chip and have the communication chip be only as large as it needs to be for the feature set of Ryzen. Ryzen 4k (because I doubt Zen 2 is chiplet ready imho) would have a much smaller die size for both the core chip and the com chip then lets say Ryzen 3k with Zen 2 which would be a shrunk down SR/PR (with more cores again my opinion). So whatever Zen 2 dies look like Ryzen 4k would have same feature set, but the die size of both chiplets would be smaller than Zen 2.

So then AMD can make a comm IO chip for ThreadRipper (still probably no cache, 4 Memory controllers, 64 Lanes). Then they can make a couple of different ones for EPYC, maybe 1 with 8 controllers, 128 PCIe lanes, no cache. Next one add 256MB of L4, next one add 512MB of L4. This allows AMD to continue to maximize the flexibility of zen dies by market demand. But give them the ability to configure, adjust, and specialize the CPU for the market. The IO Chiplet would also be much less complex than one that includes CPU cores. Meaning easier design. Cheaper and stock would be easier to control. To top it off. It wouldn't even have to be 7nm. This could be how they maintain the WSA, all the chiplets could be 12nm stuff from GF.

kokhua · Sep 28, 2018

Glo. said:
My biggest question mark: Are CPUs for Rome and Matisse separate design? That is the only thing worth considering right now.

My answer is yes. I think AMD is ready for this. If I were AMD, my Zen2 family product lineup might look something like this:

EPYC2/TR3: 8x 8C/16T CPU dies + 1 SC die for 64C SKUs. Can use fewer CPU dies instead of salvaged dies for lower core-count SKUs, no crippling required.

Ryzen Desktop and Notebooks: Different 8C/16T APU, i.e. bring back integrated GPU to mainstream desktop CPUs. A competitive CPU paired with a superior GPU would be a win for AMD. Fuse off features for product segmentation.

All in all, just 3 unique dies are needed:

1. 8C/16T cores-only CPU die
2. SC die
3. 8C/16T APU die

Glo. said:
I never thought your diagram is BS .

I wouldn't be offended one bit even if you did. As long as you have actually thought about the problem and are willing to share with me why you believe it is BS.

kokhua · Sep 28, 2018

Topweasel said:
IO. And no exact opposite. Ryzen for example only as 24 PCIe lanes available (socket decision) and has a bunch of interconnect stuff to be able to talk to other dies. None of them are useful on Ryzen. So if you remove everything. Memory controller, PCIE connections, pretty much all uncore features. Put them on a IO or communication chip and have the communication chip be only as large as it needs to be for the feature set of Ryzen. Ryzen 4k (because I doubt Zen 2 is chiplet ready imho) would have a much smaller die size for both the core chip and the com chip then lets say Ryzen 3k with Zen 2 which would be a shrunk down SR/PR (with more cores again my opinion). So whatever Zen 2 dies look like Ryzen 4k would have same feature set, but the die size of both chiplets would be smaller than Zen 2.

So then AMD can make a comm IO chip for ThreadRipper (still probably no cache, 4 Memory controllers, 64 Lanes). Then they can make a couple of different ones for EPYC, maybe 1 with 8 controllers, 128 PCIe lanes, no cache. Next one add 256MB of L4, next one add 512MB of L4. This allows AMD to continue to maximize the flexibility of zen dies by market demand. But give them the ability to configure, adjust, and specialize the CPU for the market. The IO Chiplet would also be much less complex than one that includes CPU cores. Meaning easier design. Cheaper and stock would be easier to control. To top it off. It wouldn't even have to be 7nm. This could be how they maintain the WSA, all the chiplets could be 12nm stuff from GF.

That's a lot of different dies to make. You know why I disagree. I think desktop is extremely cost sensitive and you can't beat a reasonably small monolithic die when it comes to performance and cost. Time will tell.

Topweasel · Sep 28, 2018

kokhua said:
That's a lot of different dies to make. You know why I disagree. I think desktop is extremely cost sensitive and you can't beat a reasonably small monolithic die when it comes to performance and cost. Time will tell.

Yeah but you can't base the idea of Desktop CPU pricing on Intel's pricing. AMD is making a pretty decent amount of money on competitively priced products with a 200+mm die. They could still do a 16c die in 7nm have it be 150mm die, then a 30-40mm communication die and still be smaller than they are now die wise. Or even go 60mm be slightly larger have that on 12nm and be much cheaper than than doing a 190-200mm die mono die on 7nm.

It several smaller and less complex dies to make, doesn't need to be on same process. While letting them still bin and get great yields on the main core dies. I don't see how they couldn't do that and possibly be more profitable then they currently are now.

kokhua · Sep 28, 2018

Topweasel said:
Yeah but you can't base the idea of Desktop CPU pricing on Intel's pricing. AMD is making a pretty decent amount of money on competitively priced products with a 200+mm die. They could still do a 16c die in 7nm have it be 150mm die, then a 30-40mm communication die and still be smaller than they are now die wise. Or even go 60mm be slightly larger have that on 12nm and be much cheaper than than doing a 190-200mm die mono die on 7nm.

It several smaller and less complex dies to make, doesn't need to be on same process. While letting them still bin and get great yields on the main core dies. I don't see how they couldn't do that and possibly be more profitable then they currently are now.

Believe me, multiple dies will not be cheaper or better performing than monolithic dies of reasonable size (~200mm^2 or less). Also, I think desktop CPUs would be limited 8C because of 2ch memory bandwidth constraints. As a rule of thumb, you need 2.5GB/s per core. You could push 12C but that would be really stretching it to the extreme.

LightningZ71 · Sep 28, 2018

I think we're on to something with this discussion. I see their product stack shaping up like this:
A pair of IO chips with a smaller one for desktop usage and a larger one for TR/Server usage.
A base CPU chip with 8 cores and the needed glue logic to connect them to an I/O chip
An APU with 4-8 cores and an iGPU

Desktop AM4 will be a tiny 7nm CPU chip with a 12nm small I/O chip
TR will be 2-4 CPU chips with a large 12nm I/O chip
Epyc will be 4-8 CPU chips with a large 12nm I/O chip
APUs are stand alone monolithic designs and cover the low to mid part of the market.

maddie · Sep 28, 2018

kokhua said:
My answer is yes. I think AMD is ready for this. If I were AMD, my Zen2 family product lineup might look something like this:

EPYC2/TR3: 8x 8C/16T CPU dies + 1 SC die for 64C SKUs. Can use fewer CPU dies instead of salvaged dies for lower core-count SKUs, no crippling required.

Ryzen Desktop and Notebooks: Different 8C/16T APU, i.e. bring back integrated GPU to mainstream desktop CPUs. A competitive CPU paired with a superior GPU would be a win for AMD. Fuse off features for product segmentation.

All in all, just 3 unique dies are needed:

1. 8C/16T cores-only CPU die
2. SC die
3. 8C/16T APU die

I wouldn't be offended one bit even if you did. As long as you have actually thought about the problem and are willing to share with me why you believe it is BS.

Are you claiming that the only desktop die is an APU class product and that all desktop products will derive from this?

Implied in this is that there will be no 7nm desktop products until late next year, because if I'm not mistaken, there will be no 7nm APU refresh until that timeframe.

64 core EPYC Rome （Zen2）Architecture Overview？

Lifer

Member

Lifer

Member

Lifer

Lifer

Member

Member

Diamond Member

Lifer

Member

Member

Diamond Member

Senior member

Member

Diamond Member

Member

Diamond Member

Diamond Member

Member

Member

Diamond Member

Member

Platinum Member

Diamond Member