64-core EPYC Rome (Zen 2) Architecture Overview?


Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
That's assuming AMD does the strategy of reusing server dies for consumer again. I am not convinced of this, for several reasons.

However I'm not going to elaborate further since the information is behind a paywall.

Personally I expect server dies to diverge from consumer dies in Zen 2.
Nobody has said that you cannot have an 8C/16T Matisse CPU design and be limited to 8C/16T on the AM4 platform ;).

You can still make 16C/32T AM4 CPUs on the same principle, from server CPUs ;).
 

jpiniero

Lifer
Oct 1, 2010
14,591
5,214
136
One option, I suppose, would be to stick 2 memory controllers on the CPU die, even if it is 8 cores and not 16. That would certainly give them more options: 32-core-or-less Epycs could be 4 dies instead of 8, if wiring like that isn't such a problem.
 

Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
One option, I suppose, would be to stick 2 memory controllers on the CPU die, even if it is 8 cores and not 16. That would certainly give them more options: 32-core-or-less Epycs could be 4 dies instead of 8, if wiring like that isn't such a problem.
AdoredTV has made a video about those CPUs and claimed that his source talked about 5 dies on the EPYC 2 package.

AMD can do 5 dies (4+1), 7 dies (6+1), or 9 dies (8+1), and all of those configs are equally viable options for segmentation purposes. Now you have to ask: if that extra die is important for EPYC 2, why would it not be important for Matisse and Threadripper CPUs?

If that chip allows CPUs to be properly fed with work in multi-chip-module configurations ;), there is no reason NOT to use the same design in the AM4 and Threadripper lineups.

If Matisse is really small, like 70-80 mm², two of them will still be MUCH smaller than the Zeppelin design.

And at 80 mm² we are very close to the transistor count of the A12 Bionic: 6.9 bln, which is almost 50% more than Zeppelin had. So the actual die may well end up with fewer than 6.9 bln transistors.
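For what it's worth, here is the back-of-the-envelope math behind that, as a quick Python sketch. The A12 (6.9 bln transistors, ~83 mm²) and Zeppelin (~4.8 bln) figures are the publicly quoted ones; assuming an AMD 7 nm CPU die packs at the same density as a mobile SoC is purely my assumption.

```python
# Rough density math; the 7 nm density is borrowed from the A12 and is
# only an assumption for a high-frequency CPU die.
a12_transistors, a12_area_mm2 = 6.9e9, 83.0   # public A12 Bionic figures
zeppelin_transistors          = 4.8e9         # public Zeppelin figure

density_7nm = a12_transistors / a12_area_mm2  # ~83 Mtr/mm^2
for area in (70, 80):
    print(f"{area} mm^2 at A12-like density -> {area * density_7nm / 1e9:.1f} bln transistors")

print(f"A12 vs Zeppelin: +{(a12_transistors / zeppelin_transistors - 1) * 100:.0f}% transistors")
```

So a 70-80 mm² die lands at roughly 5.8-6.6 bln transistors under that assumption, which is why I say it will probably come in under the A12's count.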
 

kokhua

Member
Sep 27, 2018
86
47
91
I don't know if AMD would do MCM on desktop. I don't know if AMD is anywhere near chiplet level just yet, which makes a lot of the "you know this is the case" stuff weird. We don't know if Matisse is 8C only or if Rome is X+1. There are a lot of "certainties" that are not so.

As I said, my diagram is predicated on the rumors that Rome is 64C, 9-die. If the rumors turn out to be false (highly unlikely, imo), then the diagram is meaningless. But if they are true, then as I explained, it follows that Ryzen (Matisse) should be a different die.

But in a world where AMD is ready to go the chiplet route, the advantage would be die space. Bunches of smaller dies mean better yields. Ryzen is very wasteful in that sense: it carries IO that either A) is only needed because it is the sole chip in a CPU, and is therefore needlessly duplicated on EPYC, or B) is wasted in a single-chip CPU because it obviously targets workstation and EPYC loads with multiple chips.

Yes, I agree. This argues even more for Ryzen being a separate die, optimized for desktop only.

A comm chiplet would be a lot less complicated. The larger ones would have a few more interconnects and basically just cache. So while it is, yes, "multiple dies", it would require less time and effort to manufacture. On Ryzen you would have a really, really small comm chiplet in comparison, and in overall die usage the two could even get away with being smaller than if all the functionality were in one die like it is right now (assuming no comm chiplet for EPYC using the same dies). Going chiplet would give AMD exactly what they sought with the Zen and EPYC design: they can work out the downsides of IF and MCM, while still maintaining the complete flexibility of die assignments they have now.

I don't understand what you mean by "comm chiplet". But if you are suggesting a monolithic design for Ryzen is better, I totally agree.
 

kokhua

Member
Sep 27, 2018
86
47
91
Allow me to give a fuller explanation of how I arrived at the diagram:

Ever since the rumor that ROME will move to a 9-die configuration surfaced in July, I've been trying to make sense of why AMD might decide to do that. I had assumed that smaller CPU dies on an immature 7nm process were the main motivation, and that ROME would follow the same basic architecture as NAPLES, just extended to 8 CPU dies instead of 4. I also imagined the I/O die to be a simple chip containing only the Management/Security processor, South Bridge, and maybe some PCIe lanes.

But the tradeoffs didn't make sense:

1. 8ch DDR4 does not have sufficient bandwidth to feed 64 cores. This relates to 64C, not specifically 9-die. (Rough numbers in the sketch after this list.)

2. 1ch DDR4 per CPU die is a severe bottleneck. An active core can only access 1ch of memory at any time, whether local or multiple hops away. With only 1ch, there is also no opportunity to implement techniques like bank interleaving to hide DRAM latency. Memory utilization efficiency will be very poor.

3. Too many IF interconnects (at least 7 per CPU die) are required to have a fully connected system for 8 CPU dies. This presents major power consumption and latency issues.

4. Packaging issues: the complex interconnections between the CPU and I/O dies will practically necessitate the use of expensive silicon interposers.
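To put rough numbers on points 1 and 3 (DDR4-3200 and a fully connected mesh are my assumptions here, not known ROME specs):

```python
# Point 1: 8 channels of DDR4-3200 shared by 32 vs 64 cores.
channels, rate, bus_bytes = 8, 3.2e9, 8          # 64-bit channels, transfers/s
bw_total = channels * rate * bus_bytes           # bytes/s
for cores in (32, 64):
    print(f"{cores} cores: {bw_total / 1e9:.0f} GB/s total, {bw_total / cores / 1e9:.1f} GB/s per core")

# Point 3: a fully connected topology of n CPU dies needs n-1 links per die.
for n in (4, 8):
    print(f"{n} dies fully meshed: {n - 1} links per die, {n * (n - 1) // 2} links in total")
```

Per-core bandwidth halves versus a 32C NAPLES on the same 8 channels, and going from 4 to 8 fully meshed dies jumps from 3 to 7 links per die (6 to 28 links in total).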


Then AdoredTV released a video titled "Intel's Epyc Battle, AMD heads to the Moon" on Sep 14. This video offered some clues. In it, Jim mentioned a couple of important things: (a) ROME is a completely new design, and (b) AMD will move away from NUMA altogether. The UMA rumor, in particular, finally allowed me to unravel the conundrum of 9-die ROME. Referring to my diagram:

1. It would mean that all the memory channels must move to the System Controller or SC die (I prefer to call it that), and you now have one big shared memory space served by 8 DDR4 memory controllers. To answer the question of 8ch DDR4 feeding 64C, I added an L4 eDRAM cache and memory compression. This is purely speculation on my part. Just collecting all 8 memory controllers together would allow much more flexibility in optimizing the memory controller architecture to improve utilization efficiency; the L4 cache and/or memory compression may not be needed. (A toy latency model for the L4 follows at the end of this post.)

2. The problem of 1ch DDR4 per CPU die no longer applies.

3. In this architecture, there is no need for IF links to connect up the CPU dies; the cache coherent network on the SC die takes care of that. The link between the CPU dies and SC die must be very low latency. IF serial links may not be appropriate because the SERDES latency would be directly in the memory data path. A wide, high speed parallel link may be more appropriate. This could simply be a parallel version of IF.

4. Serial IF links are still needed for inter-socket connections for 2P configurations. The appropriate thing to do is to move them to the SC die as well.

5. It then follows that the PCIe Gen4 lanes will also move to the SC die since they share the same multi-mode SERDES as IF.

6. All the duplicated blocks like the Management/Security Processor, Server Controller Hub (aka south bridge), etc. get eliminated, leaving the CPU die with only the cores.

7. There are seemingly many packaging options:

(a) Organic MCM. Since the connections between the CPU dies and the SC die are now very short and direct (at most 2-3 mm), and are located at the edge of the dies, the drivers can be very low power. Organic MCM might be sufficient for the job. This would be the cheapest option.

(b) Passive interposer. Similar to (a) but using a passive silicon interposer in place of the organic substrate. It would offer better performance than organic MCM but is also much more expensive. The interposer size would exceed the reticle limit and require stitching. In this case, the SC die cannot be made too large as it would simply make the interposer even bigger. This means a meaningfully large L4 cache may not be practical. It seems like paying a high price without getting a commensurate payback in performance. I think this option is overkill.

(c) Active interposer. In this case, the CPU dies will be stacked on top of an active interposer which is also the SC die. The interposer will be large but will not exceed reticle limit. Normally there is no need to use 14nm node to make this interposer. But if the rumor that the SC die uses 14nm is true, then you might as well make full use of the area available by adding a large L4 eDRAM cache. The result would be a monster! The L4 cache would mitigate the increased memory latency resulting from moving the memory controller off-die and the limited bandwidth of 8ch DDR4.

(d) EMIB. Intel's EMIB looks like the perfect packaging option for connecting the CPU dies to the SC die. But obviously AMD can't use EMIB. I consulted someone in the packaging business and he told me that there are currently no commercially available alternatives to EMIB. Even though Intel claims that EMIB is theoretically able to accommodate up to 8 bridges per die, in practice it is very difficult to achieve perfect alignment with more than a couple of bridges; yields will be very bad. Interestingly, AMD has a patent that describes an alternative to EMIB: https://patents.google.com/patent/US20180102338A1/en?oq=20180102338 …. But it is not clear if this is what they will use.
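And here is the toy latency model I mentioned under point 1. Every number in it (L4 latency, DRAM latency with the controller on the SC die, hit rates) is a placeholder I picked to illustrate the trade-off, not an AMD figure:

```python
# Average latency seen by an L3 miss when a speculative L4 sits in front of DRAM.
def amat(hit, l4_ns, dram_ns):
    # Hit: pay the L4 lookup. Miss: pay the L4 lookup plus the DRAM access.
    return hit * l4_ns + (1.0 - hit) * (l4_ns + dram_ns)

l4_ns, dram_ns = 40.0, 110.0   # assumed eDRAM L4 and off-die DRAM latencies
for hit in (0.3, 0.5, 0.7):
    print(f"L4 hit rate {hit:.0%}: {amat(hit, l4_ns, dram_ns):.0f} ns (vs {dram_ns:.0f} ns with no L4)")

print(f"break-even hit rate: {l4_ns / dram_ns:.0%}")   # the L4 only helps above this
```

So the L4 only pays for itself once its hit rate clears roughly l4_ns/dram_ns; below that, it just adds a lookup to every miss. That is part of why I hedge on whether the L4 and/or compression are actually needed.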
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
Yes, a better reason is definitely needed for avoiding a separate design for monolithic Ryzen. On the other hand, I can think of several reasons to do it:

1. Ryzen will need to beat, or at least match, Intel's Coffee Lake Refresh on IPC; monolithic design without the latency trade-offs of MCM has a much better chance of doing that.

2. A monolithic die for desktop is not all that difficult or costly to design given Zen's lego-like modular architecture.

3. Whatever one-time cost savings you get by avoiding a monolithic design, you end up paying for via the MCM cost-adder on relatively high-volume desktop CPUs, many times over.

4. The SC (or I/O) die is rumored to be 14nm; that should help fulfill the WSA commitments.
1) This is from the AMD patent "Enabling Interposer-based Disintegration of Multi-core Processors", Section 2.2 quoted in full. There are negligible latency delays when using an SI interposer; I have seen ~1 ns quoted in a Xilinx paper a few years ago. This is completely different from the MCM approach.


"We consider four technologies to reassemble multiple smaller chips into a larger system: multi-socket boards, multi-chip modules (MCM), silicon interposer-based integration (2.5D), and vertical 3D chip stacking.

Multi-socket:
Symmetric multi-processing (SMP) systems on multiple sockets have been around for decades. The primary downsides are that the bandwidth and latency between sockets is worse than some of the packaging technologies discussed below (resulting in a higher degree of non-uniform memory accesses or NUMA). The limitation comes from a combination of the limit on the per-package pin count and the intrinsic electrical impedance between chips (e.g., C4 bumps, package substrate metal, pins/solder bumps, printed circuit board routing).

Multi-chip Modules:
MCMs take multiple chips and place them on the same substrate within the package. This avoids the pin count limitations of going package-to-package, but the bandwidths and latencies are still constrained by the C4 bumps and substrate routing that connect the silicon die.

Silicon Interposers:
A silicon interposer is effectively a large chip upon which other smaller die can be stacked. The micro-bumps (μbumps) used to connect the individual die to the interposer have greater density than C4 bumps (e.g., ∼9× better assuming 150μm and 50μm pitches for C4 and μbumps), and the impedance across the interposer is identical to conventional on-chip interconnects (both are made with the same process). The main disadvantage is the cost of the additional interposer.


3D Stacking:
Vertical stacking combines multiple chips, where each chip is thinned and implanted with through-silicon vias (TSV) for vertical interconnects. 3D has the highest potential bandwidth, but has the greatest cost and overall process complexity as nearly every die must be thinned and processed for TSVs.

The SMP and MCM approaches are less desirable as they do not provide adequate bandwidth for arbitrary core-to-core cache coherence without exposing significant NUMA effects, and they also have higher energy-per-bit costs compared to the die-stacked options. As such, we do not consider them further. 3D stacking by itself is not (at least at this time) as an attractive of a solution because it is more expensive and complicated, introduces potentially severe thermal issues, and may be an overkill in terms of how much bandwidth it can provide. This leaves us with silicon interposers.


2) Probably true, but remember that mask costs for 7nm are considered to be 3x the 14nm ones. What would you rather have? Two 7nm designs and one 12/14nm, or one 7nm and two 12/14nm?

3) Also true, but unless we see the actual cost numbers, this is too vague as to the crossover point. What if you can produce 50 million dies before the initially cheaper path becomes more expensive than the other? If the additional cost is low in absolute terms (a few $/CPU), then this is not a problem, considering that the lower initial investment also means lower risk. (A crude crossover model at the end of this post.)

4) Agreed.
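To make that crossover point concrete, here is a crude model. Every figure in it is a made-up placeholder (the real NRE and packaging numbers are not public), so treat it as illustrating the shape of the argument, not the actual numbers:

```python
# One extra 7nm design (masks + engineering) vs a per-unit multi-die packaging adder.
extra_nre_7nm = 150e6            # assumed one-time cost of a second 7nm die design
for adder in (2.0, 4.0, 8.0):    # assumed extra $/CPU for the multi-die package
    crossover = extra_nre_7nm / adder
    print(f"${adder:.0f}/CPU adder -> multi-die stays cheaper up to ~{crossover / 1e6:.0f} M units")
```

At a few dollars per CPU the crossover sits in the tens of millions of units, which is exactly why the argument can't be settled without the actual cost numbers.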
 

Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
Also, does anyone remember Gen-Z interconnect? ;)

Who is a founder of this initiative? :)

Hint: the company's name starts with "A", ends with "D" and is a three letter acronym ;).
 

Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
(c) Active interposer. In this case, the CPU dies will be stacked on top of an active interposer which is also the SC die. The interposer will be large but will not exceed reticle limit. Normally there is no need to use 14nm node to make this interposer. But if the rumor that the SC die uses 14nm is true, then you might as well make full use of the area available by adding a large L4 eDRAM cache. The result would be a monster! The L4 cache would mitigate the increased memory latency resulting from moving the memory controller off-die and the limited bandwidth of 8ch DDR4.
The rumor of a 14 nm die is just Jim's assumption, not something he got any info on.
 

kokhua

Member
Sep 27, 2018
86
47
91
1) This is from the AMD patent "Enabling Interposer-based Disintegration of Multi-core Processors", Section 2.2 quoted in full. There are negligible latency delays when using an SI interposer; I have seen ~1 ns quoted in a Xilinx paper a few years ago. This is completely different from the MCM approach.

I have read the paper and understand the issue. My point remains that a monolithic design for desktop/notebook CPUs is still the most sensible approach from a performance and cost standpoint. The cheapest Ryzen lists for $99. The recently introduced Athlon 200GE lists for $55. No way an interposer can fit into these price points. For high end server CPUs, not an issue. For the 8C EPYCs, I suspect they don't sell many anyway, so just keep the old 8C/16C NAPLES around.

Besides, AM4 is a small PGA package. Really doubtful it can accommodate multi-die designs.
 

kokhua

Member
Sep 27, 2018
86
47
91
Also, does anyone remember Gen-Z interconnect? ;)

Who is a founder of this initiative? :)

Hint: the company's name starts with "A", ends with "D" and is a three letter acronym ;).

Not sure how this is related, but AMD is a member of all three: CCIX, OpenCAPI and Gen-Z. IMO, Gen-Z is far too ambitious and may take many years before anything happens. CCIX is much more likely to see early adoption.
 

jpiniero

Lifer
Oct 1, 2010
14,591
5,214
136
Active Interposer is the way to go, but that is a bit of ways off I imagine.

Intel is headed that way too.
 

kokhua

Member
Sep 27, 2018
86
47
91
BTW, Jim also mentioned a couple of other things in his video:

1. ROME will support 4TB and 32 DIMMs.

I am skeptical about 32 DIMMs because it would mean either 16ch with 2 DIMMs each, or 8ch with 4 DIMMs each. The former seems not possible because the pin count would be huge and it would not be compatible with the SP3 socket. The latter presents a loading issue which, even if possible, would severely limit DRAM clocks. However, 4TB might be possible if you consider that memory compression can give you an "effective 4TB" from 2TB of physical memory. I'm taking this rumor with a big pinch of salt.

2. ROME will support 4-socket configuration.

This is possible with my "architecture". In a 2P configuration, there will be 4 IF links between the 2 sockets, just like NAPLES. In a 4P configuration, each processor will be connected to the other 3 processors via 6 IF links (2 each). 2 IF links to each processor should be adequate since ROME's IF/XGMI links will run at ~25 GT/s compared with NAPLES' ~10 GT/s.
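As a quick sanity check on both points (the 128 GB DIMM size and treating link bandwidth as simply proportional to transfer rate are my assumptions):

```python
# 32 DIMMs of 128 GB each reach the rumored 4 TB.
dimms, dimm_gb = 32, 128
print(f"{dimms} x {dimm_gb} GB = {dimms * dimm_gb // 1024} TB")

# 4P: each socket keeps 2 links to each of the other 3 sockets.
sockets, links_per_pair = 4, 2
per_socket = (sockets - 1) * links_per_pair
print(f"{per_socket} xGMI links per socket, {sockets * per_socket // 2} in the whole system")

# Relative bandwidth per socket pair: 2 links at ~25 GT/s vs NAPLES' 4 links at ~10 GT/s.
print(f"2 x 25 = {2 * 25} vs 4 x 10 = {4 * 10} (arbitrary units)")
```

So a 2-link pair at the rumored rate actually carries a bit more than the 4-link pair does on NAPLES, which is why I think 2 links per pair would be adequate.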

I can't figure out how to attach pictures here so I'll just link to my twitter post for some diagrams: https://twitter.com/chiakokhua/status/1044621035218161664
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,551
14,510
136
BTW, Jim also mentioned a couple of other things in his video:

1. ROME will support 4TB and 32 DIMMs.

I am skeptical about 32 DIMMs because it would mean either 16ch with 2 DIMMs each, or 8ch with 4 DIMMs each. The former seems not possible because the pin count would be huge and it would not be compatible with the SP3 socket. The latter presents a loading issue which, even if possible, would severely limit DRAM clocks. However, 4TB might be possible if you consider that memory compression can give you an "effective 4TB" from 2TB of physical memory. I'm taking this rumor with a big pinch of salt.

2. ROME will support 4-socket configuration.

This is possible with my "architecture". In a 2P configuration, there will be 4 IF links between the 2 sockets, just like NAPLES. In a 4P configuration, each processor will be connected to the other 3 processors via 6 IF links (2 each). 2 IF links to each processor should be adequate since ROME's IF/XGMI links will run at ~25 GT/s compared with NAPLES' ~10 GT/s.

I can't figure out how to attach pictures here so I'll just link to my twitter post for some diagrams: https://twitter.com/chiakokhua/status/1044621035218161664
Just save pictures on a server like IMGUR, and then link to that.
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
I have read the paper and understand the issue. My point remains that a monolithic design for desktop/notebook CPUs is still the most sensible approach from a performance and cost standpoint. The cheapest Ryzen lists for $99. The recently introduced Athlon 200GE lists for $55. No way an interposer can fit into these price points. For high end server CPUs, not an issue. For the 8C EPYCs, I suspect they don't sell many anyway, so just keep the old 8C/16C NAPLES around.

Besides, AM4 is a small PGA package. Really doubtful it can accommodate multi-die designs.
You are confusing CPUs and APUs.

What does Athlon have to do with this? That is a totally separate line, and almost certainly the 4-core Ryzen CPUs can be replaced by the APU line. Why do you think they must sell a harvested 4-core from the 8-core line? There is no need for an interposer solution at those price points, so it's not a problem.

It can very easily accommodate a 2-die design. (50-60) + (100-110) mm² is smaller than the present die. $2 interposer + $2 assembly cost. An interposer can very much fit this price point. No, I don't see this as too expensive or too large, considering you keep GloFo happy, get the lowest possible 7nm die size and cost, and reduce your need for wafers, as these will be in high demand for a while since only TSMC will be offering this node until late next year.
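The yield side of that argument, with a simple Poisson model (the defect densities below are pure guesses for illustration, not foundry data):

```python
import math

def die_yield(area_mm2, d0_per_cm2):
    # Poisson yield model: probability a die of this area has zero defects.
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)

d0_7nm, d0_14nm = 0.30, 0.10   # assumed defects/cm^2: early 7nm vs mature 12/14nm
print(f" 55 mm^2 7nm CPU chiplet:             {die_yield(55, d0_7nm):.0%}")
print(f"105 mm^2 12/14nm IO die:              {die_yield(105, d0_14nm):.0%}")
print(f"210 mm^2 hypothetical monolithic 7nm: {die_yield(210, d0_7nm):.0%}")
```

Two small dies on the appropriate nodes yield far better than one big 7nm die would, which is the core of the cost case, provided the interposer and assembly really do come in that cheap.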
 

kokhua

Member
Sep 27, 2018
86
47
91
You are confusing CPUs and APUs.

What does Athlon have to do with this? That is a totally separate line, and almost certainly the 4-core Ryzen CPUs can be replaced by the APU line. Why do you think they must sell a harvested 4-core from the 8-core line? There is no need for an interposer solution at those price points, so it's not a problem.

It can very easily accommodate a 2-die design. (50-60) + (100-110) mm² is smaller than the present die. $2 interposer + $2 assembly cost. An interposer can very much fit this price point. No, I don't see this as too expensive or too large, considering you keep GloFo happy, get the lowest possible 7nm die size and cost, and reduce your need for wafers, as these will be in high demand for a while since only TSMC will be offering this node until late next year.

Nothing. Just pointing out the price points. You are very optimistic about interposer and assembly/testing costs; those figures are very, very far from what I think I know. I have consulted experts on NAPLES manufacturing cost and was told the cost for the MCM package (not including the dies) is ~$35. And that is for an organic substrate. But I am not equipped with the facts to debate this, so I'll leave it here. Time will tell.

I am more interested in ROME. Someone told me that a thread was started referring to the diagram I drew. I am curious if someone else has been thinking about the same problem and may have come to a different conclusion than me. There are a million ways to do this. In all likelihood, AMD will come up with something completely different. So, I am here to learn.
 

Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
Nothing. Just pointing out the price points. You are very optimistic about interposer and assembly/testing costs; those figures are very, very far from what I think I know. I have consulted experts on NAPLES manufacturing cost and was told the cost for the MCM package (not including the dies) is ~$35. And that is for an organic substrate. But I am not equipped with the facts to debate this, so I'll leave it here. Time will tell.

I am more interested in ROME. Someone told me that a thread was started referring to the diagram I drew. I am curious if someone else has been thinking about the same problem and may have come to a different conclusion than me. There are a million ways to do this. In all likelihood, AMD will come up with something completely different. So, I am here to learn.
And what do you need an interposer for in entry-level CPUs that will be made from a single die? :)
 

Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
Guys, in this patent the cores are numbered Core 0, Core 1, ... Core 7, as if they are 8-core CCXs, or something like that.

But doesn't Windows, or any application that sees the THREADS of the CPU, see them the same way?
T0, T1, T2... T7.

Aren't we looking at Threads, not cores, despite the vagueness of the patent drawings?
 

kokhua

Member
Sep 27, 2018
86
47
91
And what do you need an interposer for in entry-level CPUs that will be made from a single die? :)

It can very easily accommodate a 2-die design. (50-60) + (100-110) mm² is smaller than the present die. $2 interposer + $2 assembly cost. An interposer can very much fit this price point. No, I don't see this as too expensive or too large, considering you keep GloFo happy, get the lowest possible 7nm die size and cost, and reduce your need for wafers, as these will be in high demand for a while since only TSMC will be offering this node until late next year.

Maddie is saying that Ryzen will use the same CPU die as ROME, along with an I/O die designed specifically for desktop, on a silicon interposer.
 

kokhua

Member
Sep 27, 2018
86
47
91
Guys, in this patent the cores are numbered Core 0, Core 1, ... Core 7, as if they are 8-core CCXs, or something like that.

But doesn't Windows, or any application that sees the THREADS of the CPU, see them the same way?
T0, T1, T2... T7.

Aren't we looking at Threads, not cores, despite the vagueness of the patent drawings?

Patent? What patent?
 

Glo.

Diamond Member
Apr 25, 2015
5,707
4,551
136
Maddie is saying that Ryzen will use the same CPU die as ROME, along with an I/O die designed specifically for desktop, on a silicon interposer.
Because it can. Imagine a situation where Intel delivers the Ice Lake architecture on a 12 nm process. What does AMD do with 4th gen Ryzen CPUs?

Adds 16C/32T to the lineup: interposer, two Matisse dies, and one additional die, just like the server parts. Everything else is just segmentation. Where you don't need a dual-die part, there will be a single die on the AM4 package.
Patent? What patent?
https://www.freshpatents.com/-dt20180823ptan20180239708.php
This patent.
20180239708-00000002.gif

This specific image. The memory controllers are part of the CPU die, aren't they?
 

kokhua

Member
Sep 27, 2018
86
47
91
Because it can. Imagine a situation where Intel delivers the Ice Lake architecture on a 12 nm process. What does AMD do with 4th gen Ryzen CPUs?

Adds 16C/32T to the lineup: interposer, two Matisse dies, and one additional die, just like the server parts. Everything else is just segmentation. Where you don't need a dual-die part, there will be a single die on the AM4 package.

I don't think multi-die and silicon interposer is the way to go for Ryzen. But never mind, we agree to disagree. Time will tell.

This specific image. The memory controllers are part of the CPU die, aren't they?

I haven't seen this patent before. Don't know what it is referring to.
 

moinmoin

Diamond Member
Jun 1, 2017
4,949
7,659
136
I'm not sure the chiplet approach of separating cores from uncore and then keeping the latter at 14/12nm makes much sense at the current state. We already have the situation that the cores in Zeppelin are pretty much perfectly optimized for power consumption, being essentially power gated at idle. On the other hand the uncore in Zeppelin is the huge power burner on MCM packages, multiplying the high level of power consumption at idle. Moving the cores to 7nm may improve their power consumption at load, with the new process node also allowing for higher performance. But the high power consumption of the uncore is not being tackled by centralizing it in one massive chiplet at an older node.
 

kokhua

Member
Sep 27, 2018
86
47
91
I'm not sure the chiplet approach of separating cores from uncore and then keeping the latter at 14/12nm makes much sense at the current state. We already have the situation that the cores in Zeppelin are pretty much perfectly optimized for power consumption, being essentially power gated at idle. On the other hand the uncore in Zeppelin is the huge power burner on MCM packages, multiplying the high level of power consumption at idle. Moving the cores to 7nm may improve their power consumption at load, with the new process node also allowing for higher performance. But the high power consumption of the uncore is not being tackled by centralizing it in one massive chiplet at an older node.

Just to clarify: I am *NOT* trying to figure out what is the best theoretical architecture for ROME. I am trying to *GUESS* what ROME might look like given the credible rumors that it is (a) 64C/128T, and (b) sports a 9-die configuration. That's what the diagram is all about.

Having said that, in Naples, the IF interconnects between the 4 Zeppelin dies burn a lot of power in operation. In this architecture, there are no IF links between the CPU dies; the "fabric" is moved into the System Controller (the "Cache Coherent Network"). Since the power hungry SERDES are no longer required, power is saved. However, the CPU-SC links do burn power.
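To put a very rough number on the SERDES point: AMD's published Zeppelin material quotes on the order of ~2 pJ/bit for the on-package IF SERDES links; the per-link bandwidth I use below and the energy of a short, wide parallel CPU-SC link are just my assumptions for comparison:

```python
def link_power_w(gbytes_per_s, pj_per_bit):
    # GB/s * 8 bits/byte * pJ/bit; the 1e9 and 1e-12 factors collapse to /1000.
    return gbytes_per_s * 8 * pj_per_bit / 1000.0

bw = 42.0                     # GB/s per direction, ballpark of one on-package IF link on NAPLES
serdes, parallel = 2.0, 0.5   # pJ/bit: published on-package SERDES ballpark vs assumed parallel link
print(f"SERDES IF link:   ~{link_power_w(bw, serdes):.2f} W at {bw} GB/s")
print(f"parallel IF link: ~{link_power_w(bw, parallel):.2f} W at {bw} GB/s")
```

Multiply that by the number of die-to-die links and both directions and it adds up, which is why dropping the inter-CPU-die SERDES and using short, low-energy CPU-SC links should save meaningful power even though those links are not free.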