64 core EPYC Rome （Zen2）Architecture Overview？

kokhua · Sep 27, 2018

Tuna-Fish said:
The diagram is not credible at all, none of this makes any sense.

Hi, I made that drawing. Seeing that it is dismissed as BS, I was wondering if anyone here knows how Rome actually looks like?

Although this is just playful speculation on my part, I did give it serious thought, on the premise that Rome is 64C, 8+1 die. Other than this "architecture", I still cannot think of any other way to explain how a 9-die Rome would be technically feasible without serious compromises. Any help in unraveling the conundrum is greatly appreciated.

Thanks!

CatMerc · Sep 27, 2018

Glo. said:
LOLOLOLOLOLOL.

https://twitter.com/chiakokhua/status/1041487772429705216

And https://twitter.com/peresuslog/status/1041514114789597185

Move on, nothing to see here.

And guys remember, Matisse IS 8C/16T design. That is one thing sure, at this moment.

That's assuming AMD does the strategy of reusing server dies for consumer again. I am not convinced of this, for several reasons.

However I'm not going to elaborate further since the information is behind a paywall.

Personally I expect server dies to diverge from consumer dies in Zen 2.

jpiniero · Sep 27, 2018

CatMerc said:
Personally I expect server dies to diverge from consumer dies in Zen 2.

No way does AMD have the volume to do that.

maddie · Sep 27, 2018

kokhua said:
Hi, I made that drawing. Seeing that it is dismissed as BS, I was wondering if anyone here knows how Rome actually looks like?

Although this is just playful speculation on my part, I did give it serious thought, on the premise that Rome is 64C, 8+1 die. Other than this "architecture", I still cannot think of any other way to explain how a 9-die Rome would be technically feasible without serious compromises. Any help in unraveling the conundrum is greatly appreciated.

Thanks!

Some of the issues with this speculation is that we are faced with is this.

Rome is a X+1 layout, where X can be 8, 6 or maybe even 4. If the 7nm chiplets are mainly cores + cache, then these will be very small (~ 50-60 mm^2) and so the yields should be very high, even on a new node, with little opportunity for harvesting. That means that the lower core counts will have a very high cache to core ratio versus the full 64 core model. this assumes that the central uncore will be unchanged.

What is the situation with the desktop models? If they are the same 7nm die then we will need 2 die / CPU, cores die + desktop uncore die. If they are not the same design, then AMD is splitting the server and desktop lines.

Also, if this happens, then I don't see this happening without interposers because of latency and power considerations. Going through this work and then routing signals through the package substrate seems short sighted to me.

Just my thinking.

edit: The desktop interposer, if it happens will be very small (~ 125-150 mm^2), and have a much lower connection density than HBM, so should not be that costly to implement. The server market can easily support 800-900mm^2 interposers.

kokhua · Sep 27, 2018

maddie said:
Rome is a X+1 layout, where X can be 8, 6 or maybe even 4. If the 7nm chiplets are mainly cores + cache, then these will be very small (~ 50-60 mm^2) and so the yields should be very high, even on a new node, with little opportunity for harvesting. That means that the lower core counts will have a very high cache to core ratio versus the full 64 core model. this assumes that the central uncore will be unchanged.

As I said, this diagram is based on the premise that the rumor that Rome is 8+1 dies is true. Considering the track record of the sources (AdoredTV, etc.), I have no reason to doubt it.

I too have estimated the CPU dies at 50-60mm^2. Very tiny, and I think this is one of the reasons for AMD to move to 8+1 configuration instead of sticking with 4 dies as in Naples. TSMC 7nm is new, so the smaller the die, the higher the yields and the lower the risk. I agree that there will be little opportunity for harvesting. But notice that this architecture allows another option for product segmentation: using fewer CPU dies for lower core count SKUs while retaining the full 8 memory channels and 128 PCIe lanes. For example, you can use 4 fully good CPU dies (+ the SC die), instead of using 8 half-functioning CPU dies for a 32C SKU.

maddie said:
What is the situation with the desktop models? If they are the same 7nm die then we will need 2 die / CPU, cores die + desktop uncore die. If they are not the same design, then AMD is splitting the server and desktop lines.

If Rome is 8+1, it follows logically that Ryzen desktop models must be a different die, even if you assume the architecture is similar to Naples.

maddie said:
Also, if this happens, then I don't see this happening without interposers because of latency and power considerations. Going through this work and then routing signals through the package substrate seems short sighted to me.

I agree that there is the issue of latency due to the memory controller being moved to the SC die. That is why I noted that the CPU-SC die link must be very low latency. One way is to use a wide parallel interface to avoid the IF/XGMI SERDES latency altogether. Notice that the CPU-SC links are very short and direct, on the other of 2-3mm, so it is possible to use very low power drivers with very low voltage swing (<1V), possibly with a cheap organic substrate. Other than the CPU-SC links, everything else is the same as Naples, so I think the power should be manageable. With regards to signal routing, this is far simpler than Naples where you have crissing-crossing IF links between dies. The only issue I can think of is this: if the CPU-SC link is parallel, you might need 50-100 signals for sufficient bandwidth. Given the small CPU die size, you may need a very high density interface and packaging method, something like Intel's EMIB.

CatMerc · Sep 27, 2018

jpiniero said:
No way does AMD have the volume to do that.

They might.

jpiniero · Sep 27, 2018

CatMerc said:
They might.

Unless they are including a GPU somehow, it's not happening. The enthusiast market just isn't anywhere near big enough to justify the expense.

kokhua · Sep 27, 2018

jpiniero said:
Unless they are including a GPU somehow, it's not happening. The enthusiast market just isn't anywhere near big enough to justify the expense.

Think about it: if there are 8 CPU dies, then each one must be 8C/16T with 1 DDR4 memory channel. Such a die is unsuitable for Ryzen. You could use 2 CPU dies + 1 I/O die to make a 16C/32T + 2ch DDR4 MCM package for Ryzen desktop but I don't think it makes sense. Besides, AM4 is a small PGA package, I doubt you can fit a 3-die MCM on it.

Markfw · Sep 27, 2018

kokhua said:
Think about it: if there are 8 CPU dies, then each one must be 8C/16T with 1 DDR4 memory channel. Such a die is unsuitable for Ryzen. You could use 2 CPU dies + 1 I/O die to make a 16C/32T + 2ch DDR4 MCM package for Ryzen desktop but I don't think it makes sense. Besides, AM4 is a small PGA package, I doubt you can fit a 3-die MCM on it.

I don't know what the config will be, but AMD did the 2990WX, and it sure works pretty good, and many said it would be bad. So whatever they do, it will be awesome.

kokhua · Sep 27, 2018

Markfw said:
I don't know what the config will be, but AMD did the 2990WX, and it sure works pretty good, and many said it would be bad. So whatever they do, it will be awesome.

2990WX is artificially crippled because it has only 4 of the 8 DDR4 channels enabled. 2 of the 4 CPU dies do not have direct access to the memory controller and must go through an unnecessary hop to get to memory. This is not acceptable for server CPUs.

Markfw · Sep 27, 2018

kokhua said:
2990WX is artificially crippled because it has only 4 of the 8 DDR4 channels enabled. 2 of the 4 CPU dies do not have direct access to the memory controller and must go through an unnecessary hop to get to memory. This is not acceptable for server CPUs.

I have one. Yes its a little "crippled", but you can't tell aside from benchmarks.

And I use 100% of the CPU power. 24/7/365

kokhua · Sep 27, 2018

It may be fine for your application, but certainly not for server workloads.

jpiniero · Sep 27, 2018

kokhua said:
Think about it: if there are 8 CPU dies, then each one must be 8C/16T with 1 DDR4 memory channel. Such a die is unsuitable for Ryzen.

Unsuitable, no, performance hit yes in low threaded memory bandwidth limited scenarios. That's the tradeoff with going with 8 core dies versus bumping the die to 16 and sticking with dual channel.

kokhua · Sep 27, 2018

jpiniero said:
Unsuitable, no, performance hit yes in low threaded memory bandwidth limited scenarios. That's the tradeoff with going with 8 core dies versus bumping the die to 16 and sticking with dual channel.

I don’t agree, but I suppose you can still make something of a case for the performance tradeoff. What about the MCM on desktop AM4 issue?

maddie · Sep 27, 2018

kokhua said:
As I said, this diagram is based on the premise that the rumor that Rome is 8+1 dies is true. Considering the track record of the sources (AdoredTV, etc.), I have no reason to doubt it.
.................................
If Rome is 8+1, it follows logically that Ryzen desktop models must be a different die, even if you assume the architecture is similar to Naples.
.................................
Given the small CPU die size, you may need a very high density interface and packaging method, something like Intel's EMIB.

I don't agree that it must be a different 7nm CPU die if Rome is max 8+1. Desktop Ryzen could easily be a 1+1 layout using the same 7nm CPU die. In this case, we have 1 cpu die and 2 Uncore die to service the desktop and server markets.
Also if the desktop part is unitary, then it will be twice the size on 7nm, with much lower yields. The opposite with what you need, namely, the cheaper part yields worse than the expensive part.

Seeing that EMIB is Intel and in reality a small bridge interposer. A full size interposer for desktop is not that expensive. I read an IEEE paper where small 65nm interposers cost around US $2/100mm^2. The connection density will be a lot less than needed for HBM so you don't need microbump density soldering, and the Server markets can easily afford larger interposers.

kokhua said:
Think about it: if there are 8 CPU dies, then each one must be 8C/16T with 1 DDR4 memory channel. Such a die is unsuitable for Ryzen. You could use 2 CPU dies + 1 I/O die to make a 16C/32T + 2ch DDR4 MCM package for Ryzen desktop but I don't think it makes sense. Besides, AM4 is a small PGA package, I doubt you can fit a 3-die MCM on it.

I must have missed something. Isn't the uncore die the one having the memory controllers? The CPU die and memory controller ratio are now independent of each other.

jpiniero · Sep 27, 2018

kokhua said:
I don’t agree, but I suppose you can still make something of a case for the performance tradeoff. What about the MCM on desktop AM4 issue?

If the dies are tiny, there should be plenty of room. Hell, Raven Ridge is 240 mm2.

Or they could just create a new socket...

kokhua · Sep 27, 2018

jpiniero said:
If the dies are tiny, there should be plenty of room. Hell, Raven Ridge is 240 mm2.

Or they could just create a new socket...

There's not plenty of room for 2+1 dies on the AM4 package. But OK, I grant you that. What about additional cost of MCM then?

As for new socket to replace AM4, you know it is not going to happen in the Zen 2 generation; AMD has committed to that.

Personally, I don't believe Ryzen will be anything other than a monolithic die. At least for the Zen 2 generation.

kokhua · Sep 27, 2018

maddie said:
I don't agree that it must be a different 7nm CPU die if Rome is max 8+1. Desktop Ryzen could easily be a 1+1 layout using the same 7nm CPU die. In this case, we have 1 cpu die and 2 Uncore die to service the desktop and server markets.
Also if the desktop part is unitary, then it will be twice the size on 7nm, with much lower yields. The opposite with what you need, namely, the cheaper part yields worse than the expensive part.

If you are going to have to tape out another die, might as well do a 7nm 8C/16T monolithic die for desktop, why settle for a separate uncore die for desktop (even though it may be simpler and cheaper to design)? Yes, the die size will be bigger, but at ~100mm^2, yield should not be a problem. (Note: If you use a 2.5X scaling factor on Zeppelin's 213mm^2, you get 85mm^2. I am assuming some things like SERDES don't scale as well). For comparison, Apple's A12 is ~83mm^2. Also by the time Ryzen enters production, 7nm process would have been much more mature.

maddie said:
Seeing that EMIB is Intel and in reality a small bridge interposer. A full size interposer for desktop is not that expensive. I read an IEEE paper where small 65nm interposers cost around US $2/100mm^2. The connection density will be a lot less than needed for HBM so you don't need microbump density soldering, and the Server markets can easily afford larger interposers.

I don't know how much silicon interposers cost aside from the fact that they are not cheap. But I am not ruling out the use of interposers for ROME. I only suspect it may not be necessary in this case. Actually, AMD may have something similar (better?) than EMIB: https://patents.google.com/patent/US20180102338A1/en?oq=20180102338

maddie said:
I must have missed something. Isn't the uncore die the one having the memory controllers? The CPU die and memory controller ratio are now independent of each other.

Sorry for the confusion. In this case, I was referring to 8+1 dies but the uncore is just simple "I/O"; the memory controllers remain on the CPU die. This is or has been the popular assumption.

DrMrLordX · Sep 27, 2018

kokhua said:
It may be fine for your application, but certainly not for server workloads.

You realize that AMD sells this thing called EPYC, right? Just sayin.

kokhua · Sep 27, 2018

DrMrLordX said:
You realize that AMD sells this thing called EPYC, right? Just sayin.

Of course I do. What is your point?

kokhua · Sep 27, 2018

Oh, I see what you mean.

You realize that EPYC has all 8 memory channels enabled, unlike TR2, right? Just sayin.

maddie · Sep 27, 2018

For general interest on the often stated high cost of interposer tech. ( I was wrong about IEEE paper)

https://electroiq.com/2012/12/lifting-the-veil-on-silicon-interposer-pricing/
Lifting the veil on silicon interposer pricing

At the recent Georgia Tech-hosted International Interposer Conference, Matt Nowak of Qualcomm and Nagesh Vordharalli of Altera both pointed to the necessity for interposer costs to reach 1$ per 100mm2 for them to see wide acceptance in the high-volume mobile arena. For Nowak, the standard interposer would be something like ~200mm2 and cost $2. The question that was posed but unanswered was: "Who will make such a $2 interposer?"

Less than a month later, this question began to be answered as several speakers at the year-ending RTI ASIP conference (Architectures for Semiconductor Integration and Packaging) began to lift the veil on silicon interposer pricing.

Sesh Ramaswami, managing director at Applied Materials, showed a cost analysis which resulted in 300mm interposer wafer costs of $500-$650 / wafer. His cost analysis showed the major cost contributors are damascene processing (22%), front pad and backside bumping (20%), and TSV creation (14%).

Ramaswami noted that the dual damascene costs have been optimized for front-end processing, so there is little chance of cost reduction there; whereas cost of backside bump could be lowered by replacing polymer dielectric with oxide, and the cost of TSV formation can be addressed by increasing etch rate, ECD (plating) rate, and increasing PVD step coverage.

Since one can produce ~286 200mm2 die on a 300mm wafer, at $575 (his midpoint cost) per wafer, this results in a $2 200mm2 silicon interposer.

Lionel Cadix, packaging analyst of Yole D

maddie · Sep 27, 2018

kokhua said:
If you are going to have to tape out another die, might as well do a 7nm 8C/16T monolithic die for desktop, why settle for a separate uncore die for desktop (even though it may be simpler and cheaper to design)? Yes, the die size will be bigger, but at ~100mm^2, yield should not be a problem. (Note: If you use a 2.5X scaling factor on Zeppelin's 213mm^2, you get 85mm^2. I am assuming some things like SERDES don't scale as well). For comparison, Apple's A12 is ~83mm^2. Also by the time Ryzen enters production, 7nm process would have been much more mature.

Do you need a better reason? Also don't forget that 7nm will cost more to produce initially, so cheaper to produce as well. AFAIK, AMD still has the WSA agreement with GloFlo. Until real data surfaces then the assumption that they can freely lower their production at that fab is misguided.

kokhua · Sep 27, 2018

maddie said:
Do you need a better reason? Also don't forget that 7nm will cost more to produce initially, so cheaper to produce as well. AFAIK, AMD still has the WSA agreement with GloFlo. Until real data surfaces then the assumption that they can freely lower their production at that fab is misguided.

Yes, a better reason is definitely needed for avoiding a separate design for monolithic Ryzen. On the other hand, I can think of several reasons to do it:

1. Ryzen will need to beat, or at least match, Intel's Coffee Lake Refresh on IPC; monolithic design without the latency trade-offs of MCM has a much better chance of doing that.

2. A monolithic die for desktop is not all that difficult or costly to design given Zen's lego-like modular architecture.

3. Whatever one-time cost savings you get by avoiding a monolithic design, you end up paying for it via the MCM cost-adder on relatively high volume desktop CPU's, many times over.

4. The SC (or I/O) die is rumored to be 14nm, that should help fulfill the WSA commitments.

Topweasel · Sep 27, 2018

kokhua said:
If you are going to have to tape out another die, might as well do a 7nm 8C/16T monolithic die for desktop, why settle for a separate uncore die for desktop (even though it may be simpler and cheaper to design)? Yes, the die size will be bigger, but at ~100mm^2, yield should not be a problem. (Note: If you use a 2.5X scaling factor on Zeppelin's 213mm^2, you get 85mm^2. I am assuming some things like SERDES don't scale as well). For comparison, Apple's A12 is ~83mm^2. Also by the time Ryzen enters production, 7nm process would have been much more mature.

I don't know if AMD would do MCM on desktop. I don't know If AMD is anywhere near Chiplet level just yet. Which makes a lot of the "You know this is the case" stuff weird. We don't know if Matasse is 8C only or the Rome is X+1. There are a lot of certainties that are not so.

But in a world where AMD is ready to go the chiplet route. The advantage would be die space. Bunches of smaller dies means better yields. Ryzen is very wasteful in that sense. IO that pretty much only exists because it is either A.) only needed because it is the sole chip for a CPU and therefore needlessly duplicated on EPYC. B.) There are things in Ryzen that are wasted as a single chip CPU, things obviously targetting workstation and EPYC loads with multiple chips.

A comm chiplet would be a lot less complicated. The larger ones would have a few more interconnects and basically just cache. So while yes "multuple dies" it would require less time and effort to manufacture. On Ryzen you would have a really really really small commchiplet in comparison and overall die usuage the two could even get away with being smaller than if all the functionality was just in one die like it is right now (assuming no comm chiplet for Epyc using the same dies). Going chiplet would give AMD exactly what they sought out with Zen and Epyc design. They can work out the downsides of IF and MCM, while still maintaining the complete flexibility of die assignments they have now.

64 core EPYC Rome （Zen2）Architecture Overview？

Member

Golden Member

Lifer

Diamond Member

Member

Golden Member

Lifer

Member

Moderator Emeritus, Elite Member

Member

Moderator Emeritus, Elite Member

Member

Lifer

Member

Diamond Member

Lifer

Member

Member

Lifer

Member

Member

Diamond Member

Diamond Member

Member

Diamond Member