64 core EPYC Rome (Zen2) Architecture Overview?


dnavas

Senior member
Feb 25, 2017
355
190
116
1. A dedicated team to develop the CPU

There is one arch design and two impl teams per Papermaster.
Going by what we've been told, there should have been an impl team on Ryzen2. From the outside, it looks to have been pretty thin development. Additionally, work done on memory, in 14nm, would not have been wasted effort for Zen2 -- unless the memory subsystem is getting a complete overhaul (the IF is, PCIE is, are they really planning to revamp mem as well -- :gulp: ). Aside from that, though, it seems like we're in agreement -- extra effort was put on Zen2. What's unclear to me is how much this is a result of running the plan vs unexpected road bumps.
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
Naples, Rome, Milan, Zen 4: An Interview with AMD CTO, Mark Papermaster

https://www.anandtech.com/show/1357...-4-an-interview-with-amd-cto-mark-papermaster



IC: With all the memory controllers on the IO die we now have a unified memory design such that the latency from all cores to memory is more consistent?


MP: That’s a nice design – I commented on improved latency and bandwidth. Our chiplet architecture is a key enablement of those improvements.
With regards to the interview, the bolded part of the answer to this question makes a case for a fairly early release of Rome.

"IC: There have been suggestions that because AMD is saying that Rome is coming in 2019 then that means Q4 2019.

MP: We’re not trying to imply any specific quarter or time frame in 2019. If we look at today’s event, it was timed to launch our MI60 GPU in 7nm, which is imminent. We wanted to really share with the industry how we’ve embraced 7nm, and preview what’s coming out very soon with MI60, and really share our approach on CPU on Zen 2 and Rome. We’re not implying any particular time in 2019, but we’ll be forthcoming with that. Even though the GPU is PCIe 3.0 backwards compatible, it helps for a PCIe 4.0 GPU to have a PCIe 4.0 CPU to connect to!"
 

coercitiv

Diamond Member
Jan 24, 2014
6,151
11,686
136
So, I'm not someone who has gone through a hardware launch -- my engineer hat has to read "I don't know." And you've written a leading, semi-rhetorical question, so my forum hat reads "interesting question" and my contrary/grumpy-old-man hat screams "YES!" :> Maybe answer with my own question -- were there spins done (completed) between November and Q1?
It wasn't a trick question, I had to check myself to remember the exact numbers: at the New Horizon event in 2016 they showed Ryzen running at 3.4 GHz with no boost enabled. The 1800X launched with base clocks at 3.6 GHz. Make of it what you will, but my take on it is AMD tries their best to keep clocks under wraps, and while the first Zen launch forced them to divulge clocks in order to show the architecture delivered on their promise, now they're not in the same situation. Moreover, they may need to make "last minute" adjustments based on what Intel does - IIRC Intel didn't divulge clocks either :)

Another interesting note from the interview -- they showed Rome to support their GPU launch (which is imminent). I don't want to start a war, but AMD is ... let's say they're hurting on the ATI side. My hope is that the strategy to reinvigorate their CPU side is going to reach their GPU team, but thus far....
RTG may hurt (badly) on the consumer side, on the compute side they're very likely to be ok.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
With so many cores they do need some effective cache coherency solution (because otherwise snoops will kill performance), and pretty certainly that is some sort of "Directory" based solution, just like several posters already predicted. And there are some possible variants that could work:

For chiplets:

1) A fully inclusive L3 of 32MB: lots of benefits, and the main drawback is 4MB of L3 burned on L2 inclusion. This L3 serves as prefetch destination and cache coherency agent for intra-chiplet traffic, as it knows what is inside the chiplet.
2) Some sort of "eviction" L3, Ryzen-like, but properly done in one chunk of 32MB. Has the drawback of still-limited prefetch, but otherwise nice characteristics.

For IO chip:

a) Directory with cache tags only -> still a substantial amount of transistors due to tags for 8*32MB of L3, but obviously saves a ton of area. For inter-chiplet and inter-chip traffic, it is the cache coherency agent.
b) Directory with L4 cache -> the problem here is that the L4 needs to be huge, 300+MB worth of transistors. Benefits are hard to quantify, as the L4 needs to be huge to perform as a cache, but in this case it is just keeping chiplet L3 content?
c) No directory and no L4 at all. Sounds horrible, but AMD is selling TR with that nasty memory hierarchy, so who knows. Btw this would still make an amazing chip with sub-NUMA clustering of 8 nodes -> perfect for VM and cloud.
d) Memory-side L4 for each controller


My bet is (1) + (a). Would make an amazing chip for servers, one that can run heterogeneous workloads (like some chiplets doing FP, busy in a local L3 working set, not disturbing whole-chip performance). It can be made even greater by (d) at any size -> AMD can go with whatever budget of L4 area is left and not be constrained by a need to hit 256+MB. Even 64MB of L4 total can enhance performance greatly -> like Skylake's eDRAM, acting as a buffer for memory reads/writes from I/O.
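Purely to make option (a) concrete, here's a toy sketch of a tag-only directory sitting in the IO die. Everything here (the class name, the two operations, collapsing the coherence states down to E/S/M/I) is my own invention for illustration, not anything AMD has disclosed:

```python
# Toy tag-only directory (option "a"): the IO die records which chiplets
# hold each cache line, so coherence traffic goes only to actual sharers
# instead of snoop-broadcasting to all 8 chiplets.

class TagDirectory:
    def __init__(self, num_chiplets=8):
        self.num_chiplets = num_chiplets
        # line address -> (state, set of chiplet ids holding the line)
        self.entries = {}

    def read(self, line, chiplet):
        """Record a read; return the chiplets that must be snooped first."""
        state, sharers = self.entries.get(line, ("I", set()))
        # Only a modified copy elsewhere forces a snoop on a read.
        snoop = sharers - {chiplet} if state == "M" else set()
        new_state = "E" if state == "I" else "S"
        self.entries[line] = (new_state, sharers | {chiplet})
        return snoop

    def write(self, line, chiplet):
        """Record a write; return the chiplets whose copies must be invalidated."""
        state, sharers = self.entries.get(line, ("I", set()))
        invalidate = sharers - {chiplet}
        self.entries[line] = ("M", {chiplet})
        return invalidate
```

The point of a directory shows up in the return values: a first read costs nothing, a write to a shared line invalidates only the chiplets actually listed, and only a modified copy forces a chiplet-to-chiplet transfer through the IO die.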
 

PotatoWithEarsOnSide

Senior member
Feb 23, 2017
664
701
106
I'm sure I read a few weeks ago that AMD stated that Rome would be this quarter.
I don't recall reading it from multiple sources, nor do I recall any further discussion about it from anyone here.
It seems unlikely though.
If only I could find the bloody source.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
I really hope I'm wrong, but Q2 for Rome might not be that far off from the truth. Even in the Daniel Bounds interview he said that they'll tell us more about the interconnect in Q1 or Q2. There wouldn't be any reason to keep any info secret once the product is already on the market.
Hopefully, if Rome takes longer, they still release Ryzen on time.

Schedules do change, and you can see from Papermaster's remarks that they're being cagey about Rome's release date. But they've never actually stated that there will be delays.

When did they ever state that? Also, schedules change. Charlie has indeed stated Q2 for Rome. In a tweet I can't find right now he stated Matisse for Q3/4, straight from the source.

I have to say I believe this rumored timeframe to be true. People thinking Rome in Q1, Matisse in Q2 will be in for a major disappointment.

Rome 19Q2:
https://www.semiaccurate.com/2018/11/09/amds-rome-is-indeed-a-monster/

First off, why are we worried about Charlie? Regardless,

https://wccftech.com/amd-epyc-rome-7-nm-2019-launch-zen-4-zen-5-revealed/

At Computex 2018, AMD announced that they are sampling the second generation, 7nm based EPYC ‘Rome’ processors in 2H 2018. AMD’s CEO, Lisa Su, even held a 7nm EPYC processor in her hands, showcasing it to the audience. The same processors are currently in AMD labs and being evaluated. Now at their one-year anniversary webinar, AMD Senior Vice President and General Manager of Datacenter and Embedded Solutions, Forrest Norrod, reaffirmed that they are going to bring 7nm processors as per schedule in early 2019.

So "early 2019" is Q2? Not in my book. Also how are they going to keep up the Vermeer release if it takes them that long to release Matisse? See above post by @DisEnchantment
 

H T C

Senior member
Nov 7, 2018
549
395
136
From AnandTech Papermaster's interview:

IC: Do the chiplets communicate with each other directly, or is all communication through the IO die?

MP: What we have is an IF link from each CPU chiplet to the IO die.

Does this mean there is zero communication between the chiplets? And what about between cores in the same chiplet?
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
That would appear to indicate that there are no links between the chiplets in Rome. How the cores are connected to one another within each chiplet is unclear. There may be a ring bus.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
There is one arch design and two impl teams per Papermaster.
Going by what we've been told, there should have been an impl team on Ryzen2. From the outside, it looks to have been pretty thin development. Additionally, work done on memory, in 14nm, would not have been wasted effort for Zen2 -- unless the memory subsystem is getting a complete overhaul (the IF is, PCIE is, are they really planning to revamp mem as well -- :gulp: ). Aside from that, though, it seems like we're in agreement -- extra effort was put on Zen2. What's unclear to me is how much this is a result of running the plan vs unexpected road bumps.
You missed my point entirely. If they were making any real alterations of Summit for Pinnacle, it would have required more manpower. Manpower dedicated to Zen 2 and 3.

AMD had a choice: make Zen 2 a major arch change, tech change and 7nm move, or dedicate resources to "fixing" Zen for Zen+. The latter would have meant not having a finished arch for 7nm, not going with a chiplet design, hell, maybe not even doing 7nm for 2019, because without the IO chip AMD couldn't keep to the WSA while switching to TSMC.

There is a fine line AMD walked. I don't see any scenario where AMD spends significant money and man hours working on Zen+ without completely taking the steam out of their efforts to further performance to prevent Intel from catching up.
 

H T C

Senior member
Nov 7, 2018
549
395
136
Here's a stupid question: do we even know for sure there is 32 MB L3 cache per chiplet?

I ask because it may be possible that there is no L4 cache in the IO chiplet, and what there is instead is a centralized 256 MB L3 cache, divided into 32 MB chunks: 1 per chiplet. If this is the case, then communication between the chiplets is only done within the IO chiplet and there is virtually zero latency there. On the other hand, there is the latency from the communication of CCX - IO - CCX, which is still unknown.
 

H T C

Senior member
Nov 7, 2018
549
395
136
Each chiplet would seem to be too big if it didn't contain L3.

Any idea yet how the 8 cores are arranged within the CCX? I'd guess in a 3 by 3 grid, with the middle one missing:

[attached image]


The CCX chiplets do seem to be a perfect square, no?
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,075
136
Here's a stupid question: do we even know for sure there is 32 MB L3 cache per chiplet?

I ask because it may be possible that there is no L4 cache in the IO chiplet, and what there is instead is a centralized 256 MB L3 cache, divided into 32 MB chunks: 1 per chiplet. If this is the case, then communication between the chiplets is only done within the IO chiplet and there is virtually zero latency there. On the other hand, there is the latency from the communication of CCX - IO - CCX, which is still unknown.

Think of it this way: moving data is expensive (power), executing on data is rather cheap.
Moving data inter-chip is going to be a lot more expensive than intra-chip.
Having to figure out where a line of memory is across many chips would be super expensive (latency and power).

Now what you have is a trade-off for each cache between power usage, clocks, latency and size. Each extra cache level you add will increase latency. So, all things being equal, making the L3 twice the size has a cost, having an L4 has a cost. But moving the memory controllers into the I/O die has a cost too (inter-chip transfer).

It makes the most sense to have as much cache as possible within the chiplet, and it makes the most sense to track memory (tags) within the last cache within the chiplet (@JoeRambo covered some options). It then also makes sense to track which chiplet has what memory within the I/O die. As to whether an L4 is worth it, that's hard to tell; I do agree the I/O die doesn't seem big enough for an L4 cache, but it could still have a very large amount of SRAM.

Even if you had an L4 that's fully inclusive on the I/O die, that doesn't solve/make memory requests between chiplets uniform. It in fact makes the cache coherency protocol very complex. There will always be an extra cost going inter-chiplet; it's just where and how you pay for it (latency, power, total memory throughput).
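That trade-off can be put in numbers with a standard average-memory-access-time model. All the hit rates and cycle counts below are invented placeholders, not AMD figures; the only point is that an extra level pays off only when its hit rate outweighs the lookup latency it adds to every miss that passes through it:

```python
# Toy AMAT model: each cache level charges its lookup latency to every
# access that reaches it; misses fall through to the next level.

def amat(levels, memory_latency):
    """levels: list of (hit_rate, latency_cycles), innermost first."""
    total, reach_prob = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach_prob * latency    # lookup cost for accesses reaching this level
        reach_prob *= (1.0 - hit_rate)   # fraction missing and going deeper
    return total + reach_prob * memory_latency

# Made-up numbers: L2 (12 cycles), chiplet L3 (40), a hypothetical L4 on
# the IO die (90), DRAM through the IO die (250 cycles).
no_l4   = amat([(0.90, 12), (0.60, 40)], 250)
with_l4 = amat([(0.90, 12), (0.60, 40), (0.50, 90)], 250)
```

With these placeholders the L4 wins (about 24.6 vs 26.0 cycles), but halve its hit rate and it loses; that's the "hard to tell" part above.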




Any idea yet how the 8 cores are arranged within the CCX? I'd guess in a 3 by 3 grid, with the middle one missing:

[attached image]


The CCX chiplets do seem to be a perfect square, no?

They aren't square. There will be more on a chiplet than just the cores and cache: there is still the IF I/O, clock gens, management systems; I wonder if memory decryption will happen within a chiplet or in the I/O die, etc.

The core arrangement will be dictated by the interconnect technology:

2x 4-core CCX == two squares like Zeppelin
A ring bus would equal one big block of 4x2
A full mesh single CCX seems highly unlikely, but a 4x2 arrangement seems most likely.
 
Last edited:

Beemster

Member
May 7, 2018
34
30
51
b) Directory with L4 cache -> the problem here is that L4 needs to be huge, 300+MB worth of transistors. Benefits are hard to quantify, as L4 needs to be huge to perform as cache, but in this case it is just keeping chiplet L3 content?


I just realized the 14nm GlobalFoundries process with embedded DRAM, as used for IBM Power 9, is on thick SOI. The process would look different on bulk and is more complex. I doubt very much AMD would design their hub chip in partially depleted SOI with thick BOX. So I'd have to say the eDRAM is not likely, and a large 256MB SRAM would not fit in the extra space on the hub chip. Oh well.

note: the control chip on the IBM z14 is about 700mm^2 and about 55% of the chip area is eDRAM, a total of 672MB of L4. The z14 has direct CP to CP links as well though.

https://fuse.wikichip.org/news/956/globalfoundries-14hp-process-a-marriage-of-two-technologies/4/

....snip

"The really big question for us is the fate of IBM’s eDRAM. IBM’s eDRAM technology is a force to be reckoned with and they certainly know it. They exploit this to their advantage by packing a whopping 6 MiB of private L2 as well as a shared 128 MiB of L3 cache on their z14 microprocessor and another 672 MiB of L4 in their control chip!

It’s unclear if IBM’s eDRAM related patents were handed over to GF as part of their deal when they were handed the fabs. And, if this is indeed the case, it would be interesting to see if the technology remains exclusive to IBM or if other foundry customers can access it. If someone like AMD has access to this technology, it could be highly advantageous for them to make use of eDRAM to pack a large amount of additional cache in order to more effectively compete with Intel’s Xeons in the server market through a distinctly unique differentiator."
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
I rather felt that the Ryzen2 release was disappointing.
You keep writing about this. What actual expectations of yours from before the release did Ryzen 2xxx disappoint? It was never going to be more than a glorified refresh. AMD barely talked about it, at first it was just a fine print mention of 14nm+ without any separate column.
 

Beemster

Member
May 7, 2018
34
30
51
You keep writing about this. What actual expectations of yours from before the release did Ryzen 2xxx disappoint? It was never going to be more than a glorified refresh. AMD barely talked about it, at first it was just a fine print mention of 14nm+ without any separate column.

The 12nm process was almost entirely a PROCESS upgrade that did not require any substantial re-design. If I remember correctly, the main change (among others) was an optimized fin etch profile and a 4nm reduction in DIBL allowing shorter minimum channel length.
 

dnavas

Senior member
Feb 25, 2017
355
190
116
You keep writing about this.

Do I? I don't mean to. I blame age :( Possibly inability to communicate.
What I mean to keep writing about is that I need better frame decode throughput. THAT I mean to.* Disappointment in Ryzen2 :meh:

What actual expectations of yours from before the release did Ryzen 2xxx disappoint?

I think what I'm trying to say is that IF you assume there was a team of engineers working on Ryzen2 THEN you would be disappointed by what was shipped. There were no tock-tock-tock, low-hanging-fruit, this-was-the-worst-case-scenario follow-ups. The most likely explanation is that there wasn't a whole team of engineers working on Ryzen2, yeah?

You're talking as if the plan was always to hit 7nm with a brand new architecture. Maybe it was. But why talk about worst-case scenarios -- brand new arch and brand new process -- if the next iteration was planned to be a brand new arch on a brand new process? It's kind of not worst case if it's every case -- something's fishy there. I am assuming that having a team was the original plan, and that that plan changed.

*: to be clear, that's kind of a joke poking fun at myself. I don't mean to keep writing about it, but I recognize that I do, and I can't help myself. [ed]
 

Beemster

Member
May 7, 2018
34
30
51
nextplatform.com

Very interesting article about buffered memory for Power 9 (Centaur tech).
...snip: "We really want to attach our memory with a SERDES design, with differential signaling. So we really want to get to a SERDES solution to talk to the outside world." It's not clear how much of this, if any, is relevant to the I/O hub chip on Rome. Can some circuit / memory system designer comment? This is way out of my field.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
Here's a stupid question: do we even know for sure there is 32 MB L3 cache per chiplet?

There absolutely needs to be a large cache on the chiplet to reduce bandwidth demand on the interface. Today, each L3 in a Ryzen chip (there are two, one for each CCX) offers >350GB/s bandwidth. This is not optional; it's very much in use. To feed 8 cores running at the same speeds as today, they'd need >700GB/s. There is no way to put 8x 700GB/s interconnects on package and not have it melt. In principle, they could have no L3 on chip and instead offer a very large L2 to absorb that bandwidth. However, part of why Ryzen does so well is its very fast and small L2, so I don't think this is likely. I think each chiplet has a shared L3 of at least 16MB. More likely, it will be 32MB, as this matches the "256MB of L3" reported by Canard.
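The scaling in that argument, spelled out (the 350 GB/s per-CCX figure is from the post above; everything else is simple arithmetic):

```python
# Back-of-envelope: the off-chiplet L3 bandwidth Rome would need if the
# L3 lived in the IO die instead of on each chiplet.

l3_bw_per_4c_ccx = 350            # GB/s, today's per-CCX L3 bandwidth
cores_per_chiplet = 8
chiplets = 8

per_chiplet = l3_bw_per_4c_ccx * cores_per_chiplet // 4   # 700 GB/s per chiplet
package_total = per_chiplet * chiplets                    # 5600 GB/s across the package
```

5.6 TB/s of sustained off-die traffic is far beyond what a package substrate of the era could carry within a server power budget, which is why the L3 almost certainly stays on the chiplet.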

And what about between cores in the same chiplet?

Cores in a CPU technically don't communicate with each other. They communicate with cache. I think that on-chiplet, all communication is managed by the L3, as this is fast enough and reduces spurious traffic.

Any idea yet how the 8 cores are arranged within the CCX? I'd guess in a 3 by 3 grid, with the middle one missing:

I think it's more likely that it will be a 2x4 array, with the L3 in slices in the middle, and IF at one end. To minimize transfer distances, you want the L3 between the cores, and a clear short path from the L3 to the IF controller. This can still be a very square shape, if the width of the cache and the cores is right.
 

Zapetu

Member
Nov 6, 2018
94
165
66
From AnandTech Papermaster's interview:

IC: Do the chiplets communicate with each other directly, or is all communication through the IO die?

MP: What we have is an IF link from each CPU chiplet to the IO die.

Does this mean there is zero communication between the chiplets?

As I understood it, each chiplet has only one Infinity Fabric link to the I/O die and there are no other (physical) links between chiplets. All requests, data and control go through the I/O die.

And what about between cores in the same chiplet?

Currently, CCXs (link) have 4 cores that each have fast access to each other's L3 caches. All four L3 caches inside a single CCX are fully connected (link):
[attached image: ISSCC slide]


Each core complex is connected to Infinity Fabric's (link) SDF (Scalable Data Fabric) using a CCM (Cache-Coherent Master), and the on-die SDF currently connects (in each Zeppelin) local memory controllers and all IF links (on-package or package-to-package) to other on-die SDFs. CCXs communicate with each other only through the SDF, and that adds latency compared to direct access (between L3 caches) inside each CCX. Making a CCX bigger (adding more cores) increases L3 latency, and there are benefits to smaller as well as larger CCXs.

We still don't know if there are two 4C CCXs or one 8C CCX in each chiplet. Ian tried to ask about this but Mark didn't disclose anything.

IC: When one core wants to access the cache of another core, it could have two latencies: when both cores are on the same chiplet, and when the cores are on different chiplets. How is that managed with a potentially bifurcated latency?

MP: I think you’re trying to reconstruct the detailed diagrams that we’ll show you at the product announcement!

If anyone feels like they don't know or have forgotten most about cache coherence and memory hierarchy (like me), here are some good videos about it:
https://www.youtube.com/user/Cjtatmitdotedu/videos

I especially recommend these four:

MIT 6.004 L13: The Memory Hierarchy

MIT 6.004 L14: Hardware Caches

MIT 6.004 L21: Cache Coherence

MIT 6.004 L22: Advanced Multicores
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
How the cores are connected to one another within each chiplet is unclear. There may be a ring bus.

There is no technical reason not to have a ring bus, but since R600 AMD has largely seemed to be "allergic" to ring buses. They seem to prefer to have a narrower crossbar, unlike Intel who build wider buses but are fine with building a ring or a mesh with extra hops.

Based on just this, I'd expect there to be a crossbar between the L3 slices and the cores. Not because it can't be a ring, but just because that's what AMD seems to want to build.
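For intuition on why the topology choice matters, here are average hop counts for a hypothetical 8-stop layout (a crossbar is always a single traversal regardless of endpoints):

```python
# Average hops between distinct endpoints on a ring of n stops.
# A bidirectional ring can route the short way around; a crossbar
# would be one traversal for every pair.

def ring_avg_hops(n, bidirectional=True):
    hops = [min(d, n - d) if bidirectional else d for d in range(1, n)]
    return sum(hops) / len(hops)
```

For 8 stops, a bidirectional ring averages 16/7 ≈ 2.3 hops and a unidirectional one 4.0, while a crossbar holds at one traversal but its wiring grows with the square of the port count, which is the cost AMD would be accepting.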
 

Zapetu

Member
Nov 6, 2018
94
165
66
The CCX chiplets do seem to be a perfect square, no?

Both kokhua and I have calculated the core counts, and while mine were initially a little bigger, I later got almost the exact same numbers that he did. Ian also confirmed that Rome is about 1000 mm² total. So using careful pixel counting techniques (images weren't as high-res or sharp as they could have been, though) I got the following:

Chiplet: ~ 7.3 mm x ~ 10.0 mm = ~ 73 mm²

I/O Die: ~ 15.0 mm x ~ 28.0 mm = ~ 420 mm²

Total: 8 x ~ 73 mm² + ~ 420 mm² = ~ 1004 mm²
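The same arithmetic as a quick script, for anyone re-checking against their own pixel counts (the dimensions are my measurements above, so treat them as approximate):

```python
# Rome die-area estimate from the measured dimensions above.
chiplet_mm2 = 7.3 * 10.0        # ~73 mm² per 7nm chiplet
io_die_mm2 = 15.0 * 28.0        # ~420 mm² for the 14nm I/O die
total_mm2 = 8 * chiplet_mm2 + io_die_mm2   # ~1004 mm², matching Ian's ~1000 mm²
```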

I just realized the 14nm GlobalFoundries process with embedded DRAM, as used for IBM Power 9, is on thick SOI. The process would look different on bulk and is more complex. I doubt very much AMD would design their hub chip in partially depleted SOI with thick BOX. So I'd have to say the eDRAM is not likely, and a large 256MB SRAM would not fit in the extra space on the hub chip. Oh well.

We still don't know which process the I/O die will use, but we know that 14HP has been in GloFo's hands since 2015. It's mostly up to IBM whether they allow AMD to use it, but AMD has known about it long enough to have utilized it in Rome's development process. 14HP has clear benefits (eDRAM) over 14LPP, and while it's a more expensive process, we're talking about server chips here. And TR 3k will mostly (at least initially) utilize I/O dies that would otherwise have been wasted. I don't know if AMD needs to or even can use 14HP, or how much capacity is left after IBM's Power9 chips, but it's still a viable candidate until proven otherwise. Mark didn't specify which process they would use:

IC: Can you confirm where the parts of Rome are manufactured?

MP: Chiplets on TSMC 7nm, the IO die is on GlobalFoundries 14nm.

It still may very well be that 14LPP is good enough for their needs and there is no need for a large L4 cache. At least then the I/O die would be easier to convert to other processes in the future that don't support eDRAM. There are benefits to both designs, but if they need large caches now, 14HP is the best option out there (not counting a 420 mm² chip on 7 nm). Later they should be able to lower latencies using packaging techniques similar to EMIB, or even active interposers/bridge chiplets, so future designs might not need a large L4 cache. Of course, there's still the question that even with 14HP there might not be enough room for a large enough L4 cache, since all the other I/O related things take too much space. Then why even bother with 14HP, since 14LPP is much easier to work with?
 
Last edited:

amd6502

Senior member
Apr 21, 2017
971
360
136
The amazing @kokhua predicts on Twitter that there is no L4$. I believe him. :openmouth:

On these forums it seemed the opposite to me. He infers some sort of L4 based on the larger than expected size of the IO die. Where is the tweet?

Speculation: it may be more of a very integrated (into the memory controller and CCM) multifunction unit between the memory controller and the chiplets, with a main role of cache coherence on the chiplet end, and secondarily as LLC and buffer (request queue and prefetch) on the memory controller end. I can imagine it's large because it's serving as an index for the large pool of L3s (16x 8MB if the current 2-CCX structure exists, or 8x 32MB, which seems to be the leading speculation; something est'd as being between 128MB and 256MB as a pool of L3) and as an index of shared entries that exist in duplicate on different CCDs (write locks). Wild speculation: there is an L4 of ~256MB that can be used for all three functions: buffer, working memory for the CCM, and LLC victim cache to the pool of L3.
 
Last edited:

HurleyBird

Platinum Member
Apr 22, 2003
2,670
1,250
136
I think it's more likely that it will be a 2x4 array, with the L3 in slices in the middle, and IF at one end. To minimize transfer distances, you want the L3 between the cores, and a clear short path from the L3 to the IF controller. This can still be a very square shape, if the width of the cache and the cores is right.

One issue is that, unless the Zen 2 cores are significantly taller and skinnier, there doesn't seem to be enough space on the die for two columns of 4 unless you kill the 32MB L3 cache rumour, or you stick a bunch of L3 on top of the columns in an asymmetric way. I was able to come up with this:



Which just barely squeezes everything, including 32MB of L3, inside the die, and that's with transistor scaling a bit on the liberal side.

L3 off to one side looks like this:



If the rumour about 32MB of L3 cache is correct, the die probably looks pretty similar to one of these two.
 
Last edited:

Beemster

Member
May 7, 2018
34
30
51
I thought some here estimated about 270mm^2 for all the I/O stuff on the hub chip. If so, then 256MB of 14nm eDRAM L4 will be about 150mm^2 and WILL fit. My question is: would AMD design it in a thick-BOX partially depleted SOI process? They used to use such a process at 28nm, so they surely are familiar with thick-BOX SOI.

You would think IBM would restrict its use to themselves, since the obvious application would compete with Power and the mainframe stuff. But then again, IBM management is not known to think too far ahead. Plus GlobalFoundries has to maintain it and probably demanded it be included in the merger deal. But who knows?

One question though: if there's no L4 eDRAM on the hub and no direct chiplet-to-chiplet connections, what can be so unique about this memory system that Papermaster advertises as a generational improvement in overall latency to memory? I don't get it.
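The ~150mm^2 figure can be cross-checked against the z14 numbers quoted earlier in the thread (~700mm^2 control chip, ~55% of it eDRAM, 672MB of L4). This assumes a hypothetical Rome L4 would hit similar array density:

```python
# Implied 14nm eDRAM density from the IBM z14, applied to a 256MB L4.
z14_edram_mm2 = 700 * 0.55                 # ~385 mm^2 of eDRAM arrays
density_mb_per_mm2 = 672 / z14_edram_mm2   # ~1.75 MB per mm^2
area_for_256mb = 256 / density_mb_per_mm2  # ~147 mm^2
```

So ~147mm^2 for 256MB is consistent with the ~150mm^2 estimate, provided the hub chip were built on an eDRAM-capable process in the first place.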