64 core EPYC Rome （Zen2）Architecture Overview？

lightmanek · Nov 8, 2018

My 2c:

I'm in the two chiplets for AM4 camp, here is why! Notice how many memory channels we have on fully populated EPYC Rome (8) and how many 8c chiplets there are? Yes, same number! AMD has widened IF to 64b each direction, which in theory, depending on IF clocks, should be enough for dual channel 128bit DDR4, but it would have to be clocked quite high when paired with DDR4 4000+. This might not be the most power efficient solution, plus if there is L4 cache in IO die, it might be underutilised by just single chiplet.

I therefore suspect, AM4 top end Zen 2 CPU's will have 2 chiplets and up to 16 cores.
BTW this is not excluding AMD from creating castrated 2x4 or even 2x2 cores SKU's, but I think more likely we will get 2x4c at the bottom for R5 family and 1x4c or 1x6c for R3 family with lower official memory speed support.

Topweasel · Nov 8, 2018

The main reason I am sold on the idea of 2 Chiplets for AM4. Is the eventual APU.

AMD has designed themselves out of having to to come up with a new die for each feature set. The 7nm APU will be a Navi based GPU and further a move for AMD towards MCM for video, CPU, and APU. It would make sense that a Video card which has even looser packaging restrictions would be the next step in this. APU's and GPU's will share the same die and future semi custom orders (ie Xbox/PS) they can develop a package even easier using 90% of stuff they are already building.

So anyways. If we know that AMD is unlikely to revert to having 1-3 Monolithic dies after moving away from them. It makes sense that the Zen 2/Navi APU is going to be it's own Chiplet. If so AMD is probably going to want to share the same IO chip between the two. It would make sense if both chips are two chiplets and an IO. Therefore I think it's likely that AMD will have two Chiplets and a possible 16c on Desktop Ryzen.

Ajay · Nov 8, 2018

Atari2600 said:
Why would you do that?

Unless you are in a bandwidth limited scenario (how many of those are latency sensitive) - then you send the request to DRAM at the same time as you send the request to the cache snoop. Whichever returns first valid response you use.

No, you don't send them at the same time. You would slow the CPU to a crawl as the mem controller waited to complete requests. You just work up the cache hierarchy on each cache miss till you actually have to go out to memory. That's why cache exists.

jpiniero · Nov 8, 2018

You know, if AMD did it right they could have an Navi chiplet that could double as the low end discrete GPU. Just put the GDDR6 controller on there, and it would also give OEMs with mobile/BGA machines the option of adding GDDR6 memory.

HurleyBird · Nov 8, 2018

The other possibility for an APU is a monolithic CPU die + GPU-HBM combination ala. Kaby-G.

Tuna-Fish · Nov 8, 2018

lightmanek said:
AMD has widened IF to 64b each direction

Where is this from? Original IF was 16b per direction. Going to PCI-E4 PHYs means they get twice the bandwidth per pin, crank the clock up a little more and they don't need to widen it at all if they don't feel like it. Widening to 32b might be in the cards, but I would be very surprised indeed if it went up to 64b.

Arzachel · Nov 8, 2018

coercitiv said:
Why not leave that for the enthusiast platform, where the 16 core can stretch it's legs with more mem bandwidth and power budget?

Besides demolishing 9900K in benches, there's a price point between top end mainstream and low end enthusiast that they could fill reasonably well with 12 to 16 core parts. Think 400-750$ for a theoretical R9 line.

IRobot23 said:
Yeah, and AMD stated CCX has 8 cores.

Where has AMD stated this?

Zapetu · Nov 8, 2018

IRobot23 said:
Yeah, and AMD stated CCX has 8 cores.

Arzachel said:
Where has AMD stated this?

To be exact, they stated that each chiplet will have 8 cores. They didn't say anything about CCXs though.

There's still some confusion if the PCIe lanes would be located on the chiplets since this slide just states I/O in the middle:

I seems that whoever makes these slides at AMD likes to call PCIe just I/O:

There's very little point to locate PCIe anywhere else than in the I/O die.

I also seems certain that Rome doesn't have any small interposers under each pair of chiplets but that was a long shot in the first place.

No interposers spotted there. As previously stated, AMD using Intel's EMIB, which would not necessarily be visible on the surface, is a big no. Here's the original picture of AMD Rome:

What the actual topology inside each chiplet and inside the I/O die is remains a mystery. The I/O die has well over 20 nodes including 8 chiplets (cache coherent masters, CCMs) and 8 memory controllers. The the PCIe lanes could be divided to 2-8 groups and maybe IFIS need their own nodes even if they use the same wires as PCIe. Then there might be some other IO also. Chiplets could only have 8 cores and 1 InfinityFabric-link and that's only 9 nodes compared to 20+. Still even an 8-core CCX is a hard problem.

DrMrLordX · Nov 8, 2018

coercitiv said:
I don't know, I'm torn. If I had to pick the one piece of information that would lead me to believe an 8 core monolithic APU is in the works, that would be that old rumor that the 4.5Ghz Ryzen sample was being worked on within RTG labs.

Allegedly that wasn't the reason for the chip being at RTG. They had to do some driver reworking or something, to make sure their drivers would work on the new system, or something or other.

PotatoWithEarsOnSide said:
Regarding latency, IF is at MEMCLOCK currently, which is 1600MHz best case currently. If that is decoupled and linked to clockspeed, we'd potentially be looking at 4GHz+, so even if going off die increases the number of cycles (or whatever it is referred to as), then unless we're talking big numbers then overall latency is likely to fall anyway.

IF was linked to memclck originally to reduce the number of clock domains (asynchronous clockspeed = more clock domains = more interdomain latency, I think). My guess is they'll still want them to run synchronously in some fashion, and my guess would be IF will be at 2x memclck (basically, IF speed = mem speed rating).

Vattila said:
Yeah. Although 4-socket is a smaller market, Lisa Su wants to play in high-performance compute. AMD is a participant in government-funded exa-scale research, with systems planned in the not too distant future. 4-socket capability would be another step up in compute density and allow them to compete better in the supercomputer realm.

Looks like Intel is retreating from the 4P space a bit. Cascade Lake-AP will be 2S at most. Granted I do not know if they will feature a different version of Cascade Lake for 4P and up configurations, and the product I'm referencing is largely meant as a replacement for Xeon Phi.

Arzachel · Nov 8, 2018

Zapetu said:
To be exact, they stated that each chiplet will have 8 cores. They didn't say anything about CCXs though.

There's still some confusion if the PCIe lanes would be located on the chiplets since this slide just states I/O in the middle:

I seems that whoever makes these slides at AMD likes to call PCIe just I/O:

There's very little point to locate PCIe anywhere else than in the I/O die.

I also seems certain that Rome doesn't have any small interposers under each pair of chiplets but that was a long shot in the first place.

No interposers spotted there. As previously stated, AMD using Intel's EMIB, which would not necessarily be visible on the surface, is a big no. Here's the original picture of AMD Rome:

What the actual topology inside each chiplet and inside the I/O die is remains a mystery. The I/O die has well over 20 nodes including 8 chiplets (cache coherent masters, CCMs) and 8 memory controllers. The the PCIe lanes could be divided to 2-8 groups and maybe IFIS need their own nodes even if they use the same wires as PCIe. Then there might be some other IO also. Chiplets could only have 8 cores and 1 InfinityFabric-link and that's only 9 nodes compared to 20+. Still even an 8-core CCX is a hard problem.

Yeah, while I'm inclined to think that the whole chiplet is a singular core complex, I don't think AMD has said anything about that. Though the PCI-E link stuff seems like a misinterpretation arising from the slides being very vague.

Atari2600 · Nov 8, 2018

Ajay said:
No, you don't send them at the same time. You would slow the CPU to a crawl as the mem controller waited to complete requests

Why would it wait to complete requests that it could retire as soon as it got a cache hit?

Abwx · Nov 8, 2018

lightmanek said:
I therefore suspect, AM4 top end Zen 2 CPU's will have 2 chiplets and up to 16 cores.
.

That s highly likely given that a 8C part would be competing against both Intel s and own AMD current 8Cs, the latter being dirt cheap nowadays (with Ryzen 1 8Cs at 169-229€ in the EU) and Ryzen 2 following suite in the coming months, all these will be well below 200€$ in Q4 2019.

Best they can do is to up the core count as a mean to make a worthy differentiation and have something to sell in the 300-400€$ segment, i for sure would prefer a 16C clocked at 3.7 base than a 8C clocked at 4.4Ghz, wich the former can do anyway with half of its cores.

Zapetu · Nov 8, 2018

lightmanek said:
AMD has widened IF to 64b each direction

Tuna-Fish said:
Where is this from? Original IF was 16b per direction. Going to PCI-E4 PHYs means they get twice the bandwidth per pin, crank the clock up a little more and they don't need to widen it at all if they don't feel like it. Widening to 32b might be in the cards, but I would be very surprised indeed if it went up to 64b.

I'm guessing that the first one refers to IFOPs (Infinity Fabric On-Package) and the second one to IFIS (Infinity Fabric InterSocket). IFIS is 16b bidirectional (8 transfers per CAKE clock) and has power effeciency of ~11pJ/b (source) or even as low as ~9pJ/b (source). IFOP on the other hand is currently 32b bidirectional (4 transfers per CAKE clock) and has power effeciency of ~2 pJ/b (source). Whatever AMD has widened IFOPs to 64b, I dont'n know. Both techniques use SerDes (Serializer/Deserializer) and obviously IFOP links have lower latency than IFIS links while both have about equal bandwith in AMD EPYC Naples.

Infinity Fabric on-die (in Zeppelin) is a 256-bit bidirectional (parallel) bus running at MemClock which is the same as CAKE clock. Each CAKE serializes one 128-bit SDF request.

lightmanek · Nov 8, 2018

Zapetu said:
I'm guessing that the first one refers to IFOPs (Infinity Fabric On-Package) and the second one to IFIS (Infinity Fabric InterSocket). IFIS is 16b bidirectional (8 transfers per CAKE clock) and has power effeciency of ~11pJ/b (source) or even as low as ~9pJ/b (source). IFOP on the other hand is currently 32b bidirectional (4 transfers per CAKE clock) and has power effeciency of ~2 pJ/b (source). Whatever AMD has widened IFOPs to 64b, I dont'n know. Both techniques use SerDes (Serializer/Deserializer) and obviously IFOP links have lower latency than IFIS links while both have about equal bandwith in AMD EPYC Naples.

Infinity Fabric on-die (in Zeppelin) is a 256-bit bidirectional (parallel) bus running at MemClock which is the same as CAKE clock. Each CAKE serializes one 128-bit SDF request.

Yes, 64b for IFOP. It was mentioned during conference or one of the talks after it and it makes sense as Zen 2 IF has to carry double the data per clock as the whole CPU datapath got doubled to 256bit.

BTW according to twitter post from one of the journalist present at the event, PCIe links are on the IO chip, not the chiplets.

Zapetu · Nov 8, 2018

lightmanek said:
Yes, 64b for IFOP. It was mentioned during conference or one of the talks after it and it makes sense as Zen 2 IF has to carry double the data per clock as the whole CPU datapath got doubled to 256bit.

Makes sense. This is what Ryzen data path currently looks like:

lightmanek said:
BTW according to twitter post from one of the journalist present at the event, PCIe links are on the IO chip, not the chiplets.

Good to know.

Ajay · Nov 8, 2018

Atari2600 said:
Why would it wait to complete requests that it could retire as soon as it got a cache hit?

Even if you retired the request the memory won’t be ready to receive another request for many more cycles and the controller knows this. I suppose you could have logic to remove entries from a queue in the controller on a a cache hit. That sort of logic would have to globally work with all caches which seems to unnecessarily mess with the normal operation of the cache-memory hierarchy.

amd6502 · Nov 8, 2018

Good summary and a total slam dunk from kokhua.

https://forums.anandtech.com/threads/64-core-epyc-rome-（zen2）architecture-overview？.2554453/page-7

With more CCX's it seems like the distributed method in Epyc 1 needed a change. It looks like total winning strategy. eypic 2 should be a good premium high thread hi performance complement for epyc 1. And like kokhua says, 3 dies and gives them a vast range in servers as well as high end desktop.

yuri69 said:
Okay, time for a reality check.

Let's lay down a base context first:
* AMD's R&D budget has been constrained a lot in past 10 years
* AMD started working on znver2 probably in 2013 or 2014
* Naples was the first iteration of the Zen server line - its roots are back to 10h Magny Cours
* Naples use Zeppelin die which is reused in client Ryzen and Threadripper; Zeppelin CCX is reused in Raven APU
* back in summer 2017 there was no Rome but a Starship instead - 48c/96t znver2
* later in 2017 the reliable Canard PC specified "EPYC 2" as 64c, 256MB L3 (4MB per core), PCIe4
* nowadays Charlie@SA has been happy with current Rome config, AMD is confident, etc.

So Rome seems to be a 64c chip or a bit less likely a 48c one.

Now, let's introduce the current "rumor mill" favorite plan aka chiplets. According to a youtuber, the Rome top SKU consists of 9 chips - 1 IO and 8 compute. Details are sparse, but it seems the IO chip would be manufactured at an older process than the compute ones. This idea was further detailed in the diagram posted by OP.

== Naples scaled ==
* double L3 per core - Keep the traffic levels down.
* 8 cores in a CCX - The core interconnect can't probably be a Nehalemish xbar but for instance a SandyBridge's ring bus or whatever. It adds complexity (as in Sandy in 2011) and requires a special CCX for APUs.
* 2 CCXs on a die - This opens up possibilities for a nice TR and scaled down Ryzens. At the same time it keeps the level of complexity down - identical CCXs. Uniform intercore latency for ubiquitous 8c is a nice bonus.
* 4 dies on a package - Simply keep the socket, NUMA mapping, etc. the same.

=> Major investments are: new CCX for APUs, redone intra CCX interconnect and cutting-down Ryzens.

== The chiplets ==
* 8 cores in a CCX - The same issues as above. 8c intercore latency also the same.
* New type of "low latency" interconnect - low latency, a super-high power efficiency (all traffic past L3 goes out of the chip, back to the IO chip, then to RAM) => R&D
* The IO ccMaster - dealing with traffic from all 64c at low latency => R&D
* L4 - R&D
* IO chip itself - can it be reused for ordinary Ryzens - 1x IO + 1x compute? Wasting server-grade IO and L4 for desktop? A different die?

=> Major investments: ???

Now, it's time to lay the Occam's razor. The chiplet solution vs an ordinary one.

Does it make sense to throw away the Magny-Naples know-how given the budget? Mind you, this was really a decision made back in ~2014 (the times when Kaveri struggled with its crippled fw).

Does it make sense to reject znver1 and go to a super radical design which nobody has ever tried in x86 world with an evolution arch revision (znver2)?

Are you sure you can justify the power when going in/out to NB all the time? The same for minimal latency. Can you scale the IO ccMaster, etc.?

Are the benefits worth it? UMA, yields, etc.

The consumer IO module could be integrated on GPU silicon too, with little added area (subtracting the area for a required memory controller and interface, ballpark 40 mm2?) tagged on. This brings the APU to HEDT and premium bleeding edge mobile and PC, while 14/12nm APUs and Pinnacles provide the mainstream volume. Also the igpu+IO module could be co-produced with a 130mm2 die that also is used for a vega 12nm based RX-550 successor. (Or they could go fancy using HMB and 20 CU vega with a ~170mm2 die that would also be be a successor to RX-560. This is less likely.)

hkultala2 · Nov 9, 2018

Spartak said:
8 core won't allow them to outperform Intel on the desktop and is too large / power consuming for mobile. So my bet is on a 6CCX but indeed we'll see once more rumors/news start trickle in.

8 zen2 cores with ~13% ipc increase over zen1 and , ~4.8 GHz turbo clock and ~4.2 GHz base clock would give 9900k a good run for the money, typically losing slightly on single-thread performance but winning slightly on multi-threaded performance.

6-core CCX would increase L3 latency.

amd6502 · Nov 9, 2018

hkultala2 said:
8 zen2 cores with ~13% ipc increase over zen1 and , ~4.8 GHz turbo clock and ~4.2 GHz base clock would give 9900k a good run for the money, typically losing slightly on single-thread performance but winning slightly on multi-threaded performance.

6-core CCX would increase L3 latency.

Once they are ready to significantly optimize for high power (7nm+ ?) or make other significant improvements (zen 2+ or zen 3) I think a consumer monolithic 7nm die to succeed Pinnacles would be worth it. Fabbing availability would be better then too. Or a 7nm quadcore APU that had the ability to drive one of the 8c chiplets (thus being a 3 CCX product in MCM mode).

Also, how expensive would it be to port Zen2 to 12nm? Seems like this node is too economical and important to not update their products on (and the above option could then be done on this instead).

Yotsugi · Nov 9, 2018

amd6502 said:
Also, how expensive would it be to port Zen2 to 12nm?

Why would they ever do that?

moinmoin · Nov 9, 2018

Considering how much Epyc 2's 8-1 layout is made fit the heat layout of Epyc 1 I expect AMD to do the same for Ryzen. So the 2-1 chiplet/IOC configuration is the only way to balance the heat without going monolith. With the chiplets being roughly 3 1/2th part of the Zeppelin die, one each side, leaving the middle available to the IOC. Even at 14nm, with PCIe 4.0 and connections to two chiplets the IOC will take significantly less area than Zeppelin's full server uncore as it now doesn't need to implement any more than the 24 PCIe lanes AM4 supports.

Spartak said:
if the mobile part is already completely different, then why not build the APU up for desktop Ryzen instead of building the MCM down?

Because the APUs currently are AMD's bottom of the barrel offerings that come last every generation. We are still waiting for the 12nm APU, and we are expecting Zen 2 based Ryzen chips in 1H 2019.

Topweasel · Nov 9, 2018

APU's are always last because AMD tries to keep them updated on the latest core tech and GPU tech. So they get finalized a lot later than either of the two on their own.

Which is another benifit of going Chiplet. AMD could cut 6 months of the release cycle of their APU's by going chiplet. They also could have intermediate chips. Like Zen 2 + Vega APU. Before switching to a Zen 2 + Navi and then quickly cycling to a Zen 3 + Navi till they are ready for the next GPU arch.

itsmydamnation · Nov 9, 2018

Vattila said:
Yeah. Although 4-socket is a smaller market, Lisa Su wants to play in high-performance compute. AMD is a participant in government-funded exa-scale research, with systems planned in the not too distant future. 4-socket capability would be another step up in compute density and allow them to compete better in the supercomputer realm.

4S offers no density improvements, infact if your goal is cpu density 4S is worse then the most dense 1S/2S solutions. All 4S is going to give you is very specific workloads (really big DB) and/or specific licensing situations.

Beemster · Nov 9, 2018

kokhua said:
I estimated 250-300 mm^2 without L4. The actual die size suggest maybe there will be an L4 after all. But perhaps not a big one. Say 128-256MB. We can hope.

perhaps eDRAM similar to Centaur technology on Power 9. It could be 256MB or possibly even 512MB on the 14nm process with eDRAM used for the memory hub chip on Power 9. Global owns the process since the IBM Micro/GF merger. Your thoughts?

https://en.wikichip.org/wiki/ibm/centaur

Yotsugi · Nov 10, 2018

Beemster said:
perhaps eDRAM similar to Centaur technology on Power 9. It could be 256MB or possibly even 512MB on the 14nm process with eDRAM used for the memory hub chip on Power 9. Global owns the process since the IBM Micro/GF merger. Your thoughts?

https://en.wikichip.org/wiki/ibm/centaur

Centaur barely has any eDRAM.
Also expensive.

64 core EPYC Rome （Zen2）Architecture Overview？

Senior member

Diamond Member

Lifer

Lifer

Platinum Member

Golden Member

Senior member

Member

Lifer

Senior member

Golden Member

Lifer

Member

Senior member

Member

Lifer

Senior member

Junior Member

Senior member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Member

Golden Member