Speculation: The CCX in Zen 2

realibrad · Nov 5, 2018

beginner99 said:
4 Zen1 cores will obliterate 8 jaguar cores easily especially in single-threaded limited scenarios. clock them at their optimum around 2.5 ghz and they are very power efficient. I admit the gpu part would be somewhat of an issue. Vega 56-like could be around 450mm2 + 96mm2 for 4 zen1 cores = 550mm2 on 14nm. maybe bit smaller on 12nm. For sure doable in 2020 when yields should be excellent.

I remember a graph when zen was fist being tested. I think saying very efficient is if anything an understatement. I remember being shocked at how low the power usage was when it was clocked lower.

Vattila · Nov 22, 2018

Things have changed slightly here: The poll results now show 4-core CCX in the lead with 44%. 6-core is still at 41% (I chalk that down to stubbornness or loss of interest in this thread), and 8-core gained some proponents lately to 15%, presumably with the 8+1 chiplet rumours, speculation and details of Rome becoming public.

It seems that 4-core is now pretty much confirmed, with this tweet from VideoCardz.com:

"2x AMD Eng Sample: 2S1404E2VJUG5_20/14_N (64C 1.4Ghz 800MHz IMC, 64x 512KB L2, 16x 16MB L3)"

https://twitter.com/VideoCardz/status/1065553242518626304?s=19

In other words, there are 16 L3 caches, and as we know the L3 in Zen acts as a crossbar between cores in a CCX (CPU core complex). 64/16 = 4 cores in each CCX.

Now, some interesting topology questions: Are the cores connected to the IO chiplet through a 9-port Infinity Fabric router (all cores connect directly to this router for access to IO and external cores) or, more likely, does each CCX have its own 5-port IF router, with a third 3-port IF router connecting to the IO chiplet (or 6-port IF per CXX in a triangle arrangement), or less likely, just two 5-port IF routers with independent links to the IO chiplet?

HurleyBird · Nov 22, 2018

It does greatly increase the likelihood of a 4-core CCX, but at the same time Sandra (that's what this is going by the readout) doesn't always get every detail right on pre-release hardware.

Vattila · Nov 22, 2018

Here is a back-of-the-envelope schematic, focusing on the topology of the Infinity Fabric routers. It shows one possibility, the triangle arrangement, for each CPU chiplet. Note that I have rotated the IO chip 90 degrees, compared to the block diagram shown by AMD, just to focus on the hierarchy.

Note that there is a lot of routing complexity left out, in particular pertaining to memory and IO. Each of the 8 IF routers shown on IO chiplet has to connect to the SCH (system controller hub), the 4 DCMCs (dual-channel memory controllers), and the PCI-Express ports.

Contrary to connecting all 8 IF routers directly to an IF router for each of these, which would require a lot of wiring, I suspect there is some asymmetry, with the left half of the IO chip (4 IF routers) connected to 2 DCMCs and half the PCI-Express lanes. Then each half is connected through a pair of IF routers with a fat link between them.

Having direct connections from all the 8 IF routers to the SCH (and the Secure Processor) seems overkill, since there it has less stringent requirements for bandwidth and latency. So I suspect this simply has a single connection to one of the IF routers, with all the others having to be route traffic to the SCH through the network between them.

Any thoughts about this will be appreciated.

Vattila · Nov 22, 2018

Here is one hypothetical topology for hooking up the dual-channel memory controllers. In this case, the L4 (prefetch buffer, cache-coherency directory), if any, can be associated with the memory controller, and used to ascertain whether a memory access is needed or should be read from cache in one of the chiplets, and to update stale cache lines in case of writing to memory shared among the caches.

Note that this is a somewhat non-uniform memory architecture, but uniformity can be achieved, like in the L3, by interleaving memory between controllers (instead of allocating separate chunks).

Rome IF topology, memory controllers (speculation).png

naukkis · Nov 22, 2018

Every if link adds latency to memory request, so they want to keep if-links as low as possible. With that kind of topology every access to memory would be slower than slowest in Naples. One direct if-link from chiplet to IO-chip is only sane one, if there's two ccx in chiplet ccx data change would still go through IO-chip. As it does with Zen1, there's no shortcut between CCX.

Vattila · Nov 22, 2018

naukkis said:
With that kind of topology every access to memory would be slower than slowest in Naples. One direct if-link from chiplet to IO-chip is only sane one

Which IF routers do you want to eliminate? You need IF endpoints for the chip-to-chip links. You need an IF endpoint for each CCX. And for each DCMC — unless you have a single IF endpoint for a 4-DCMC complex, but that would be a chokepoint wouldn't it?

naukkis said:
if there's two ccx in chiplet ccx data change would still go through IO-chip. As it does with Zen1, there's no shortcut between CCX.

Yeah. Those CCX-to-CCX links are speculative, and I put them on a separate layer in my drawing. Here is the schematic without those links.

Rome IF topology, memory controllers, without direct CCX links (speculation).png

However, removing the direct CCX-to-CCX links doesn't remove any IF routers/endpoints. Actually, I speculate that each DCMC may need an additional IF endpoint (in between the IF router and the DCMC in my drawing), since it is doubtful that the IF endpoint for the DCMC, as I imagine it in AMD's reusable IP library, can act as a 5-port IF router the way I've drawn it.

Zapetu · Nov 23, 2018

As you know, not all of those IF-links are equal and it might make more sense to mark them in a different way. CCD-to-IOD links are most likely SerDes and have a much higher latency than any of the other links or crossbar switches. AMD likes to make interfaces (CCM (Cache-Coherent Master), UMC (Unified Memory Controller), CAKE (Coherent AMD socKet Extender), IOMS (I/O Master/Slave), NCM (Non-Coherent Master),, CCS (Cache-Coherent Slave), etc.) very clear and hide all the complexity (crossbar switches, topology) inside SDFs (Scalable Data Fabric). There's also SCF (Scalable Control Fabric) but it would be too confusing to add them both in the same diagram.

Anyway, I like the basic idea behind all your diagrams but you have to know all the above to really decipher them correctly. Still the big question is what AMD has done to hide most of the latency of those CCD-to-IOD links. One obvious one is doubling the L3 cache / core but there still must be something else too like L3 (and L2) shadow tags on IO die and maybe even "CCM shadow" interfaces on IOD.

But mostly one crossbar switch in each chiplet and a SerDes interface would be sufficient. There might be something else also but there's not too much room left. As for the IOD it will be a traffic hotspot and while 4 crossbar switches (6 nodes each, as you have drawn) might be enough maybe there would be some other way of doing this since six nodes for one crossbar switch is a lot. 5 should be more okay.

Still very interesting speculation and nice "hand-drawn" diagrams.

Zapetu · Nov 24, 2018

I was in a bit of a hurry when I wrote the last post but I really think that there might be some truth to that IOD is divided in quadrants which are then connected together using some low latency on-die IF-links (as shown in Vattila's diagrams). What I meant by crossbar switches is what AMD has shown us for Raven Ridge:

There are 4 crossbar switches of which one has 5 nodes, two 4 nodes and one 3 nodes. I think that the arrows to GPU should be bidirectional or is there some other way for GPU to do memory wites? I'm pretty sure that it must be an error in the illustration. Anyway, the CCX has the lowest latency as it's connected to the same switch as the memory controllers. IO Sub system and Multimedia Hub seems to be the furthest away from the memory controllers. Maybe they could have added one link between the left upper corner and the right lower corner swithes but that would have made the left upper corner switch to have 6 nodes which is a lot. Internal datapath is 256-bits wide which should be the same for the Zeppelin SoC. I was under the impression that they likely have doubled that (datapath width) for Rome.

Do we have a similar diagram of the Zeppelin SoC (with switches)? I haven't found one yet but maybe we should first try to make one based on all that we know about it. It would be easier to build all the Rome speculation on top of that.

There are still all the IO stuff (PCIe, IFIS, other IO) that is missing from your diagrams, Vattila, but I too think that we should base our topologies on the assumption that there are quadrants (on IOD) that are more or less identical to each other. I know that we should not read too much into that SiSoftware Sandra leak but it only mentioned 256MB of L3 and no L4. Still let's not assume too much but 14HP would still be a new process for AMD while they pretty much master 14LPP by now.

Vattila · Nov 24, 2018

Zapetu said:
What I meant by crossbar switches is what AMD has shown us for Raven Ridge

Yeah. We don't really know how these switches are implemented. Wikipedia's page on Multistage interconnection networks has a nice overview of different implementations of routing. For our speculation purposes I think using the generic "router" from network terminology is appropriate without getting bogged down in implementation details.

Actually, you could replace the central four IF routers in my diagram by a block called "IF transport layer", similar to the AMD block diagrams. Then everything is connected to this block. That said, I think it is interesting to open that block to speculate on any non-uniformity in the topology, i.e. whether some cores are closer to a particular memory controller than others, and how AMD will deal with that (e.g. by memory interleaving).

Vattila · Dec 3, 2018

Here is an update of my Infinity Fabric topology diagram, with IF routing between Dual-Channel Memory Controllers (DCMC), PCI-Express/IF ports, and the System Controller Hub (SCH). I have squeezed the SCH into the heart of the IO chiplet for nice symmetry. I assume the SCH includes the secure processor and all southbridge functionality.

Note that I have now consistently drawn IF endpoints for every IP block (CCX, DCMC, PCI-E, SCH), as well as for inter-chiplet links (although the IF router on the CPU chiplets still has dual-duty as a router and an external endpoint — but you can imagine splitting that one too into a router and an endpoint).

Also note that I have now included Level 4 cache as part of the DCMC. As discussed before, this may act as a last level cache and a prefetch buffer, as well as a directory for cache-coherency (storing tags from all the caches in the CPU chiplets). Memory interleaving may be used to even out memory access latency and hide the non-uniform distance to memory.

Note that the IF routers at the heart of the IO chiplet may in turn be implemented by a multistage interconnection network to make the number of connections feasible.

If nothing else, I think my diagram illustrates the interconnection complexity for a big modular chip like this, hence why a network-on-chip makes sense, and why Infinity Fabric is such a big deal for scalable system architecture.

Vattila · Dec 3, 2018

Here is yet another version. This one combines my first diagram, based on a central L4 cache split into 8 tightly connected slices (one per chiplet), with the IF network for the DCMC, PCI-E/IF and SCH. In this version, there are fewer IF hops to the L4 cache. The L4 cache is shown in gray, which also neatly indicates that it is speculative.

Rome IF topology, central L4 (speculation).png

Vattila · Dec 9, 2018

Here is the Ryzen 3000 version of my IF topology schematic.

PotatoWithEarsOnSide · Dec 9, 2018

That's better than any stick man that I could draw.

maddie · Dec 9, 2018

PotatoWithEarsOnSide said:
That's better than any stick man that I could draw.

Those L4s especially.

DrMrLordX · Dec 9, 2018

I did not know that stick men could have L4 cache. Well played, sir.

Should be interesting to see how your guesses stack up against what we see in reviews.

Vattila · Jun 11, 2019

With the technical information revealed at E3, we now have confirmation that the CCX in Zen 2 is still 4-core. With the big performance gains with Zen 2, it seems the performance fears with this modular configuration have been mostly unfounded, and that an optimised 4-core complex is indeed a balanced building block.

David_McAfee-Next_Horizon_Gaming-3rd_Gen_Ryzen_06092019-page-008%20-%20Copy2.jpg

thigobr · Jun 11, 2019

And now we know chiplets only talk through the I/O die

Soulkeeper · Jun 12, 2019

So the L4 slices are on the chiplets and run at cpu clock ?
Is the IF still tied to the mem speed ?

maddie · Jun 12, 2019

Soulkeeper said:
So the L4 slices are on the chiplets and run at cpu clock ?
Is the IF still tied to the mem speed ?

Yes, but now you can adjust the ratio to some extent.

https://images.anandtech.com/doci/1...g-Ryzen_Deep_Dive_06092019-page-017_575px.jpg

TravisK_DonW-Next_Horizon_Gaming-Ryzen_Deep_Dive_06092019-page-017.jpg

Ajay · Jun 12, 2019

Soulkeeper said:
So the L4 slices are on the chiplets and run at cpu clock ?
Is the IF still tied to the mem speed ?

There is no L4$

amrnuke · Jun 12, 2019

A few stupid questions:

First, what would the benefit of an L4$ be, when the L3$ is 64MB for 12C and 16C, which is massive already (Intel's top chips have 2MB/core - Zen 2 is 4-6MB/core)? Do we know that additional cache beyond the 70+MB total for these higher end chips (or even more than 32MB for the 8C/16T) would help? I know larger L3$ on Intel chips did provide a performance boost, mild, but is there a tipping point for performance, or a limit beyond which L3$ size loses benefit? I only ask because I hear L3$ is expensive, not just cash wise, but also for space.

Second, assuming an additional cache layer with L4$ would be beneficial, why wouldn't it be implemented into the I/O chiplet, and make the I/O chiplet just a little bigger? It seems there is plenty of space for expansion of the I/O.

Third, based on the routing layout we saw, it looks like the power delivery system is already a sizeable portion of the chip, meaning it seems with this layout, 16 cores seems to be the limit. If the power delivery areas were increased 50% to accommodate another chiplet to do 24 cores, it seems there wouldn't be enough room to squeeze the extra chiplet in.

Finally, and I'm sure this has a good reason, if you look at the length of the IF connects between chiplet and I/O, wouldn't it make sense to stretch the I/O die horizontally and make the IF links straight, move the I/O chiplet closer to the CCX (see here for routing map) so that the IF links would be as short as possible between chiplet and I/O? And if it's true that the dark purple at the right of the I/O is for peripherals, wouldn't it make sense to move that to the south side of the I/O chiplet and move the DDR4 memory links to the right side? Again, not a CPU engineer, but in my rudimentary mind, shorter lines may save a nanosecond here or there.

Ajay · Jun 12, 2019

amrnuke said:
A few stupid questions:

First, what would the benefit of an L4$ be, when the L3$ is 64MB for 12C and 16C, which is massive already (Intel's top chips have 2MB/core - Zen 2 is 4-6MB/core)? Do we know that additional cache beyond the 70+MB total for these higher end chips (or even more than 32MB for the 8C/16T) would help? I know larger L3$ on Intel chips did provide a performance boost, mild, but is there a tipping point for performance, or a limit beyond which L3$ size loses benefit? I only ask because I hear L3$ is expensive, not just cash wise, but also for space.

L3$ is expected to grow in the future, but larger caches tend to have higher latency. At some point a larger cache wouldn’t make sense, but designers reach their max xtor budgets well before that. Oh, and L3$ is per CCX, not per core.

Third, based on the routing layout we saw, it looks like the power delivery system is already a sizeable portion of the chip, meaning it seems with this layout, 16 cores seems to be the limit. If the power delivery areas were increased 50% to accommodate another chiplet to do 24 cores, it seems there wouldn't be enough room to squeeze the extra chiplet in.

I can’t see anything more than 16 cores coming to AM4. Power delivery and signal interference would seem to become overwhelming problems. We shall see though.

amrnuke · Jun 12, 2019

Ajay said:
L3$ is expected to grow in the future, but larger caches tend to have higher latency. At some point a larger cache wouldn’t make sense, but designers reach their max xtor budgets well before that. Oh, and L3$ is per CCX, not per core.

This brings up another question. Each chiplet has two 4-core CCX, each 4-core CCX has access to the L3$ on that CCX, but will it have access to the L3$ on the adjacent CCX on the same chiplet? Even if not, for each CCX, there is 16MB L3$, which is a substantial increase over anything we've seen on mainstream consumer CPUs before.

moinmoin · Jun 12, 2019

amrnuke said:
A few stupid questions:

First, what would the benefit of an L4$ be, when the L3$ is 64MB for 12C and 16C, which is massive already (Intel's top chips have 2MB/core - Zen 2 is 4-6MB/core)? Do we know that additional cache beyond the 70+MB total for these higher end chips (or even more than 32MB for the 8C/16T) would help? I know larger L3$ on Intel chips did provide a performance boost, mild, but is there a tipping point for performance, or a limit beyond which L3$ size loses benefit? I only ask because I hear L3$ is expensive, not just cash wise, but also for space.

Second, assuming an additional cache layer with L4$ would be beneficial, why wouldn't it be implemented into the I/O chiplet, and make the I/O chiplet just a little bigger? It seems there is plenty of space for expansion of the I/O.

A "L4$" would only make sense when implemented into the IOC as that would allow saving round trips otherwise necessary to reach other CCX's and as such open up the bandwidth of those IOC to chiplet IF-links for other uses.

amrnuke said:
This brings up another question. Each chiplet has two 4-core CCX, each 4-core CCX has access to the L3$ on that CCX, but will it have access to the L3$ on the adjacent CCX on the same chiplet? Even if not, for each CCX, there is 16MB L3$, which is a substantial increase over anything we've seen on mainstream consumer CPUs before.

Technically it's not 16MB per CCX but 4MB per core. Every core has write access to its own 4MB L3$. Every core has read access to all L3$ of the whole chip. Obviously read access to the 12MB L3$ of the other cores within the same CCX is faster than the L3$ of all the cores on other CCX's.

I believe that even for the two CCX's on the same chiplet the access between them causes a round trip through the IOC. But that will have to be tested once the chips are publicly available.

Speculation: The CCX in Zen 2

How many cores per CCX in 7nm Zen 2?

4 cores per CCX (3 or more CCXs per die)

6 cores per CCX (2 or more CCXs per die)

8 cores per CCX (1 or more CCXs per die)

Lifer

Senior member

Platinum Member

Senior member

Senior member

Golden Member

Senior member

Member

Member

Senior member

Senior member

Senior member

Senior member

Senior member

Diamond Member

Lifer

Senior member

Senior member

Diamond Member

Diamond Member

Lifer

Golden Member

Lifer

Golden Member

Diamond Member