Speculation: The CCX in Zen 2


How many cores per CCX in 7nm Zen 2?

  • 4 cores per CCX (3 or more CCXs per die)

    Votes: 50 43.9%
  • 6 cores per CCX (2 or more CCXs per die)

    Votes: 45 39.5%
  • 8 cores per CCX (1 or more CCXs per die)

    Votes: 19 16.7%

  • Total voters
    114
Oct 18, 2013
11,177
157
126
4 Zen 1 cores will obliterate 8 Jaguar cores easily, especially in single-thread-limited scenarios. Clock them at their optimum around 2.5 GHz and they are very power efficient. I admit the GPU part would be somewhat of an issue. A Vega 56-like part could be around 450mm2 + 96mm2 for 4 Zen 1 cores = ~550mm2 on 14nm, maybe a bit smaller on 12nm. For sure doable in 2020, when yields should be excellent.
I remember a graph from when Zen was first being tested. I think calling it very efficient is, if anything, an understatement. I remember being shocked at how low the power usage was when it was clocked lower.
 

Vattila

Senior member
Oct 22, 2004
363
58
136
Things have changed slightly here: the poll now shows the 4-core CCX in the lead with 44%. The 6-core option is still at 41% (I chalk that up to stubbornness or loss of interest in this thread), and the 8-core option has gained some proponents lately, now at 15%, presumably with the 8+1 chiplet rumours, speculation and details of Rome becoming public.

It seems that 4-core is now pretty much confirmed, with this tweet from VideoCardz.com:

"2x AMD Eng Sample: 2S1404E2VJUG5_20/14_N (64C 1.4Ghz 800MHz IMC, 64x 512KB L2, 16x 16MB L3)"

https://twitter.com/VideoCardz/status/1065553242518626304?s=19

In other words, there are 16 L3 caches, and as we know, the L3 in Zen acts as a crossbar between cores in a CCX (CPU core complex). 64 / 16 = 4 cores in each CCX.
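Just to make the deduction from the leaked readout explicit, here it is in a couple of lines (a toy sketch; the variable names are mine):

```python
# Figures from the leaked Sandra readout: 64 cores, 64x 512KB L2, 16x 16MB L3.
cores = 64
l2_caches = 64   # one private 512KB L2 per core
l3_slices = 16   # one shared L3 slice per CCX (the L3 is the CCX crossbar)

cores_per_ccx = cores // l3_slices  # 64 / 16 = 4
assert l2_caches == cores           # consistent with a private L2 per core
print(cores_per_ccx, "cores per CCX")  # 4 cores per CCX
```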

Now, some interesting topology questions. Are the cores connected to the IO chiplet through a single 9-port Infinity Fabric router (all cores connect directly to this router for access to IO and external cores)? Or, more likely, does each CCX have its own 5-port IF router, with a third 3-port IF router connecting to the IO chiplet (or a 6-port IF router per CCX in a triangle arrangement)? Or, less likely, just two 5-port IF routers with independent links to the IO chiplet?
 
Last edited:

HurleyBird

Golden Member
Apr 22, 2003
1,670
28
106
It does greatly increase the likelihood of a 4-core CCX, but at the same time, Sandra (that's what this is, going by the readout) doesn't always get every detail right on pre-release hardware.
 

Vattila

Senior member
Oct 22, 2004
363
58
136
Here is a back-of-the-envelope schematic, focusing on the topology of the Infinity Fabric routers. It shows one possibility, the triangle arrangement, for each CPU chiplet. Note that I have rotated the IO chip 90 degrees, compared to the block diagram shown by AMD, just to focus on the hierarchy.

Rome IF topology (speculation).png

Note that a lot of routing complexity is left out, in particular pertaining to memory and IO. Each of the 8 IF routers shown on the IO chiplet has to connect to the SCH (system controller hub), the 4 DCMCs (dual-channel memory controllers), and the PCI-Express ports.

Rather than connecting all 8 IF routers directly to an IF router for each of these, which would require a lot of wiring, I suspect there is some asymmetry, with the left half of the IO chip (4 IF routers) connected to 2 DCMCs and half the PCI-Express lanes. The two halves are then connected through a pair of IF routers with a fat link between them.

Having direct connections from all 8 IF routers to the SCH (and the Secure Processor) seems overkill, since it has less stringent requirements for bandwidth and latency. So I suspect the SCH simply has a single connection to one of the IF routers, with all the others having to route traffic to the SCH through the network between them.

Any thoughts about this will be appreciated.
 

Vattila

Senior member
Oct 22, 2004
363
58
136
Here is one hypothetical topology for hooking up the dual-channel memory controllers. In this case, the L4 (prefetch buffer, cache-coherency directory), if any, can be associated with the memory controller, and used to ascertain whether a memory access is needed or should be read from cache in one of the chiplets, and to update stale cache lines in case of writing to memory shared among the caches.

Note that this is a somewhat non-uniform memory architecture, but uniformity can be achieved, as in the L3, by interleaving memory between controllers (instead of allocating each controller a separate contiguous chunk).

Rome IF topology, memory controllers (speculation).png
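To illustrate the interleaving idea, here is a toy sketch of spreading consecutive cache lines round-robin across the controllers, so that no single controller is uniformly "near" or "far" for any one chiplet (granularity and controller count are my assumptions, not AMD's actual scheme):

```python
# Toy address interleaving across 4 dual-channel memory controllers (DCMCs).
LINE = 64    # interleave granularity: one 64-byte cache line (assumed)
DCMCS = 4    # four DCMCs on the IO die

def dcmc_for(addr: int) -> int:
    """Pick a controller from the cache-line index, round-robin."""
    return (addr // LINE) % DCMCS

# Four consecutive cache lines land on four different controllers:
print([dcmc_for(a) for a in range(0, 4 * LINE, LINE)])  # [0, 1, 2, 3]
```

With contiguous chunks instead, a chiplet hammering one allocation would hit a single (possibly distant) controller; interleaving averages the hop distance over all of them.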
 
Last edited:
Jun 5, 2002
162
7
101
Every IF link adds latency to a memory request, so they want to keep the number of IF links as low as possible. With that kind of topology, every access to memory would be slower than the slowest one in Naples. One direct IF link from chiplet to IO chip is the only sane option; if there are two CCXs in a chiplet, CCX-to-CCX data exchange would still go through the IO chip, as it does with Zen 1. There is no shortcut between CCXs.
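To put rough numbers on that, here is a toy hop count for the two arrangements (the stage names and the one-hop-per-crossing model are my own simplification, not measured latencies):

```python
# Each entry lists the stations a memory request passes through on the way
# from a core's CCX to a memory controller on the IO die (names invented).
paths = {
    # one direct IF link per chiplet:
    "direct link":  ["CCX", "CCD router", "off-die link", "IOD router", "DCMC"],
    # triangle arrangement: per-CCX router plus a shared third router:
    "triangle":     ["CCX", "CCX router", "shared router", "off-die link",
                     "IOD router", "DCMC"],
}
for name, path in paths.items():
    print(f"{name}: {len(path) - 1} hops")  # direct: 4 hops, triangle: 5 hops
```

Every extra router on the path is one more hop on every single memory access, which is the core of the objection above.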
 

Vattila

Senior member
Oct 22, 2004
363
58
136
With that kind of topology every access to memory would be slower than slowest in Naples. One direct if-link from chiplet to IO-chip is only sane one
Which IF routers do you want to eliminate? You need IF endpoints for the chip-to-chip links. You need an IF endpoint for each CCX. And one for each DCMC, unless you have a single IF endpoint for a 4-DCMC complex, but that would be a chokepoint, wouldn't it?

if there's two ccx in chiplet ccx data change would still go through IO-chip. As it does with Zen1, there's no shortcut between CCX.
Yeah. Those CCX-to-CCX links are speculative, and I put them on a separate layer in my drawing. Here is the schematic without those links.

Rome IF topology, memory controllers, without direct CCX links (speculation).png

However, removing the direct CCX-to-CCX links doesn't remove any IF routers/endpoints. Actually, I speculate that each DCMC may need an additional IF endpoint (between the IF router and the DCMC in my drawing), since it is doubtful that the IF endpoint for the DCMC, as I imagine it in AMD's reusable IP library, can act as a 5-port IF router the way I've drawn it.
 
Last edited:
Nov 6, 2018
94
98
66
As you know, not all of those IF links are equal, and it might make more sense to mark them in a different way. CCD-to-IOD links are most likely SerDes and have a much higher latency than any of the other links or crossbar switches. AMD likes to make interfaces (CCM (Cache-Coherent Master), UMC (Unified Memory Controller), CAKE (Coherent AMD socKet Extender), IOMS (I/O Master/Slave), NCM (Non-Coherent Master), CCS (Cache-Coherent Slave), etc.) very clear and hide all the complexity (crossbar switches, topology) inside the SDF (Scalable Data Fabric). There is also the SCF (Scalable Control Fabric), but it would be too confusing to include both in the same diagram.

Anyway, I like the basic idea behind all your diagrams, but you have to know all the above to decipher them correctly. The big question is still what AMD has done to hide most of the latency of those CCD-to-IOD links. One obvious measure is doubling the L3 cache per core, but there must be something else too, like L3 (and L2) shadow tags on the IO die, and maybe even "CCM shadow" interfaces on the IOD.

For the most part, one crossbar switch in each chiplet plus a SerDes interface should be sufficient. There might be something else as well, but there is not much room left. As for the IOD, it will be a traffic hotspot, and while 4 crossbar switches (6 nodes each, as you have drawn) might be enough, maybe there is some other way of doing it, since six nodes on one crossbar switch is a lot; 5 would be more reasonable.

Still very interesting speculation and nice "hand-drawn" diagrams.
 
Nov 6, 2018
94
98
66
I was in a bit of a hurry when I wrote the last post, but I really think there might be some truth to the idea that the IOD is divided into quadrants which are then connected together using low-latency on-die IF links (as shown in Vattila's diagrams). What I meant by crossbar switches is what AMD has shown us for Raven Ridge:



There are 4 crossbar switches, of which one has 5 nodes, two have 4 nodes and one has 3 nodes. I think the arrows to the GPU should be bidirectional, or is there some other way for the GPU to do memory writes? I'm pretty sure it must be an error in the illustration. Anyway, the CCX has the lowest latency, as it's connected to the same switch as the memory controllers. The IO subsystem and Multimedia Hub seem to be the furthest away from the memory controllers. Maybe they could have added a link between the upper-left and lower-right switches, but that would have given the upper-left switch 6 nodes, which is a lot. The internal datapath is 256 bits wide, which should be the same for the Zeppelin SoC; I was under the impression that they have likely doubled that (datapath width) for Rome.
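For what it's worth, that latency ordering can be read off a toy graph model of the slide (switch names and client lists are my simplification, so the per-switch node counts don't match the slide exactly; hops = links crossed):

```python
from collections import deque

# Rough model of the Raven Ridge fabric: four switches, clients attached.
edges = {
    "SW_MEM": {"CCX", "UMC0", "UMC1", "SW_GPU", "SW_IO"},  # CCX shares a switch with the UMCs
    "SW_GPU": {"GPU", "SW_MEM", "SW_MM"},
    "SW_IO":  {"IO", "SW_MEM", "SW_MM"},
    "SW_MM":  {"MMHub", "SW_GPU", "SW_IO"},
}
graph: dict[str, set[str]] = {}
for sw, peers in edges.items():
    for p in peers:
        graph.setdefault(sw, set()).add(p)
        graph.setdefault(p, set()).add(sw)  # links are bidirectional

def hops(a: str, b: str) -> int:
    """Breadth-first search for the minimum number of link crossings."""
    seen, q = {a}, deque([(a, 0)])
    while q:
        node, dist = q.popleft()
        if node == b:
            return dist
        for n in graph[node] - seen:
            seen.add(n)
            q.append((n, dist + 1))
    raise ValueError("unreachable")

print(hops("CCX", "UMC0"))    # 2: CCX and the memory controllers share a switch
print(hops("MMHub", "UMC0"))  # 4: the Multimedia Hub is the furthest away
```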

Do we have a similar diagram of the Zeppelin SoC (with switches)? I haven't found one yet, but maybe we should first try to make one based on all that we know about it. It would be easier to build all the Rome speculation on top of that.

There is still all the IO stuff (PCIe, IFIS, other IO) missing from your diagrams, Vattila, but I too think we should base our topologies on the assumption that there are quadrants (on the IOD) that are more or less identical to each other. I know we shouldn't read too much into that SiSoftware Sandra leak, but it only mentioned 256 MB of L3 and no L4. Let's not assume too much, though; 14HP would still be a new process for AMD, while they have pretty much mastered 14LPP by now.
 
Last edited:

Vattila

Senior member
Oct 22, 2004
363
58
136
What I meant by crossbar switches is what AMD has shown us for Raven Ridge
Yeah. We don't really know how these switches are implemented. Wikipedia's page on multistage interconnection networks has a nice overview of different implementations of routing. For our speculation purposes, I think using the generic "router" from network terminology is appropriate, without getting bogged down in implementation details.

Actually, you could replace the central four IF routers in my diagram with a block called "IF transport layer", similar to the AMD block diagrams. Then everything is connected to this block. That said, I think it is interesting to open up that block to speculate on any non-uniformity in the topology, i.e. whether some cores are closer to a particular memory controller than others, and how AMD will deal with that (e.g. by memory interleaving).
 
Last edited:

Vattila

Senior member
Oct 22, 2004
363
58
136
Here is an update of my Infinity Fabric topology diagram, with IF routing between Dual-Channel Memory Controllers (DCMC), PCI-Express/IF ports, and the System Controller Hub (SCH). I have squeezed the SCH into the heart of the IO chiplet for nice symmetry. I assume the SCH includes the secure processor and all southbridge functionality.

Note that I have now consistently drawn IF endpoints for every IP block (CCX, DCMC, PCI-E, SCH), as well as for inter-chiplet links (the IF router on the CPU chiplets still does dual duty as a router and an external endpoint, but you can imagine splitting that one too into a router and an endpoint).

Also note that I have now included Level 4 cache as part of the DCMC. As discussed before, this may act as a last level cache and a prefetch buffer, as well as a directory for cache-coherency (storing tags from all the caches in the CPU chiplets). Memory interleaving may be used to even out memory access latency and hide the non-uniform distance to memory.

Note that the IF routers at the heart of the IO chiplet may in turn be implemented by a multistage interconnection network to make the number of connections feasible.

If nothing else, I think my diagram illustrates the interconnection complexity for a big modular chip like this, hence why a network-on-chip makes sense, and why Infinity Fabric is such a big deal for scalable system architecture.

Rome IF topology (speculation).png
 
Last edited:

Vattila

Senior member
Oct 22, 2004
363
58
136
Here is yet another version. This one combines my first diagram, based on a central L4 cache split into 8 tightly connected slices (one per chiplet), with the IF network for the DCMC, PCI-E/IF and SCH. In this version, there are fewer IF hops to the L4 cache. The L4 cache is shown in gray, which also neatly indicates that it is speculative.

Rome IF topology, central L4 (speculation).png
 
Last edited:

Vattila

Senior member
Oct 22, 2004
363
58
136
Here is the Ryzen 3000 version of my IF topology schematic.

Ryzen 3000 topology (speculation).png
 
Apr 27, 2000
10,487
343
126
I did not know that stick men could have L4 cache. Well played, sir.

Should be interesting to see how your guesses stack up against what we see in reviews.
 