Speculation: The CCX in Zen 2

maddie · Aug 8, 2018

LightningZ71 said:
Increasing the number of basic units for EPYC would require additional IFOP connections on the chips and more reservation spots/switching targets in the internal IF uncore of the chips themselves. If you're going through all of that, it's likely going to be no more complex to add two more CCXs to the existing floorplan. While that will definitely up the transistor count, it shouldn't increase the die area beyond that of the existing 14/12nm chips.

Interestingly, though, if they wanted to, they could keep roughly the same basic layout of the individual chips at 7nm, but move the DRAM and IO controllers off the chip and onto a dedicated I/O chip, leaving the rest to be essentially CCX and IF chips on the same EMIB/MCM package. So, have 5 chips on one package, four with IF links to the 5th, and the 5th handling all the I/O between the package and the rest of the system. This way, they can change out DRAM controllers, PCI controllers, etc. without having to redo the whole chip, or update the package for different applications in isolation from the cores. Having an EMIB/MCM package can allow them to run the IF links between the chips at similar speeds to what they do internally in the chips today. Consumer chips could be a mix of 2 to 4 7nm chiplets and an I/O chiplet, and maybe contain an iGPU chiplet as well on dual CPU chiplet packages. At 7nm, but maintaining the existing AM4 socket, they'll have plenty of package size to play with for things like that. It would even be possible to integrate an HBM package in there as well. On a desktop product, cooling a package with two CPU chiplets, an iGPU chiplet, an HBM stack and an I/O chiplet would not be unreasonable. With low enough voltage and frequency targets, it could even work on mobile. Intel is already there with KL-G. Their pricing on the product is indicative of its uniqueness in the market and not entirely a product of cost of production.

Would AMD, when they started designing Zen 2 several years ago, have been willing to take such a radical departure? Remember, every $ was scarce. I find it difficult to see the basic unit not being a complete standalone CPU.

In their papers on interposer-connected, composite CPUs, one of the points stressed was the ability to migrate early to an advanced node even if yields were comparatively poor, and to use innovative topologies to connect the dies into a high core count CPU. That was the focus of the research: how to overcome the problems of early node fabrication. Other benefits were better binning options, etc.

AdoredTV did a recent video but this topic has been discussed here a long time ago using the same PDFs mentioned in his video.

They will have the greatest advantage now as they migrate to 7nm. Seeing as this move was planned years ago, that is what pushes me towards the expectations I have.

Can an organic package accommodate the required connections for an 8-chiplet CPU? Nope. It seems a silicon interposer (SI) is needed for all the chiplets.

Jackie60 · Aug 8, 2018

I’m going to go out on a limb here and say neither four nor six but five.

maddie · Aug 8, 2018

jpiniero said:
Yield is going to suck, yes. But that's where Ryzen and Threadripper come in. I mean you won't see 16 core Ryzen next year, and they could even do what Intel did and introduce a 12 core r9 and keep the core counts of r7 and r5 similar.

They have 2 consumer lines now, Ryzen and TR. With TR2 it seems AMD has decided to market it into lower core count models that can be used for gaming also and higher core count ones for pure work related problems. Namely, X and WX models.

Keep Ryzen on 8 core maximum and push for CPU speed on the GloFlo 7nm process. Between IPC improvements, increased clocks, and a lower cost 100 mm^2 die, they can squeeze i9 from both directions with R7 3xxx and TR 3xxx. This so reminds me of a chess game. The available moves prevent any fantasy scenario from occurring and, barring an act of utter stupidity, we see the inevitable conclusion. This 10 nm Intel fiasco is so much worse than many realize.

french toast · Aug 8, 2018

I think tuna has it spot on, 6 core CCX.
12 core die, then the Picasso successor has one 6 core CCX.
Assuming 50% more cores/cache and slightly wider cores with 2 x 256 bit SIMD units, what is the die size consensus?
They will want the die to be smaller than Summit Ridge, no?
I don't think consoles will affect this.

They could have two 10 core SKUs...<60w 329$ part and a <80w 399$ part...save the full fat 12 core part for the 499$ price bracket.

William Gaatjes · Aug 8, 2018

For EPYC, so many cores makes sense. But for the desktop, I think that the jump in cores is a bit too fast, too often.
I agree that, for now, a 4 core CCX is a much better fit for Zen 2.
I expect improvements in IPC, improvements in the communication between the memory controllers and the cache/cores, and wider paths and SIMD units.
Zen 3 will be 8 core CCX.

fibonacc · Aug 10, 2018

A 6 core CCX is unlikely, 64 is not a multiple of 6

64 core is happening with Rome, sadly can't tell more.
AMD received feedback from multiple sources for the first Epyc that the number of cores in a CCX should be increased to allow certain kinds of server apps to run better. So my bet is on 8 cores.
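
As a quick sanity check on the divisibility argument, here is a minimal sketch. It assumes the current layout of 2 CCXs per die and 4 dies per package (the 32-core EPYC arrangement discussed in this thread); the function names and the layout defaults are illustrative only, not anything confirmed for Zen 2.

```python
def cores_per_package(cores_per_ccx, ccx_per_die=2, dies_per_package=4):
    """Total cores in a package, assuming every CCX is fully enabled."""
    return cores_per_ccx * ccx_per_die * dies_per_package


def divides_evenly(total_cores, cores_per_ccx):
    """True if total_cores can be built from whole CCXs of this size."""
    return total_cores % cores_per_ccx == 0


# Compare the three CCX sizes from the poll against a 64-core target.
for size in (4, 6, 8):
    print(size, cores_per_package(size), divides_evenly(64, size))
```

Under these assumptions, only the 8-core CCX reaches 64 cores without changing the number of CCXs per die or dies per package; a 4-core CCX divides 64 evenly but needs twice as many CCXs.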

Vattila · Aug 10, 2018

Why does my poll still have 6-core in the lead? Change your votes!

Vattila · Aug 10, 2018

fibonacc said:
So my bet is on 8 cores.

My bet is on 4 x 4-core CCXs with a more sophisticated topology, interconnect and coherency protocol between the CCXs to bring down average latency between any two cores in the 16-core CCX cluster.

In my hypothetical design, I speculate that it will be implemented on a 28nm active interposer that houses all the uncore-logic with 4 tiny 7nm CCXs mounted on top. 4 of these interposers in the package gives you 64-core EPYC.

Sounds sweet to me.

french toast · Aug 10, 2018

fibonacc said:
A 6 core CCX is unlikely, 64 is not a multiple of 6

64 core is happening with Rome, sadly can't tell more.
AMD received feedback from multiple sources for the first Epyc that the number of cores in a CCX should be increased to allow certain kinds of server apps to run better. So my bet is on 8 cores.

Are you some kind of insider (or faker)...or is this second hand knowledge from a source of yours?

JoeRambo · Aug 10, 2018

fibonacc said:
AMD received feedback from multiple sources for the first Epyc that the number of cores in a CCX should be increased to allow certain kinds of server apps to run better. So my bet is on 8 cores.

It would make a lot of sense to go with 8 cores; that opens quite a few avenues in server computing. And let's not forget another weakness of the current CCX: the 8MB L3 "domain", which is hard on memory-heavy workloads. Rumour has it that AMD is increasing L3 to 4MB per core.

8C CCX with 32MB of L3 -> that is a dream setup both for desktop and servers! We can now fit 4 instances of our app on 2x20C Intel; I think at least 6-7 per 64C Epyc would be possible, some epic progress for sure.

Vattila · Aug 10, 2018

JoeRambo said:
It would make a lot of sense to go with 8 cores

If they go with a chiplet design as rumoured, then an 8-core CCX chiplet will be larger, yield worse and be more costly than a 4-core CCX chiplet. And it would be less reusable in the consumer space.

My bet is on 4 x 4-core CCXs with a more sophisticated topology, interconnect and coherency protocol between the CCXs to bring down average latency between any two cores in the 16-core CCX cluster.

A 4-core CCX chiplet would be tiny on 7nm (25-50 mm²), and hence reduce cost and increase yield and volume on the new and expensive 7nm processes.
A relatively small 200 mm² active interposer on the perfected 28nm process would be dirt cheap.
A 200 mm² die (the interposer with 4 chiplets on top) would fit into the current packaging scheme with few changes: 4 interposers for EPYC and high-core-count Threadripper WX, 2 interposers for low-core-count Threadripper X, and 1 interposer for mainstream Ryzen.
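
To illustrate the yield side of this argument, here is a minimal sketch using the simple Poisson yield model, Y = exp(-A * D0). The defect density D0 is a made-up placeholder, not a real figure for any 7nm process, and the 100 mm² figure for an 8-core chiplet simply assumes twice the 50 mm² upper bound quoted above.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of good dies under the simple Poisson model Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return math.exp(-area_cm2 * defects_per_cm2)


D0 = 0.5  # hypothetical defects per cm^2 for an immature process

# Die sizes: 50 mm^2 upper bound from the post, an assumed 100 mm^2
# for an 8-core chiplet, and the 213 mm^2 "Zeppelin" die for reference.
for label, area in (("4-core chiplet", 50),
                    ("8-core chiplet", 100),
                    ("Zeppelin-sized die", 213)):
    print(f"{label:>18}: {area:4d} mm^2 -> {poisson_yield(area, D0):.1%} yield")
```

The absolute numbers depend entirely on the assumed D0; the only point is that yield falls off exponentially with die area, so smaller chiplets lose less silicon to defects.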

What's not to like?

https://forums.anandtech.com/threads/speculation-the-ccx-in-zen-2.2513648/page-7#post-39528340

french toast · Aug 10, 2018

I can't believe that we would see 32MB L3 on desktop with one 8 core CCX; would be good though!
No one seems to be accounting for the increased transistors required for wider cores.

Vattila · Aug 10, 2018

french toast said:
No one seems to be accounting for the increased transistors required for wider cores.

Not true, I am.

That's why I quote 25-50 mm² for the 7nm 4-core CCX chiplet: 25 is a little bit bigger than a straight shrink, and 50 is the upper limit for four of them to fit on a ~200 mm² interposer, which is my estimate based on the current size of "Zeppelin" (213 mm²).

Abwx · Aug 10, 2018

8C CCX is to be expected, as 6C CCX does not make any sense. It would mean that they'll go from 32C to 48C MCM while Intel would go from 28C to 56C at the same time; it's unlikely that AMD is going to abandon the core count advantage they currently hold.

Vattila · Aug 10, 2018

Abwx said:
8C CCX is to be expected

Why not 4 x 4-core CCX chiplets on an active interposer? See my earlier posts.

french toast · Aug 10, 2018

So the consensus on here is we are probably getting 16 cores, just differences in topology.

I am sticking with 6 core ccx.

JoeRambo · Aug 10, 2018

french toast said:
No one seems to be accounting for the increased transistors required for wider cores.

If we agree that Intel has "wide" cores, to add 4MB of L3 and 2 cores Intel used 25mm^2 on 14nm++ ( so probably actual increase is even less, as other cores also have grown due to relaxed process? ). And the resulting chip already has 6 cores and 12MB on 150mm^2 PLUS GPU.

So AMD is using 7nm that is touted as having big advance in density, building a chip without GPU and still has trouble with sizing it?

Vattila said:
That's why I quote 25-50 mm² for the 7nm 4-core CCX — 25 is a little bit bigger than a straight shrink, and 50 is the upper limit to fit it on a ~200 mm² interposer, which is my estimate based on the current size of "Zeppelin" (213 mm²).

In my opinion this is a case of hammer and nail syndrome: convince Yourself that AMD is using chiplets and vegan tear sauce, and then You have to fit those "chiplets" on 200mm^2 sized interposers and have to invent 4x4.

All that when rumours are talking about 64C with 256M l3.

french toast · Aug 10, 2018

JoeRambo said:
If we agree that Intel has "wide" cores, to add 4MB of L3 and 2 cores Intel used 25mm^2 on 14nm++ ( so probably actual increase is even less, as other cores also have grown due to relaxed process? ). And the resulting chip already has 6 cores and 12MB on 150mm^2 PLUS GPU.

So AMD is using 7nm that is touted as having big advance in density, building a chip without GPU and still has trouble with sizing it?

In my opinion this is a case of hammer and nail syndrome: convince Yourself that AMD is using chiplets and vegan tear sauce, and then You have to fit those "chiplets" on 200mm^2 sized interposers and have to invent 4x4.

All that when rumours are talking about 64C with 256M l3.

I'm talking about specifically 16 core die..whether that be 2x8 or 4x4 (unlikely imo)..would be surprised if transistors increased by 50% per core (incl cache)..double the cores and that seems too big for early 7nm imo.
I think they would want a smaller die than 213mm2 for 7nm.
I'm going for 12 core ryzen 3xx, 36/48 core TR3, 64 core Epyc 2 Rome...with Rome using a different die with 8 core CCX, larger caches, SMT3/4.

moinmoin · Aug 10, 2018

I can't see how increasing the number of cores in a CCX would be any progress for AMD. We already had the discussion about how a 4 core CCX is the most Zen-like and how adding any more cores increases routing complexity tenfold. At this point this connection complexity is something for IF to handle (e.g. through more CCXs on one die, potentially adding an L4$ etc.), not for a new CCX design.

Instead for the Zen 2 CCX design I'd expect AMD to make it wider, ideally without actually making every actual core wider. Taking reverse notes from the Bulldozer school of designs (I know I know) the new CCX design could partly combine the frontend of 2 cores each to effectively allow for a SP and DP mode. As a result the latter mode could implement SMT4 and AVX512 (combining Octa-issue 128-bit FPU) without losing the current efficiency of the former, with the advantage that the power and resource use of wider features is more predictable than Intel's current approach.

(One could spin this further and make such a new 4 core DP CCX the new default, effectively a 8 core SP CCX and simplify the resulting layout to not increase routing complexity over the current 4 core SP CCX. But this likely would decrease the efficiency of the SP cores.)

Trumpstyle · Aug 10, 2018

french toast said:
So the consensus on here is we are probably getting 16 cores, just differences in topology.

I am sticking with 6 core ccx.

I'm sticking with 6 core ccx for desktop and 8 core ccx for servers. We got very strong rumors pointing towards this.

Gideon · Aug 10, 2018

JoeRambo said:
In my opinion this is a case of hammer and nail syndrome: convince Yourself that AMD is using chiplets and vegan tear sauce, and then You have to fit those "chiplets" on 200mm^2 sized interposers and have to invent 4x4.

All that when rumours are talking about 64C with 256M l3.

I also find it very hard to believe AMD is jumping to an active interposer already with their first iteration of Zen 2. The complexity and risk are IMO way too much to be worth it. What if EPYC 2 were totally ready by Q1, but because of some issues and complexity with their very first active interposer it needed multiple respins and was postponed 6+ months?

That said, repeating the 8-core CCX mantra over and over also seems like hammer and nail syndrome. The extra connections needed between each and every core within a CCX mean that they would have to opt for some exotic topology within the CCX. It would make much more sense to improve communication between CCXs.

JoeRambo said:
If we agree that Intel has "wide" cores, to add 4MB of L3 and 2 cores Intel used 25mm^2 on 14nm++ ( so probably actual increase is even less, as other cores also have grown due to relaxed process? ). And the resulting chip already has 6 cores and 12MB on 150mm^2 PLUS GPU.

So AMD is using 7nm that is touted as having big advance in density, building a chip without GPU and still has trouble with sizing it?

And you are vastly underestimating the benefits of having a smaller die. From this paper:

Not only would yield improve, but average clock speeds would also noticeably improve on the same node (the smaller the chiplet, the higher the clocks). It's true that you would hit diminishing returns somewhere around 8-4 cores and face communication overhead, but that still doesn't rule out 2x 4-core CCX vs 1x 8-core CCX. Personally I just find an 8 core CCX really hard to believe.

Vattila · Aug 10, 2018

JoeRambo said:
In my opinion this is a case of hammer and nail syndrome: convince Yourself that AMD is using chiplets and vegan tear sauce, and then You have to fit those "chiplets" on 200mm^2 sized interposers and have to invent 4x4.

Ah. So you are dismissing the latest buzz around the rumour of a chiplet and interposer design for 64-core EPYC. Fair enough. I would like to as well, as it confused me until I came up with this latest 4 x 4-core CCX chiplet hypothesis. I used to have a much simpler vision about Zen 2 and Zen 3 beforehand ("Zeppelin" replacement with 3 CCXs for 48-core EPYC 2, 4 CCXs for 64-core EPYC 3).

What interconnect topology do you think your preferred 8-core CCX would use? Would it build on a 4-core optimal direct-connect (in which case it would be some kind of super-CCX)? Or a flattened topology, such as ringbus or mesh, with a more uniform latency, albeit worse than direct-connect? Or a more sophisticated topology ("butter doughnut", etc.)?

My hunch is that AMD will build on direct-connect, and unfortunately that does not scale beyond 4 cores. So the way I see it, the topology will be a complex hybrid building on optimally connected quad-cores (CCXs). I discuss this in the OP of this thread.
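
To put rough numbers on why direct-connect gets expensive quickly, here is a minimal sketch. It assumes "direct-connect" means a dedicated point-to-point link between every pair of cores (a full mesh), which is one reading of the argument and not a confirmed description of the actual CCX wiring; direct_links is a hypothetical helper.

```python
def direct_links(cores):
    """Point-to-point links needed so every pair of cores is directly connected."""
    return cores * (cores - 1) // 2


# Link counts for the CCX sizes in the poll, plus a flat 16-core cluster.
for n in (4, 6, 8, 16):
    print(f"{n:2d} cores -> {direct_links(n):3d} links")
```

The count grows quadratically: going from 4 to 8 cores more than quadruples the number of links under this assumption, which is the scaling problem a hybrid or hierarchical topology tries to avoid.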

An interesting note is that, if AMD goes with interposer and chiplets, they will have a lot of metal layers to play with for the interconnect — layers in the interposer die, plus layers in the chiplet mounted on top.

JoeRambo · Aug 10, 2018

Gideon said:
And you are vastly underestimating the benefits of having a smaller die. From this paper:

The benefits have been known since the dawn of silicon processing. But why not extend Your train of thought and build even smaller dies: 2 cores? 1 core? 1 "Atom-like" core? Where do we stop?

8C CCX has all the benefits, while sticking to decent manufacturability ( compared to 28C monsters ).

Trumpstyle · Aug 10, 2018

Vattila said:
If they go with a chiplet design as rumoured, then an 8-core CCX chiplet will be larger, yield worse and be more costly than a 4-core CCX chiplet. And it would be less reusable in the consumer space.

My bet is on 4 x 4-core CCXs with a more sophisticated topology, interconnect and coherency protocol between the CCXs to bring down average latency between any two cores in the 16-core CCX cluster.

A 4-core CCX chiplet would be tiny on 7nm (25-50 mm²), and hence reduce cost and increase yield and volume on the new and expensive 7nm processes.

A relatively small 200 mm² active interposer on the perfected 28nm process would be dirt cheap.

A 200 mm² die (the interposer with 4 chiplets on top) would fit into the current packaging scheme with few changes: 4 interposers for EPYC and high-core-count Threadripper WX, 2 interposers for low-core-count Threadripper X, and 1 interposer for mainstream Ryzen.

What's not to like?

https://forums.anandtech.com/threads/speculation-the-ccx-in-zen-2.2513648/page-7#post-39528340

Stuff just doesn't scale perfectly; a 4-core CCX would not be 50mm2 on 7nm but more likely 125mm2, while a 6-core CCX would be 150mm2. This is because the CPU cores scale well, but there is other stuff on the chip that doesn't scale well at all. So I put the odds of us seeing some kind of 4-core CCX at 0%.
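
Taking the two estimates above at face value, a minimal sketch of what they imply if area is modelled as a fixed, poorly scaling part plus a per-core part. The linear model and the helper names are assumptions made for illustration; the 125 mm² and 150 mm² inputs are just the guesses from this post, not measured data.

```python
def fit_linear(p1, p2):
    """Fit area = fixed + per_core * cores through two (cores, area_mm2) points."""
    (n1, a1), (n2, a2) = p1, p2
    per_core = (a2 - a1) / (n2 - n1)
    fixed = a1 - per_core * n1
    return fixed, per_core


# The two guesses from the post: 4 cores -> 125 mm^2, 6 cores -> 150 mm^2.
fixed, per_core = fit_linear((4, 125), (6, 150))


def area_mm2(cores):
    """Extrapolated area for a given core count under the linear assumption."""
    return fixed + per_core * cores


print(f"fixed part: {fixed} mm^2, per core: {per_core} mm^2")
print(f"8 cores: {area_mm2(8)} mm^2")
```

Under this reading the estimates amount to 75 mm² of non-scaling logic plus 12.5 mm² per core, so the fixed part dominates at low core counts.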

But let's see.

Glo. · Aug 10, 2018

maddie said:
Predictions

Stays with 2 4Core CCX = 8 core basic unit as exists today.
No separate uncore but more L3 cache
Fabric speed increases to accommodate AM4 memory limitations
Improved layout and de-bottlenecking = greater IPC + increased Clocks
More than 4 basic units for higher count EPYC CPUs = 8 x 8C die [64 cores]
EPYC on passive interposer of ~ 900mm^2 [minimal cost increase]
EPYC die fabbed at TSMC process for absolute efficiency.
Ryzen 3xxx fabbed at GloFlo process for higher clock speeds.

Almost all correct, apart from L3 cache. 64 core EPYC will have 256 MB L3 cache which divided 8 times gives 16 MB.
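
The 16 MB figure works out if it is read per CCX, not per die. A minimal sketch of the arithmetic, assuming the 8 dies with 2 CCXs each from the quoted prediction and the rumoured 256 MB total; l3_split is a hypothetical helper.

```python
def l3_split(total_l3_mb, dies, ccx_per_die):
    """Return (L3 per die, L3 per CCX) in MB for an even split."""
    per_die = total_l3_mb / dies
    per_ccx = per_die / ccx_per_die
    return per_die, per_ccx


per_die, per_ccx = l3_split(256, dies=8, ccx_per_die=2)
print(f"per die: {per_die} MB, per CCX: {per_ccx} MB")
```

With these inputs the split is 32 MB per die and 16 MB per 4-core CCX, i.e. 4 MB per core, which lines up with the 4MB-per-core rumour mentioned earlier in the thread.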

How many cores per CCX in 7nm Zen 2?

4 cores per CCX (3 or more CCXs per die)

6 cores per CCX (2 or more CCXs per die)

8 cores per CCX (1 or more CCXs per die)
