Speculation: The CCX in Zen 2

Vattila · Aug 10, 2018

AtenRa said:
Do you actually believe you use less with IF connecting 2x CCXs ??

If it is not obvious, try this: Again draw 8 small squares, representing cores, on a paper. Now partition them into two groups of four with a dashed line down the middle. Fully connect the cores in each group (6 links each).

Now you can start experimenting with interconnecting the two groups across the dashed line. Note that at a minimum a single additional link will do. But you will be able to decrease the maximum number of hops between any two cores in separate groups by adding more links. You may add additional intermediate nodes (routers). Draw these as small circles. You are now creating complex topologies (hyper-cube, fat tree, etc.).

PeterScott · Aug 10, 2018

AtenRa said:
Do you actually believe you use less with IF connecting 2x CCXs ??
And again, what is the problem of having more interconnects within a 7nm CCX vs 14nm CCX ??

28 internal interconnect buses are simply not manageable. There is also the timing issue, of when they can be active. High numbers of direct interconnects simply aren't done.

Direct connect is simply not scaleable to higher core counts. Which is why we have Ring/Mesh architectures on higher core counts.

AMD could go to a 8 core building block, but it would almost certainly be a different kind of topology than direct connect. Ring/Mesh being main candidates.

french toast · Aug 10, 2018

Mesh is surely the immediate future no?..butter donut coming to connect 8+ dice on an active interposer.
Don't see why they can't do this now in preparation...having many 4 core blocks direct connection, all connected via infinity fabric seems like a lot of connections...we already know the connections on zen sap alot of power, certainly more than Intel's ring bus, likely more than Intel's MESH?.
Why can't they have a 6-8 core CCX with mesh topology, connect various CCX via infinity fabric as they do now, then connect many dice together using butter donut on active interposer for Epyc.

The more you guys talk about connection complexity the more I think we need another topology.

Gideon · Aug 10, 2018

PeterScott said:
28 internal interconnect buses are simply not manageable. There is also the timing issue, of when they can be active. High numbers of direct interconnects simply aren't done.

No one does it this way. Which is why we have Ring/Mesh architectures on higher core counts.

AMD could go to a 8 core building block, but it would almost certainly be a different kind of topology than direct connect. Ring/Mesh being main candidates.

Yes I don't get why people are so adamant about increasing the CCX size.

It's seems to be the belief that connections within CCX are "fast" but outside are "slow".

The problem is, that the connections are fast because of direct interconnects (every core has a direct link to every other core). This isn't really doable with 8 cores and even 6 cores is really pushing it (15 connections instead of 6). Now if you use any other topologies within the CCX it kind of beats the point of a CCX.

The other side of the coin is that the connections between CCX's don't have to be that slow. Currently the latency between two CCX's is almost twice as bad as to memory and back. and latency between CCX's on two different dies is only 30-50% worse than on the same chip - that's ridiculous.

Why I personally don't believe in 8-core CCX is the fact that it would make it impossible to make mobile chips with less than 8 cores. It would either mean that every single 7nm 15W TDP mobile APU would either have 2-4 disabled cores (what a waste of silicon) or they end up designing both 4 core and 8 core CCX'es anyway, which doubles the engineering effort.

Glo. · Aug 10, 2018

AtenRa said:
Again, what is preventing you of having more connections within a single 7nm CCX ?

Simple question - simple answer.

Efficiency.

You do not design simplest way of connection between cores, so that it consumes as low amount of power, as possible to destroy it, by going more cores per CCX.

It would be equally brilliant technical decision, as was Bulldozer's focus on multithreaded performance instead of single core.

PeterScott · Aug 10, 2018

Here is a interesting Topology PDF. How to connect cores is a huge and interesting topic.
https://pdfs.semanticscholar.org/presentation/a4a1/b9fc2822facf0b1a439287f05d3141713ed7.pdf

Slide 13 is interesting because they talk about early simple options; Buses and point-to-point Interconnects. The 4 Core Interconnect, looks just like CCX. Note that the other half of the slide says: in big letters. Does not Scale!

Slide 35 is interesting because they show essentially a 6 core interconnect (what a mess) and the text says in bold: Not scalable!! Cannot layout more than 4-6 cores in this manner for area and power reasons

The 4 core CCX is AMDs building block for the time being, and I don't think they are anywhere near abandoning it. I think it easily scales to 4 CCX on a die. Beyond that AMD may look at alternatives. I have seen a "concentrated Mesh" that looks like a 4 core CCX at each node of the mesh, so perhaps CCX is here to stay with AMD.

Regardless of what is farther in the future, I would bet on 2019 AMD chips all having 4 core CCX. I just don't know how many of them.

teejee · Aug 10, 2018

AtenRa said:
I know very well how RYZEN die is, thank you.

Well, it doesn't looks like you understand how the intra/inter-communication works though. If you did then why didn't you propose a solution to intra communication with a 8-core CCX?

Going to 8 core CCX will increase intra CCX latency quite a lot ( since direct interconnect makes no sense in this case). So it is mainly cases where 5-8 related threads are allocated to the the same CCX where we will see an improvement with 8-core CCX compared to Zen 1. And cases with 2-4 related threads on the same CCX will get worse latency.

moinmoin · Aug 10, 2018

french toast said:
The more you guys talk about connection complexity the more I think we need another topology.

I expect multiples of 4 to be AMD's choice for now:
CCX with 4 cores. (status quo for Zen)
Dies with 4 CCX. (more likely than an increase of cores per CCX)
MCM with 4 dies. (status quo for Epyc, Threadripper 2)
4 sockets with an MCM each. (possibility for a new Epyc platform after the uncore IO per die/MCM at least doubles)
The advantage of this topology is that every node instance can be power gated without affecting the remaining nodes at the same level in the topology (unlike e.g. ring bus and mesh) while every node's 4 children are directly connected with each other. The limit of that topology is 4⁴ = 256 cores for a 4 socket system. The question is how to go beyond? Add another level of 4 through some use of chiplets (4⁵ = 1024 cores)?

Schmide · Aug 10, 2018

The biggest thing that influences my thoughts is AMD does not give up on designs.

Entertain for a second that AMD keeps the design very close to the current 2x4 design.

If each chip has 4 external interconnects, you could merge 2 dies and have a total of 8 interconnects. Between these 2 chips 4 interconnects go between them and 4 go outside to the rest of the multi chip package.

Communication between the merged dies is provisioned at 2x while multi chip communication happens at 1x.

Yes it inserts an extra hop between high and low 2xccx.

The rest of the redesign is removing excess ddr, pic-e, etc

Vattila · Aug 10, 2018

Thanks for all the delightful speculation! I am feeling a little bit under the weather today — I exhausted myself yesterday, and may have caught a bug. Still, I have had a good day, reading all these great posts. I don't feel like reprimanding a single poster. Thanks all!

AMD clearly has a lot of options evolving their design. I cannot wait for clarification.

Schmide · Aug 10, 2018

Oh and btw. Those merged dies. They have a controller that could be tweaked.

LightningZ71 · Aug 10, 2018

If it was simple or ideal to use 8 cores in a CCX, AMD would have done it with the first zen die instead of having to manage communications between two different CCX units per die. As you expand the direct connect CCX that AMD uses, you drastically increase the number of core connections that have to be routed from the cores, and you also drastically complicate the management of the CCX L3 cache section. Essentially, you add latency to the most common cross core link (inner CCX communications), while not decreasing it anywhere. You also reduce the improvement of NUMA node optimization strategies. Instead, you can just add one or two 4 core CCX units while preserving best case latencies, preserving latency numbers for the most common core to core links, and likely don't degrade worse case latencies by improving the IF of the uncore via higher clocking, increasing width, or creating a secondary IF crossbar for the CCX units with a common double width reservation station for I/O access.

jpiniero · Aug 10, 2018

LightningZ71 said:
If it was simple or ideal to use 8 cores in a CCX, AMD would have done it with the first zen die instead of having to manage communications between two different CCX units per die.

Not necessarily, because they were likely taking Raven Ridge into consideration.

Schmide · Aug 10, 2018

jpiniero said:
Not necessarily, because they were likely taking Raven Ridge into consideration.

Are you implying that they didn't make an 8 core because they wanted to split it for 4 core parts?

What do you say to the 28 interconnects per 8 needed?

jpiniero · Aug 10, 2018

Schmide said:
Are you implying that they didn't make an 8 core because they wanted to split it for 4 core parts?

I'm saying they wanted a 4 core CCX because they planned to use 1 for Raven Ridge and 2 for Summit Ridge.

fibonacc · Aug 11, 2018

I have no idea if there are 4 or 8 cores per CCX, just that 6 doesn't work well with 64c

William Gaatjes · Aug 11, 2018

Makes me wonder, let say that AMD goes for more cores and lowers clock and therefor voltage to reducing power consumption.
Would 3D stacking not be an option if they manage to get the heat out from the lower die ?
The ccx chiplets would be placed on top each other and TSV would be used to connect the 2 ccx chiplets. That would perhaps be shorter trace lengths and then an finished 8 core CCX requiring 28 connections be less of an issue.
But it all depends on power consumption.

Glo. · Aug 11, 2018

William Gaatjes said:
Makes me wonder, let say that AMD goes for more cores and lowers clock and therefor voltage to reducing power consumption.
Would 3D stacking not be an option if they manage to get the heat out from the lower die ?
The ccx chiplets would be placed on top each other and TSV would be used to connect the 2 ccx chiplets. That would perhaps be shorter trace lengths and then an finished 8 core CCX requiring 28 connections be less of an issue.
But it all depends on power consumption.

HEAT DENSITY!

Something like this would completely eradicate everything AMD wants to achieve with next gen Zen architecture.

William Gaatjes · Aug 11, 2018

Glo. said:
HEAT DENSITY!

Something like this would completely eradicate everything AMD wants to achieve with next gen Zen architecture.

Yeah, i was afraid of that.
Dram (HBM) is possible because the density is high but the gate clockspeed in the core is rather low in comparison to cpu cores that have logic gates switching at frequencies at least 10 times higher.
Of course the i/o interface from hbm runs at a high clock but that is a small part of each die.
When(If) ballistic conduction becomes practically possible, then 3d stacking of cores of course will no longer be an issue.

maddie · Aug 11, 2018

William Gaatjes said:
Yeah, i was afraid of that.
Dram (HBM) is possible because the density is high but the gate clockspeed in the core is rather low in comparison to cpu cores that have logic gates switching at frequencies at least 10 times higher.
Of course the i/o interface from hbm runs at a high clock but that is a small part of each die.
When(If) ballistic conduction becomes practically possible, then 3d stacking of cores of course will no longer be an issue.

What do you mean?

The main energy savings in HBM is to run the circuits at a lower speed, thus lowering needed voltage, and compensating for that, by increasing the # transfers. Power is directly proportional to F times V squared. The increase in # transfer lines pretty much negates the lower speed, but the decreased V is critical. Also, you also don't need a stack to achieve the power savings of HBM.

dnavas · Aug 11, 2018

Glo. said:
HEAT DENSITY!

Heat is going to be a problem, indeed.
Perhaps I missed it, but has anyone done an analysis of how a "40% performance boost OR 60% total power reduction" is going to yield 16 cores at 5Ghz? I've seen a lot of speculation about higher boost speeds and higher core counts, but I have a hard time seeing 12 cores at (or around) 5Ghz, nevermind 16.... Oh yes, and more bandwidth, lower latency, and wider vector ops too.
Engineering is about tradeoffs. What are we going to get, and what is AMD going to leave on the table? And having left whatever those items are on that proverbial table, which markets are AMD ceding?

I'd start with what's possible. Let's say, despite all previous history with Glofo numbers, that these percentages are actually accurate. Let's also assume that they ship on time, and on budget. And then let's let everything scale linearly. (optimism is my drug of choice) The R1700 is 65W @ 3Ghz. So, without changing anything, that would be 65W @ 4.2Ghz, or 26W @ 3Ghz. Immediately I see huge upsides for my server parts. 32 cores at 3Ghz is going to be wildly easy. 104W. 64 cores is right around 210W. Or take the 7601's 2.7Ghz / 180W specs ( * .4 / 2) and wind up with a tdp of 145W for 64 cores. Easy server wins. But the desktop?
The desktop parts need to get faster. Boost clocks need to aim to rise something like 25% over the 1800X. That leaves us with 11 cores at 5Ghz and a tdp of 95W. 105W for 12 cores. 140W for 16 cores. Would AMD really ship a 140W 16 core processor? For AM4?

256b-width vectors are probably easy on the floorplan, but likely difficult to survive thermals. IF speeds are also considerations. Can AMD find enough power savings to pay for all those expenses, or will one of these be sacrificed? If you want high core counts, you need to spend budget on the interconnects. As these past few pages attest to, this is not a simple problem.

Given the opportunity, I expect that the server parts drive core design, and that the desktop parts have to live with the tradeoffs. What do server parts need? More L3, a little more relaxed on the latency. Do server parts care about the latency in a hierarchy of 4 cores x 4 CCX x 4 dies? Probably not -- NUMA is fine. How would desktop respond? Lose the core-based marketing strategy on desktop, and cut those wide vector units if they have to. Desktop might have a 16 core chip, but not at high speeds. Plus, without high speed DDR5 they're likely to starve anyway. Low speed 16 core maybe. 12 core high speed w/o the gfx, 8 cores with a builtin gpu, 8 core non-gfx very high speed for the dedicated gamer. Meanwhile, split my Threadripper market in two. In fact, might as well introduce the idea of cores vs speed now -- witness WX vs X. I expect WX market gets the low-speed 16 core setups, while the X series gets two 12 cores.

I don't expect buttered, fried dough on my Zen2 IF design. :shrug: Hey, if they manage it, great, but being in the front of the 7nm launch gate seems risky enough.

William Gaatjes · Aug 11, 2018

maddie said:
What do you mean?

The main energy savings in HBM is to run the circuits at a lower speed, thus lowering needed voltage, and compensating for that, by increasing the # transfers. Power is directly proportional to F times V squared. The increase in # transfer lines pretty much negates the lower speed, but the decreased V is critical. Also, you also don't need a stack to achieve the power savings of HBM.

Oh, you are right, i agree. But HBM2 i/o interface is stated as 1.6Gb/s(2Gb/s) max. So the i/o interface to the cpu or gpu runs at a maximum clock of 1.6Gb/s and even 2Gb/s. This number is the per bit transfer speed. So the i/o circuit with latches, fifo and buffers needs to run on that speed but all the other parts of the dram dies run at a much lower clock.
The amount of bits (2048 or 4096) times this transfer speed divided by 8( to get a per byte value) is the raw max high bandwidth value we see.
https://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification

William Gaatjes · Aug 11, 2018

dnavas said:
Heat is going to be a problem, indeed.
Perhaps I missed it, but has anyone done an analysis of how a "40% performance boost OR 60% total power reduction" is going to yield 16 cores at 5Ghz? I've seen a lot of speculation about higher boost speeds and higher core counts, but I have a hard time seeing 12 cores at (or around) 5Ghz, nevermind 16.... Oh yes, and more bandwidth, lower latency, and wider vector ops too.
Engineering is about tradeoffs. What are we going to get, and what is AMD going to leave on the table? And having left whatever those items are on that proverbial table, which markets are AMD ceding?

I'd start with what's possible. Let's say, despite all previous history with Glofo numbers, that these percentages are actually accurate. Let's also assume that they ship on time, and on budget. And then let's let everything scale linearly. (optimism is my drug of choice) The R1700 is 65W @ 3Ghz. So, without changing anything, that would be 65W @ 4.2Ghz, or 26W @ 3Ghz. Immediately I see huge upsides for my server parts. 32 cores at 3Ghz is going to be wildly easy. 104W. 64 cores is right around 210W. Or take the 7601's 2.7Ghz / 180W specs ( * .4 / 2) and wind up with a tdp of 145W for 64 cores. Easy server wins. But the desktop?
The desktop parts need to get faster. Boost clocks need to aim to rise something like 25% over the 1800X. That leaves us with 11 cores at 5Ghz and a tdp of 95W. 105W for 12 cores. 140W for 16 cores. Would AMD really ship a 140W 16 core processor? For AM4?

256b-width vectors are probably easy on the floorplan, but likely difficult to survive thermals. IF speeds are also considerations. Can AMD find enough power savings to pay for all those expenses, or will one of these be sacrificed? If you want high core counts, you need to spend budget on the interconnects. As these past few pages attest to, this is not a simple problem.

Given the opportunity, I expect that the server parts drive core design, and that the desktop parts have to live with the tradeoffs. What do server parts need? More L3, a little more relaxed on the latency. Do server parts care about the latency in a hierarchy of 4 cores x 4 CCX x 4 dies? Probably not -- NUMA is fine. How would desktop respond? Lose the core-based marketing strategy on desktop, and cut those wide vector units if they have to. Desktop might have a 16 core chip, but not at high speeds. Plus, without high speed DDR5 they're likely to starve anyway. Low speed 16 core maybe. 12 core high speed w/o the gfx, 8 cores with a builtin gpu, 8 core non-gfx very high speed for the dedicated gamer. Meanwhile, split my Threadripper market in two. In fact, might as well introduce the idea of cores vs speed now -- witness WX vs X. I expect WX market gets the low-speed 16 core setups, while the X series gets two 12 cores.

I don't expect buttered, fried dough on my Zen2 IF design. :shrug: Hey, if they manage it, great, but being in the front of the 7nm launch gate seems risky enough.

I too doubt the all cores clocks and 5GHz numbers. If we take power 8 for example from IBM, that cpu has a monstrous (configurable) TDP. It has a 22nm process but a process very well optimized for high frequencies.

french toast · Aug 11, 2018

5ghz all core 16 cores??..no one ever suggested that did they??

Glo. · Aug 11, 2018

dnavas said:
Heat is going to be a problem, indeed.
Perhaps I missed it, but has anyone done an analysis of how a "40% performance boost OR 60% total power reduction" is going to yield 16 cores at 5Ghz? I've seen a lot of speculation about higher boost speeds and higher core counts, but I have a hard time seeing 12 cores at (or around) 5Ghz, nevermind 16.... Oh yes, and more bandwidth, lower latency, and wider vector ops too.

There is a very good Song from Aerosmith called "Dream On" and its title perfectly fits this for two reasons

.

If you want to double the core count you throw away clock speeds.
And TSMCs process struggles with performance, but is good on efficiency if clocked properly.

GloFo process is less efficient but yields higher frequency on 7 nm process.

Think about it this way: TSMC CPUs will top at 4.7 GHz at 90W Power consumption.
GloFo will get to 5GHz, at 100W of power.

AMD will do 8C/16T design which they can scale into two dies on AM4 platform, which can give them core advantage IF Needed.

Think about it this way:
If Intel will not come up with IceLake-S CPUs, AMD will stick to 8 core design for 3000 series, and price it at 350$, with 5% higher IPC than Skylake, up to 4.2 GHz base/5 GHz Turbo core clocks@ around 95W TDP.
If Intel will come up with Icelake-S CPUs: 16C/32T@ 3.0 GHz Base/4.5 Turbo@105W TDP for 500$ on AM4 platform with Ryzen 4000 Series(!).

Thanks to this approach, which I talked previously(8C/16T design only) AMD can scale ALL of their platforms easily, at almost no additional cost of the design.

If you ask me:

I gladly will take 4.2 GHz/5GHz 8C/16T design over 16C/32T but low core clock one.

Speculation: The CCX in Zen 2

How many cores per CCX in 7nm Zen 2?

4 cores per CCX (3 or more CCXs per die)

6 cores per CCX (2 or more CCXs per die)

8 cores per CCX (1 or more CCXs per die)

Senior member

Platinum Member

Senior member

Platinum Member

Diamond Member

Platinum Member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Platinum Member

Lifer

Diamond Member

Lifer

Junior Member

Lifer

Diamond Member

Lifer

Diamond Member

Senior member

Lifer

Lifer

Senior member

Diamond Member