Speculation: The CCX in Zen 2

LightningZ71 · Aug 10, 2018

There is no reason that the existing AM4 socket can not support a Ryzen package with two dies onboard at 7nm if they keep the die with its current layout of 2 X 4core CCX. That die gets TINY. Two of them on a package can be connected to each other just like TR1 was. The extra I/O lanes just go unused. The DDR channels can either be split between the two die, or they can both go to one die and the other can piggyback off of the primary die. With improvements made to the uncore clocks and the IF connections between the die, the net effect can be substantially reduced.

Even if they basically die shrink the existing Zen+ die to 7nm, and make minor internal tweaks, putting 6-8 of them on an epyc package is still going to be a significant undertaking. It will require more layers on the MCM for signal routing (more expense, greater chance for failure, etc) and, to get to a 2P solution like they currently have, it will require an expensive, and custom, glue logic chip. Right now, AMD needs to sell on value first, and both of those things don't help that.

The shrink to 7nm allows them to expand on chip resources without making significant changes to their existing packages and sockets. They have pledged to maintain AM4 and the Epyc/TR socket for a couple more years. 7nm won't change that. I expect 7nm to give us floorplan changes for the die, and only minor revisions to the packages. To get where AMD wants to be, though, they may be forced to do multiple die. I'm expecting that they will maintain the 4 core CCX to keep direct access between the cores to as minimal a delay as possible. I suspect that they will add additional CCX units as needed. It would not shock me to see a consumer 3 X 4 core CCX design get released through GloFo (Ryzen 3X00 from 200 to 700, focus on clocks) and a high end 4 X 4 core CCX design get used for Epyc, TR, and a HEDT AM4 solution ( Ryzen 3800) that focus on low energy and efficiency. Additional CCX units are not a major technological hurdle for the zen architecture as they all connect over the IF. If they make improvements to the IF uncore, they can mitigate the impact of additional inter CCX traffic.

Vattila · Aug 10, 2018

JoeRambo said:
By having 80ns of latency?

To quote myself: "Then optimise this topology by adding further connections as far as metal layers allow, creating a more complex and optimised topology, that brings down average latency between any two cores."

But do not throw out the baby with the bath water. A direct-connected 4-core CCX can be optimised near the theoretically lowest latency achievable.

Vattila · Aug 10, 2018

JoeRambo said:
That is simply not true. Instead of checking just 1 CCX, requests will need to be sent to 3 entities

Of course a system with a lower number of cores can have lower latency. I was referring to options for a 64-core EPYC 2. Adding direct-connected CCXs doesn't worsen the situation when you scale from 8 to 16 cores per chip. You still just have the cross-CCX latency "penalty", which still is just one hop between CCXs.

PeterScott · Aug 10, 2018

eek2121 said:
A lot of you guys aren't thinking this through. Zen2 has to be compatible with AM4. AM4 has dual channel RAM. Adding more cores will add bandwidth and latency constraints as cores become starved of RAM. Zen2 will be 2x4 core CCXes, just like previous designs. You won't see a core increase until a new socket.

I think the CCX will remain at 4 cores. AMD made an elegant building block with the CCX, with direct connection between each internal die/cache slice. That kind of direct connection probably doesn't scale will beyond 4 cores, so the 4 core CCX remains the building block for a while.

As far memory starvation. Look at 32 core threadripper. Only one memory channel per 8 cores. So they could have more cores on the next Ryzen die.

So then next Ryzen die could be 2 x CCX, or 3 x CCX. or 4 x CCX.

I think it could still be 2 x CCX (8core), but will rely on process/IPC improvements, or it could be 3 x CCX (12 core) for a core count improvement. I doubt there will 4 x CCX die anytime soon.

Vattila · Aug 10, 2018

Abwx said:
This would require yet another interconnect between CCXs,

Two CCXs are currently direct-connected. 4 CCXs can equally be direct-connected (6 links). Seems simple enough.

Sorry if I am repeating myself.

Trumpstyle · Aug 10, 2018

Vattila said:
Typo?

The size of a CCX is 45.5 mm² on 14LPP, and with over 2x density on the 7LP process, a straight shrink should be down to less than half the size. 25-50 mm² is allowing for some additional transistor budget for core improvements and larger caches.

Yeah I screwed up my thinking, was thinking of the chip. 2x ccx is atm 200mm2 on 14nm and will probably be about 125mm2 on 7nm despite 3x scaling improvement, that's what I meant and it's unlikely we see any kind of 4-core ccx because of this.

Trumpstyle · Aug 10, 2018

JoeRambo said:
That does not pass any common sense checks. Intel has Coffee Lake 150mm, with GPU on board. AMD 4 cores and 8MB of L3 are estimated 44mm^2. How can 6 cores be that large on a process more dense?

Yeah I meant the chip. 2x 4-core ccx on 14nm is 200mm2 and on 7nm it would likely be about 125mm2 because things don't scale perfectly.

JoeRambo · Aug 10, 2018

PeterScott said:
I doubt there will 4 x CCX die anytime soon.

Rumours have 64C epyc. It would be extreme NUMA if they have 8 dies in package ( but i would not be a farm on this not happening, as AMD just redefined NUMA with memory conn less TR ).

Vattila said:
Two CCXs are currently direct-connected. 4 CCXs can equally be direct-connected (6 links). Seems simple enough.

Are You aware that current "direct connection" has latency on the order of going to memory? 80ns is eternity. And latency and power expended will rise if number connections go from 1 to 3 and each of those 3 needs to be beefier to handle increased traffic.

From THG investigation:

Cross-CCX quantifies the latency between threads located on two separate CCXes, and we see a similar reduction thanks to overclocking. Notably, the Ryzen 7 1800X features much lower Cross-CCX latency than the stock Threadripper and most overclocked configurations. This is likely due to some form of provisioning, possibly in the scheduling algorithms, for Threadripper's extra layer of fabric.

As we can see, the overclocked Threadripper CPU in Game mode, which doesn't have an active fabric link to the other die, has the lowest Cross-CCX latency.

.

I have bolded it for You. Latency goes down in current gen if number of destinations is reduced by one. And you plan on increasing them 3x.

Vattila · Aug 10, 2018

JoeRambo said:
Are You aware that current "direct connection" has latency on the order of [80ns]?

It is not direct-connection that causes the cross-CCX latency. It is the routing from core to core across CCXs, i.e. between layers in the quad-tree topology. I discuss this in the OP. If you know of better topologies, let me know.

There is a lot AMD can do to further optimise the topology. See AdoredTVs video on this for an overview of the research they are doing.

Vattila · Aug 10, 2018

JoeRambo said:
From THG investigation [about inter-CCX latency]

These investigations have created a misunderstanding about the trade-offs made in topology choices and an undue hostility to having more than one CCX on a chip. The theoretically optimal direct-connection is thrown out as the proverbial baby with the bathwater, just to get a more uniform latency, worse than direct-connect but better than worst-case cross-CCX latency — as if the best-case latency of direct-connect doesn't count, and as if average latency across the chip doesn't count.

Hence we have this aversion to CCX partitioning among many here.

For scalability, you have to partition. There is no way around that. Ringbus doesn't scale. No one wants a mesh. Any alternatives? Butter doughnut, anyone?

AtenRa · Aug 10, 2018

I have just voted for 8xCores per CCX.

I really dont understand what problems you all find with this setup. You can have a shared L3 cache between all 8 Cores within the CCX, so what is the problem again ??

Vattila · Aug 10, 2018

AtenRa said:
what is the problem again ??

Topology. See OP.

What kind of interconnect topology between cores does your 8-core CCX have?

jpiniero · Aug 10, 2018

AtenRa said:
I have just voted for 8xCores per CCX.

I really dont understand what problems you all find with this setup. You can have a shared L3 cache between all 8 Cores within the CCX, so what is the problem again ??

You'd have to move the intra-ccx connection to something other than direct, maybe a ring. Which they could do but I just think it's unlikely.

Glo. · Aug 10, 2018

AtenRa said:
I have just voted for 8xCores per CCX.

I really dont understand what problems you all find with this setup. You can have a shared L3 cache between all 8 Cores within the CCX, so what is the problem again ??

Very simple: CCX is cross-core communication. It is designed for best possible(lowest) latency, and highest possible efficiency. Adding even two extra cores destroys both of those principles, not to mention another 4 cores, to each CCX.

It won't happen.

french toast · Aug 10, 2018

So basically we have next to no clue what is going to happen...could be anything...8-16 cores, 1-4 ccx.

AtenRa · Aug 10, 2018

Vattila said:
Topology. See OP.

What kind of interconnect topology between cores does your 8-core CCX have?

Just double the 4 Core CCX, what is your problem with topology ??

Vattila · Aug 10, 2018

AtenRa said:
what is your problem with topology ??

See jpiniero's and Glo's replies.

Glo. · Aug 10, 2018

french toast said:
So basically we have next to no clue what is going to happen...could be anything...8-16 cores, 1-4 ccx.

We do have clue. It is 8 core CPU, made from dual 4 core CCX.

AtenRa said:
Just double the 4 Core CCX, what is your problem with topology ??

Communication over crossbar! Simple as it can be. Read the specs of Ryzen, from Anandtech's article, and you will understand what is the problem.

AtenRa · Aug 10, 2018

Glo. said:
Very simple: CCX is cross-core communication. It is designed for best possible(lowest) latency, and highest possible efficiency. Adding even two extra cores destroys both of those principles, not to mention another 4 cores, to each CCX.

It won't happen.

What is the problem of adding two/four cores to the CCX ??? you believe that adding 4x more cores to the CCX will increase latency more than having 2x CCXs connected by IF ?? I dont believe there is a problem connecting 8 Cores together in a single CCX.

PeterScott · Aug 10, 2018

AtenRa said:
What is the problem of adding two/four cores to the CCX ??? you believe that adding 4x more cores to the CCX will increase latency more than having 2x CCXs connected by IF ?? I dont believe there is a problem connecting 8 Cores together in a single CCX.

It's not adding cores, it's how the interconnects increase:

I believe the number of interconnect to for N objects is (N*(N-1))/2.

4 = 4*3/2 = 6 interconnects ( manageable )
6 = 6*5/2 = 15 interconnects ( unmanageable )
8 = 8*7/2 = 28 interconnects ( unmanageable )

Direct Connects are the fastest, but have to abandon that beyond 4 cores. You need a whole new idea, like a ring bus/mesh.

AMD built an elegant 4 core CCX and I think they will stick with it as a building block for a while.

Glo. · Aug 10, 2018

AtenRa said:
What is the problem of adding two/four cores to the CCX ??? you believe that adding 4x more cores to the CCX will increase latency more than having 2x CCXs connected by IF ?? I dont believe there is a problem connecting 8 Cores together in a single CCX.

How hard is it for you to read anything detailed about the design of Zen and Ryzen CPUs?

This is the problem:

Again, the Zen architecture employs a four-core CCX (CPU Complex) building block. AMD outfits each CCX with a 16-way associative 8MB L3 cache split into four slices; each core in the CCX accesses this L3 with the same average latency. Two CCXes come together to create an eight-core Ryzen 7 die (image below), and they communicate via AMD’s Infinity Fabric interconnect. The CCXes also share the same memory controller. This is basically two quad-core CPUs talking to each other over a dedicated pathway: Infinity Fabric, a 256-bit bi-directional crossbar that also handles northbridge and PCIe traffic. The large amount of data flowing through this pathway requires a lot of scheduling magic to ensure a high quality of service. It's also logical to assume that the six- and four-core models benefit from less cross-CCX traffic compared to the eight-core models.

Source: https://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html

Vattila · Aug 10, 2018

AtenRa said:
I dont believe there is a problem connecting 8 Cores together in a single CCX.

Then what is the topology you suggest for your 8-core CCX?

If you don't understand what I am asking, draw 8 small squares on a paper (representing cores), then connect them up with your pencil. How many links do you need? How many hops between cores? At a minimum? Maximum? Average? What is the widths of the links? What is the power consumption?

Then evaluate. Did you do better than two direct-connected 4-core CCXs?

AtenRa · Aug 10, 2018

PeterScott said:
It's not adding cores, it's how the interconnects increase:

I believe the number of interconnect to for N objects is (N*(N-1))/2.

4 = 4*3/2 = 6 interconnects ( manageable )
6 = 6*5/2 = 15 interconnects ( unmanageable )
8 = 8*7/2 = 28 interconnects ( unmanageable )

Direct Connects are the fastest, but have to abandon that beyond 4 cores. You need a whole new idea, like a ring bus/mesh.

AMD built an elegant 4 core CCX and I think they will stick with it as a building block for a while.

Do you actually believe you use less with IF connecting 2x CCXs ??
And again, what is the problem of having more interconnects within a 7nm CCX vs 14nm CCX ??

AtenRa · Aug 10, 2018

Glo. said:
How hard is it for you to read anything detailed about the design of Zen and Ryzen CPUs?

This is the problem:

Again, the Zen architecture employs a four-core CCX (CPU Complex) building block. AMD outfits each CCX with a 16-way associative 8MB L3 cache split into four slices; each core in the CCX accesses this L3 with the same average latency. Two CCXes come together to create an eight-core Ryzen 7 die (image below), and they communicate via AMD’s Infinity Fabric interconnect. The CCXes also share the same memory controller. This is basically two quad-core CPUs talking to each other over a dedicated pathway: Infinity Fabric, a 256-bit bi-directional crossbar that also handles northbridge and PCIe traffic. The large amount of data flowing through this pathway requires a lot of scheduling magic to ensure a high quality of service. It's also logical to assume that the six- and four-core models benefit from less cross-CCX traffic compared to the eight-core models.

Source: https://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html

I know very well how RYZEN die is, thank you.

AtenRa · Aug 10, 2018

Vattila said:
Then what is the topology you suggest for your 8-core CCX?

If you don't understand what I am asking, draw 8 small squares on a paper (representing cores), then connect them up with your pencil. How many links do you need? How many hops between cores? At a minimum? Maximum? Average? What is the widths of the links? What is the power consumption?

Then evaluate. Did you do better than two direct-connected 4-core CCXs?

Again, what is preventing you of having more connections within a single 7nm CCX ??? I really dont understand why you people having a problem with more interconnects within the CCX. You do all realize that IF is also connecting the two CCXs, so why use the IF and not a direct interconnect to the cores ??

But i will ask again, what is it that preventing you of using more interconnects with a denser 7nm CCX ?? And why do you all believe that having 2x CCXs connected by IF is faster or using less interconnects (both the interconnects of the CCXs + the IF) than having 8x Cores in a single CCX ???

Edit: I will try to make you a topology drawing

Speculation: The CCX in Zen 2

How many cores per CCX in 7nm Zen 2?

4 cores per CCX (3 or more CCXs per die)

6 cores per CCX (2 or more CCXs per die)

8 cores per CCX (1 or more CCXs per die)

Platinum Member

Senior member

Senior member

Platinum Member

Senior member

Member

Member

Golden Member

Senior member

Senior member

Lifer

Senior member

Lifer

Diamond Member

Senior member

Lifer

Senior member

Diamond Member

Lifer

Platinum Member

Diamond Member

Senior member

Lifer

Lifer

Lifer