Speculation: The CCX in Zen 2

Ajay · Aug 9, 2017

Arachnotronic said:
The answer to this question was revealed some time ago in a SemiAccurate article.

LOLZ, care too share?

Arachnotronic · Aug 9, 2017

Ajay said:
LOLZ, care too share?

Paywalled

Mopetar · Aug 9, 2017

I think that they'll stick with a 4-core CCX and make improvements to the core itself, get a die shrink of the current design, or perhaps be really aggressive and target a revamped core on the 7 nm process. That approach is going to yield the best results without having to significantly rework the underlying design.

I suspect that when AMD does change the CCX, they go to an 8-core design but won't have direct connections between all cores. Even then they might still keep the old 4-core design around as there are some segments of the market that aren't going to really need more than 4C/8T in the foreseeable future.

scannall · Aug 9, 2017

Plenty for them to do already. Refining and cleaning up their uarch. Pick the low hanging fruit. Pick up some IPC. A process shrink not that far off. And I'd add, that 4 cores per CCX is a LOT more flexible.

Ajay · Aug 9, 2017

Arachnotronic said:
Paywalled

Thx anyway.

dacostafilipe · Aug 10, 2017

Tuna-Fish said:
...

My bad, totally forget about Starship. The 6 core CCX makes sense now.

Topweasel · Aug 10, 2017

Problem is they also showed in the roadmap a APU with 4c8t debuting roughly at the same time imply that the CCX would still be 4 cores. My guess is 3 CCX. Setup in a loose triangle. Peoples worry is latency increase and personally another 5-10ns within a CCX with 6 cores would be better than the 80ns+ to nearly twice as many cores. But they can keep basically the same performance by doing a triangle three CCX's. Each CCX would connect on opposite ends to 2 CCXs which would maintain both internal and external latencies. Whereas if they do 4 or more CCX's they would either have increased latency like the ring bus, or dramatically increase IF connections.

Topweasel · Aug 10, 2017

TimCh said:
The cores in a single CCX is not connected using IF, IF is used to connect multiple CCXs

Not true IF is used at all level of connections including those with the CCX.

Vattila · Aug 12, 2017

Thanks for the replies!

My poll started off great with a majority in my direction (4 cores per CCX). However, it now has flipped with a majority for 6 cores per CCX. The answer is of course known to some (by insiders and those with information from pay-walled reports). However, I'd still like to see better argumentation for going with 6 cores per CCX.

raghu78 said:
The most likely option is 6 core CCX. The primary advantage would be that a Ryzen APU could incorporate a single 6 core CCX and avoid the cross CCX communication penalty.

In benchmarks, we see that the cross-CCX latency does not matter much for most parallel workloads. And for those workloads where latency matters, most seem to run well on 4 cores (e.g. traditional single-threaded workloads, games). Workloads that need low latency over many cores are rare, I guess, and often a sign of poor programming that does not scale well (many threads contending for shared memory/locks).

So why prioritise low cross-CCX latency to the detriment of cross-core latency within a CCX?

Doom2pro said:
If they use 4 core CCX like now, what are they going to do with all that extra space?

Add CCX's. Or other units, such as graphics or accelerators (e.g. my 3+1 idea, 3 CCX and 1 GCX).

moinmoin said:
[6 dies per socket] is simply unworkable and would also be incompatible with Epyc's dual sockets systems that currently link the four dies between each chip.

Exactly. There can be no more than 4 units at any level of the direct-connected hierarchy I outlined — 4 cores, 4 CCXs, 4 dies and 4 sockets. Of course, in the future, on a smaller process, you can integrate more on the die and go to the next level, so that 4 of the current dies are integrated on a single die. Let's call that a cluster (4 CCXs per cluster, 4 clusters per die). Then you would have maximum 4 cores, 4 CCXs, 4 clusters, 4 dies and 4 sockets, for a maximum of 1024 cores in a system. Still, there would be max log4(n) hops between any two cores (for 1024 cores, that is max 5 hops).

SPBHM said:
6 core CCX would make more sense for the desktop CPUs higher than R3

Or 3 x 4-core CCX. Why not? Why 6-core CCX?

krumme said:
For 48 cores on epyc they need to upgrade within the ccx for that to work right?

No. 3 x 4-core CCX per die times 4 dies per socket = 48 cores.

Gideon said:
I still don't understand, why can't they just put 3 CXX'is on a chip. Mapping them to memory should be way easier than interconnecting 6 cores within a CXX

Yup, whether they go with 4-cores or 6-cores per CCX, by sticking with the 8 memory channels per socket for Epyc 2, memory controllers will have to be shared. That's 2 channels per die, which is 2 channels per 12 cores (3 x 4 or 2 x 6) for a 48-core Epyc 2.

Ajay said:
Inserting 2 more cores into the middle of the current floorplan would seem to be the best choice for reorganizing the L3$ interconnects within a CCX.

Why is that the best choice? It is sub-optimal. By increasing the core count per CCX, you can no longer feasibly direct-connect the cores (6*5/2 = 15 links!). Why take the penalty when they can do 3 x 4-core CCX instead?

Tuna-Fish said:
all 4 cores have their own link to each 4 L3 slices, all intercore communication goes through the L3. This means that there are currently 16 links.

I think this is wrong. Currently the L3 controller for one core is direct-connected to the L3 controllers of the other cores (3 ports), for a total of 6 links (4*3/2). See AMD's slide or SpaceBeer's nice drawing in his post in this thread.

William Gaatjes said:
I predict that the professional high end server versions (and some HEDT) will be 8 core CCX in the upcoming years to come.

Again, that would require a sub-optimal interconnect topology. Why not double the number of CCXs instead? Why complicate the CCX? Do you think cross-CCX latency is such a problem that increasing complexity and cross-core latency within the CCX is worth it?

Mopetar said:
I think that they'll stick with a 4-core CCX and make improvements to the core itself,

Agree. There are many things that can be done to improve the 4-core CCX. For example, widen the core, thereby increasing the IPC, then add 4-thread SMT to exploit it through parallelism (16 threads per CCX). This would add threads without complicating the core interconnect.

Mopetar said:
I suspect that when AMD does change the CCX, they go to an 8-core design but won't have direct connections between all cores.

Again, why complicate the interconnect within a CCX when you can double the number of CCXs instead?

scannall said:
Plenty for them to do already. Refining and cleaning up their uarch. Pick the low hanging fruit. Pick up some IPC.

Agree. Probably there are lots of low hanging fruit. Then, for the future, like I mentioned above, e.g. go for a wider core with 4-thread SMT, but keep the CCX simple.

NeoLuxembourg said:
My bad, totally forget about Starship. The 6 core CCX makes sense now.

Or 3 x 4-core CCX. Why does a 6-core CCX make more sense?

Topweasel said:
But they can keep basically the same performance by doing a triangle three CCX's. [...] Whereas if they do 4 or more CCX's they would either have increased latency like the ring bus, or dramatically increase IF connections.

Four CCXs can easily be direct-connected (4*3/2 = 6 links), but beyond that you are right. For more than 4 CCXs on a die, you would need to group by (max) 4 again, e.g. clusters of 4, assuming we are adhering to the hierarchical direct-connect topology I outlined.

William Gaatjes · Aug 12, 2017

Vattila said:
Thanks for the replies!

My poll started off great with a majority in my direction (4 cores per CCX). However, it now has flipped with a majority for 6 cores per CCX. The answer is of course known to some (by insiders and those with information from pay-walled reports). However, I'd still like to see better argumentation for going with 6-cores per CCX.

In benchmarks, we see that the cross-CCX latency does not matter much for most parallel workloads. And for those workloads where latency matters, most seem to run well on 4 cores (e.g. traditional single-threaded workloads, games). Workloads that need low latency over many cores are rare, I guess, and often a sign of poor programming that does not scale well (many threads contending for shared memory/locks).

So why prioritise low cross-CCX latency to the detriment of cross-core latency within a CCX?

Why is that the best choice? It is sub-optimal. By increasing the core count per CCX, you can no longer feasibly direct-connect the cores (6*5/2 = 15 links!). Why take the penalty when they can do 3 x 4-core CCX instead?

Agree. There are many things that can be done to improve the 4-core CCX. For example, widen the core, thereby increasing the IPC, then add 4-thread SMT to exploit it through parallelism (16 threads per CCX). This would add threads without complicating the core interconnect.

Again, why complicate the interconnect within a CCX when you can double the number of CCXs instead?

Agree. Probably there are lots of low hanging fruit. Then, for the future, like I mentioned above, e.g. go for a wider core with 4-thread SMT, but keep the CCX simple.

Four CCXs can easily be direct-connected (4*3/2 = 6 links), but beyond that you are right. For more than 4 CCXs on a die, you would need to group by (max) 4 again, e.g. clusters of 4, assuming we are adhering to the hierarchical direct-connect topology I outlined.

Again, that would require a sub-optimal interconnect topology. Why not double the number of CCXs instead? Why complicate the CCX? Do you think cross-CCX latency is such a problem that increasing complexity and cross-core latency within the CCX is worth it?

A server cpu can be much more expensive and as such a more complex layout can be accounted for.
I agree that server or datacenter software is more likely to be optimized for the NUMA model from AMD.
But in time, Intel is going to have enough experience with their new mesh design to be even further ahead than they are now (Which is not that far, but they do have the financial situation advantage.)
And i think AMD will go to 8 CCX for server cpu because of the competitive pressure from Intel that will come.

That picture of spacebeer is very interesting. But i have always understood that the DF links are used between CCX themselves and not in the interCCX core communication.
I have always understood that the inter core communication happens through the L3 cache directly.
AMD has mentioned that infinity fabric in general can be used on chip as well from die to die. But that is more from gpu to cpu like on an APU.

The picture of space beer has a direct connection Between the cores but it is mentioned in the Anandtech article that the L3 cache latency on average is the same.
This gives to suspect that the cores in a CCX are connected as hops and not directly through a cross connection in the center.

HC28.AMD.Mike%20Clark.final-page-014.jpg

"
Each core will have direct access to its private L2 cache, and the 8 MB of L3 cache is, despite being split into blocks per core, accessible by every core on the CCX with ‘an average latency’ also L3 hits nearer to the core will have a lower latency due to the low-order address interleave method of address generation.
"
But i am not up to date to be honest. Because i am still trying to understand with what this exactly means.

As a sidenote:
I am still wondering why a 3 core like the 1600X has much lower cache latency while the 1600, also a 3 core has not.
It must be a fluke that has been copied from site to site.
And why the 1500x , also with 8MB cache per CCX but only 2 functional cores has not.
I am confused.

I have been reading through this material from AMD themselves to educate myself, but have not found anything conclusive.
http://www.anandtech.com/show/11170...-review-a-deep-dive-on-1800x-1700x-and-1700/9
http://support.amd.com/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
http://support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf

Vattila · Aug 12, 2017

William Gaatjes said:
A server cpu can be much more expensive and as such a more complex layout can be accounted for.

Still, that is not the issue. Direct connections are optimal. And currently the cores in a 4-core CCX are directly connected. No other topology can beat direct-connections. So to argue for a 6-core CCX you'll have to argue why increasing core-to-core latency is a good thing, since a 6-core CCX will presumably need a sub-optimal topology (e.g. mesh or ring). Also note that, in a chip with more than one CCX, you'll always have cross-CCX latency, even with a 6-core CCX. So to argue for a 6-core CCX on grounds of latency, you'll have to show that it lowers latency for the system overall.

William Gaatjes said:
in time, Intel is going to have enough experience with their new mesh design to be even further ahead than they are now

Direct-connections are optimal. There is no way to beat it. Perhaps a 8-core Intel mesh has lower average latency than a 2 x 4-core CCX. However, if a workload fits in a CCX, it will always have lower core-to-core latency due to direct-connections. So it will be a trade-off. Again, to argue for a 6-core CCX (or a 8-core CCX) you'll have to argue that it lowers latency overall for the workloads that matter.

William Gaatjes said:
That picture of spacebeer is very interesting. But i have always understood that the [Infinity Fabric] links are used between CCX themselves and not in the interCCX core communication.

The kind of link is not material (Infinity Fabric is just a protocol). The point is, as shown by Spacebeer's drawing, that the cores are directly connected. The drawing shows the topology and the numbers of links required to directly connect cores.

William Gaatjes said:
I have always understood that the inter core communication happens through the L3 cache directly.

Like I wrote in my last post, the L3 controller of one core is directly connected to the L3 controllers of the other cores. This is illustrated in the AMD slide I posted earlier. According to that slide, the total shared L3 cache is accessed with interleaved addressing on the low-order bits, so that binary address "...00" goes to the local cache controller, while addresses "...01", "...10" and "...11" will go to the cache controllers of the other cores, respectively. That gives a consistent average access latency. That's how I understand it.

Edit: Note that this address interleaving scheme will not work for 6 cores.

maddie · Aug 12, 2017

William Gaatjes said:
A server cpu can be much more expensive and as such a more complex layout can be accounted for.
I agree that server or datacenter software is more likely to be optimized for the NUMA model from AMD.
But in time, Intel is going to have enough experience with their new mesh design to be even further ahead than they are now (Which is not that far, but they do have the financial situation advantage.)
And i think AMD will go to 8 CCX for server cpu because of the competitive pressure from Intel that will come.

That picture of spacebeer is very interesting. But i have always understood that the DF links are used between CCX themselves and not in the interCCX core communication.
I have always understood that the inter core communication happens through the L3 cache directly.
AMD has mentioned that infinity fabric in general can be used on chip as well from die to die. But that is more from gpu to cpu like on an APU.

The picture of space beer has a direct connection Between the cores but it is mentioned in the Anandtech article that the L3 cache latency on average is the same.
This gives to suspect that the cores in a CCX are connected as hops and not directly through a cross connection in the center.

"
Each core will have direct access to its private L2 cache, and the 8 MB of L3 cache is, despite being split into blocks per core, accessible by every core on the CCX with ‘an average latency’ also L3 hits nearer to the core will have a lower latency due to the low-order address interleave method of address generation.
"
But i am not up to date to be honest. Because i am still trying to understand with what this exactly means.

As a sidenote:
I am still wondering why a 3 core like the 1600X has much lower cache latency while the 1600, also a 3 core has not.
It must be a fluke that has been copied from site to site.
And why the 1500x , also with 8MB cache per CCX but only 2 functional cores has not.
I am confused.

I have been reading through this material from AMD themselves to educate myself, but have not found anything conclusive.
http://www.anandtech.com/show/11170...-review-a-deep-dive-on-1800x-1700x-and-1700/9
http://support.amd.com/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
http://support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf

From your first link.

William Gaatjes · Aug 12, 2017

Yes, i can edit again.
Line of text here under should be :

I should note, that in my opinion the next consumer zen should have more DF links, since that is where the thread migrating benefits most from in comparison to a monolithic setup alike Intel has.

Vattila said:
Still, that is not the issue. Direct connections are optimal. And currently the cores in a 4-core CCX are directly connected. No other topology can beat direct-connections. So to argue for a 6-core CCX you'll have to argue why increasing core-to-core latency is a good thing, since a 6-core CCX will presumably need a sub-optimal topology (e.g. mesh or ring). Also note that, in a chip with more than one CCX, you'll always have cross-CCX latency, even with a 6-core CCX. So to argue for a 6-core CCX on grounds of latency, you'll have to show that it lowers latency for the system overall.

Well, Intel is still the biggest player and their architecture is the deciding factor to optimize software for and will remain so for some time. We can see this in several benchmarks.
A best of both worlds is an 8 core CCX connected with more DF links for higher inter CCX communication and more L3 cache of course.
As a sidenote: The big question is of course, is a single 64bit memory controller enough for one 8 core CCX.

But anyway, then AMD has a much stronger position.
As seen in benchmarks, the L3 read and write is not that fast in comparison to Intels L3 read and write speeds.
But AMD has a huge L3 cache copy speed. And that is in my opinion to facilitate the inter core communications.
If AMD can maintain this copy speed with the 7nm process and more 8 cores per CCX, there is not an issue.

Direct-connections are optimal. There is no way to beat it. Perhaps a 8-core Intel mesh has lower average latency than a 2 x 4-core CCX. However, if a workload fits in a CCX, it will always have lower core-to-core latency due to direct-connections. So it will be a trade-off. Again, to argue for a 6-core CCX (or a 8-core CCX) you'll have to argue that it lowers latency overall for the workloads that matter.

Of course direct connections are best, and the big question is of course : Will a workload always fit in that cache. I doubt that.
And besides, the whole issue is that threads are migrated from core to core as several kinds of software is running and that is where AMD is behind Intel. One could force to run the threads on one CCX but that can also cause lower utilization of the processer as a whole for as far as i know. Server cpu can be running multiple virtual machines that run programs with lots of threads. And that is where the inter CCX communication hurts. But then again, faster and more wider links between the CCX could alleviate that problem enough that that no longer matters and a 4 core CCX would remain to be enough. Infinity fabric is advertised to have a maximum bandwidth of 512GB/sec. Maybe only 1/4 of speed as DF transfer speed between CCX, that would be enough i guess with current L3 read/write speeds. And hide the DF latency enough that it no longer matters. But there will be a point of diminishing returns, there always is. If anybody could come up with some calculations, i am interested.

The kind of link is not material (Infinity Fabric is just a protocol). The point is, as shown by Spacebeer's drawing, that the cores are directly connected. The drawing shows the topology and the numbers of links required to directly connect cores.

Of course infinity fabric is a protocol but it is also a physical implementation.
And after doing some more reading, it does seem that the scalable data fabric is also used to connect the different L3 caches. But that still does not explain the average latency explanation between L3 caches. I cannot understand that yet.

Like I wrote in my last post, the L3 controller of one core is directly connected to the L3 controllers of the other cores. This is illustrated in the AMD slide I posted earlier. According to that slide, the total shared L3 cache is accessed with interleaved addressing on the low-order bits, so that binary address "...00" goes to the local cache controller, while addresses "...01", "...10" and "...11" will go to the cache controllers of the other cores, respectively. That gives a consistent average access latency. That's how I understand it.

Edit: Note that this address interleaving scheme will not work for 6 cores.

[/QUOTE]

I agree on this that they are all connected. And with respect to your edit, An 8 core CCX can have a similar address interleaving since 8 is 2^3.

I have a lot of information bookmarked, might add this as well:
https://www.nextplatform.com/2017/07/12/heart-amds-epyc-comeback-infinity-fabric/
https://www.techpowerup.com/231268/...yzed-improvements-improveable-ccx-compromises

maddie said:
From your first link.
I have seen that, but there is no proper explanation.
Can be marketing talk.

William Gaatjes · Aug 12, 2017

Forum software is acting up. I cannot edit my post.

maddie said:
From your first link.

I have seen that, but there is no proper explanation.
Can be marketing talk.

maddie · Aug 12, 2017

William Gaatjes said:
Forum software is acting up. I cannot edit my post.

I have seen that, but there is no proper explanation.
Can be marketing talk.

Pc Perspective latency test show identical latency between the cores on a single CCX.

This means that there is an equal travel time between all four L3 caches. There is no 1 hop and 2 hops, so the cross linkage is correct. Each L3 is directly linked to other L3 caches in a CCX.

Ajay · Aug 12, 2017

Vattila said:
Why is that the best choice? It is sub-optimal. By increasing the core count per CCX, you can no longer feasibly direct-connect the cores (6*5/2 = 15 links!). Why take the penalty when they can do 3 x 4-core CCX instead?

I just thought the impact would be lower by doing it that way (less modification to the existing logic, but non-trivial nonetheless). Ultimately, AMD will take a penalty either way, but yes, the design & verification penalty is higher for a 6 core cluster. At some point, AMD will move past 4 cores/CCX, it probably does make sense for them to wait till they go w/8 core CCXs - unless they plan on grouping based on a 2x2 tile for the future, as mentioned. As discussed above, a bifurcation of the server and client cores will make sense at some point (once certain unit sales are reached in each segment and the overall R&D budget increases). I supposed, given where AMD was in $$s and manpower when this design decision was made, it's most likely 3x4. Sadly, I'm not willing to pony up $1K for Semiaccurate.com to find out what decision AMD made - so I'll have to wait.

William Gaatjes · Aug 12, 2017

maddie said:
Pc Perspective latency test show identical latency between the cores on a single CCX.

This means that there is an equal travel time between all four L3 caches. There is no 1 hop and 2 hops, so the cross linkage is correct. Each L3 is directly linked to other L3 caches in a CCX.

Well, i agree with you both.
It does seem to be so according to this graph.
https://www.pcper.com/reviews/Processors/AMD-Ryzen-and-Windows-10-Scheduler-No-Silver-Bullet

William Gaatjes · Aug 12, 2017

There is still something i do not understand. From the reviews it is mentioned that data goes directly to L2 from main memory and that the L3 is only used as victim cache for expelled data from L2.
Then why, is the core to core connection from L3 to L3 ?
It should be from L2 to L2. What am i missing ?
Then that picture with the links between L3 is kind of wrong.

Ajay · Aug 12, 2017

William Gaatjes said:
There is still something i do not understand. From the reviews it is mentioned that data goes directly to L2 from main memory and that the L3 is only used as victim cache for expelled data from L2.
Then why, is the core to core connection from L3 to L3 ?
It should be from L2 to L2. What am i missing ?
Then that picture with the links between L3 is kind of wrong.

L2$ tags are pushed up to L3$

William Gaatjes · Aug 12, 2017

Ajay said:
L2$ tags are pushed up to L3$

Aha, If i understand correctly : In the tag the memory address is stored.
So in core 1 L3 the tag is compared with the address core 2 wants to access, then core 2 knows the data stored at the memory address is in the L2 of core 1. and then access the core 1, L2.
Or am i wrong ?
Complex matter to keep all the caches coherent.

maddie · Aug 12, 2017

William Gaatjes said:
There is still something i do not understand. From the reviews it is mentioned that data goes directly to L2 from main memory and that the L3 is only used as victim cache for expelled data from L2.
Then why, is the core to core connection from L3 to L3 ?
It should be from L2 to L2. What am i missing ?
Then that picture with the links between L3 is kind of wrong.

AFAIK the L3 cache is really seen as one 8MB unit made up of 4 pieces. Why segmented? I don't know and maybe someone with the knowledge can explain. Power usage? The intra L3 links allow each core equal access to the whole L3.

A single thread can use the full 8MB even if the other 3 cores are idle.

Ajay · Aug 12, 2017

Yeah, that's pretty much it. It's a cloudy Saturday afternoon and I'm to lazy to look up the exact cache tag schema for Zeppelin.

iBoMbY · Aug 12, 2017

My guess is, Zen2 will look similar to this:

William Gaatjes · Aug 12, 2017

maddie said:
AFAIK the L3 cache is really seen as one 8MB unit made up of 4 pieces. Why segmented? I don't know and maybe someone with the knowledge can explain. Power usage? The intra L3 links allow each core equal access to the whole L3.

A single thread can use the full 8MB even if the other 3 cores are idle.

Well, that is where Vatilla made a point. Which i still have difficulty to understand. The L3 addresses are interleaved.It is from a simple view just a ram which must be addressed .
The last 2 bits specify which 2MB bank is selected that is designated by design to a single core of the CCX. I think...
I guess the address interleaving makes decoding maybe more easy to keep the cache coherent ?
I am really reaching here.

William Gaatjes · Aug 12, 2017

Ajay said:
Yeah, that's pretty much it. It's a cloudy Saturday afternoon and I'm to lazy to look up the exact cache tag schema for Zeppelin.

Aha, thank you.
It is funny that we here on the forum try to understand how a design works that extremely bright minds have developed in several years with techniques that are developed since the 80s / 90s when cpus became much faster than main memory and the cpus were endlessly stalling and waiting.

I have this in my bookmarks :
Still interesting read, though:
https://www.extremetech.com/extreme...-why-theyre-an-essential-part-of-modern-chips

Speculation: The CCX in Zen 2

How many cores per CCX in 7nm Zen 2?

4 cores per CCX (3 or more CCXs per die)

6 cores per CCX (2 or more CCXs per die)

8 cores per CCX (1 or more CCXs per die)

Lifer

Lifer

Diamond Member

Golden Member

Lifer

Senior member

Diamond Member

Diamond Member

Senior member

Lifer

Senior member

Diamond Member

Lifer

Lifer

Diamond Member

Lifer

Lifer

Lifer

Lifer

Lifer

Diamond Member

Lifer

Member

Lifer

Lifer