Everything in computing is a balance of complexity vs. speed. IBM's POWER8 seems to use the most efficient core-complex size of three. Note: AMD and IBM worked closely together in the past, and I think AMD's current design borrows quite a bit from the POWER8 architecture.
Intel, on the other hand, used a ring bus until Skylake-SP/X.
The ring bus has the advantage of relatively flat latency up to 8 cores (~80 ns). That is slower than AMD's intra-CCX latency (~40 ns), yet faster than Infinity Fabric between CCXs (~90-140 ns). However, once the ring bus was extended to 16 or more cores, it exhibited the same distance-dependent latency that AMD has. I do not know the exact numbers.
So Intel is going with a mesh bus. Again, I do not know the numbers, but I would surmise latency scales by a nearest-neighbor factor: the farther a request travels, the more hops it incurs.
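To make the distance-dependent latency concrete, here is a toy model (entirely my own illustration; the topology sizes are assumptions and no real per-hop timings are used) comparing worst-case hop counts on a ring versus a 2D mesh as the core count grows:

```python
# Toy model: core-to-core distance on a ring vs. a 2D mesh.
# Illustrative only -- not measured Intel/AMD figures.

def ring_hops(a, b, n):
    """Hop count between stops a and b on an n-stop ring
    (messages take the shorter direction around)."""
    d = abs(a - b)
    return min(d, n - d)

def mesh_hops(a, b, cols):
    """Manhattan distance between cores a and b laid out
    row-major on a grid with `cols` columns."""
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)

def max_hops(n, dist):
    """Worst-case distance over all core pairs."""
    return max(dist(a, b) for a in range(n) for b in range(n))

for n in (8, 16, 32):
    cols = max(1, round(n ** 0.5))  # roughly square grid
    r = max_hops(n, lambda a, b: ring_hops(a, b, n))
    m = max_hops(n, lambda a, b: mesh_hops(a, b, cols))
    print(f"{n} cores: ring worst case {r} hops, mesh worst case {m} hops")
```

The ring's worst case grows linearly with core count (n/2 hops), while the mesh grows roughly with the square root, which is presumably why Intel switched once core counts passed the point where the ring stayed flat.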
Oh, I'm sure they use parallel hardware searches (all ways of a set compared at once).
The nature of a cache is that you have a block of memory, you subdivide it by way, and then map memory into each block. What we have seen is that the fewer ways you have, the faster the cache can be; the same can be said for cache size. AMD has gone a step further and subdivided the cache per core. I would imagine this has the effect of making it act like a 4 × 16 = 64-way cache with many restrictions, the most notable being that each core can only write to its own victim area.
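To illustrate the way/block subdivision, here is a sketch of how a set-associative cache splits an address into tag, set index, and byte offset. The parameters loosely echo a Zen CCX L3 (8 MB, 16-way, 64-byte lines) but are illustrative assumptions, not a precise model of AMD's hardware:

```python
# Sketch of set-associative address mapping.
# Parameters are assumptions loosely matching a Zen CCX L3,
# not a precise model of AMD's implementation.

LINE_BYTES = 64
WAYS = 16
CACHE_BYTES = 8 * 1024 * 1024
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 8192 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 6 bits
INDEX_BITS = NUM_SETS.bit_length() - 1          # 13 bits

def decompose(addr):
    """Split a physical address into (tag, set_index, byte_offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# On a lookup, the hardware reads all WAYS tags stored in set `index`
# and compares them against `tag` in parallel. Fewer ways means fewer
# comparators and less wiring, which is one reason lower associativity
# can run faster.
```

Subdividing that cache per core, as the post speculates, would restrict which ways of a set each core may fill while still letting every core hit in any way on a read.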
The whole nature of this speculation is: what is the optimal complexity to run at the fastest speed?