Speculation: Ryzen 4000 series/Zen 3

A/// · Aug 24, 2020

I think given current global circumstances, the wait for the next Zen processor is excruciatingly painful because most of us are stuck at home and really can't go out. If this were any other year, time would fly faster. I too have wondered just how much weight his words carry, @DisEnchantment! I'm keeping my expectations to minimal levels as much as I can.

moinmoin · Aug 24, 2020

DisEnchantment said:
Current MCA banks in Family 17h

How does that work with Zen 2 MCM packages? As we can see the amount of L3 blocks is 8, fitting for a single chiplet with 8 cores and their L3 slices, but not sufficient for multiple of them. Does Zen 2 have separate MCA banks per chiplet instead per CPU, and Zen 3 may be unifying that as well?

DisEnchantment · Aug 24, 2020

moinmoin said:
How does that work with Zen 2 MCM packages? As we can see the amount of L3 blocks is 8, fitting for a single chiplet with 8 cores and their L3 slices, but not sufficient for multiple of them. Does Zen 2 have separate MCA banks per chiplet instead per CPU, and Zen 3 may be unifying that as well?

The banks are logical and in the end they produce status which are present in the MCA registers. Some are per core and some are global.
A thread on a core can only see its own registers.
But it is not trivial change to wire up 64 banks and bigger question is what is the necessity, what is the drastic change that deem this necessary.
Previous MCA change which I posted before also indicate AMD has a new LS unit, new L3 subsystem.
So far BD to Zen did not change as much in the kernel and in the Manuals.

maddie · Aug 24, 2020

eek2121 said:
They have to solve the latency issue first.

Don't think that's as big an issue as you claim. A long time ago, I remember hearing a GPU designer claim that they could work around latency issues fairly easily. I don't think there's a lot of branching as compared to most CPU programs.

trexfromouterspace · Aug 25, 2020

DisEnchantment said:
Everytime I go find new changes in the manuals and the kernel changes, I keep wondering about Forrest's comment about Zen3.

Completely new architecture means completely new architecture. If it didn't it wouldn't be a completely new architecture.

Vattila · Aug 25, 2020

Tuna-Fish said:
I have seen and read that, at some point I just stopped bothering to respond.

I have learnt to never attribute accuracy to bombastic statements, especially those lacking sound argument. So thanks for arguing your case. However, please be less dismissive and more tolerant of another point of view (even if you experience irritation at my ignorance, as the case may be).

You may be a world class circuit designer for all that I know, while I am certainly not (I have just a computing science degree with a very basic education in this area). I am sorry if this further annoys you, but I am still not convinced by your arguments.

Where our interpretations on topology differ is that you are convinced the L2 controller does the routing, while I think (note the difference) the L3 controller does. Based on the disclosed data we have, I think this makes most sense.

These contrasting views make a dramatic difference on the complexity of the topology. In your case, it means connecting every L2 controller to every L3 controller (4 x 4 = 16), while in my case, it means connecting all the L3 controllers (4 x (4 - 1) / 2 = 6, plus 4 ordinary L2 to L3 connections, for a total of 10). That is a sizable difference for linking up the cores.

Any fully connected scheme does not scale. That you think your scheme does, even to 8 cores for "Zen 3" (8 x 8 = 64 links), twitches the eyebrow. That you think links are low cost (area, power, switch complexity), I find puzzling. The way you downplay this, while overplaying contrived counter-arguments and flat out dismissing contradictory evidence, weakens your argument rather than convincing me.

As @NostaSeronx points out, there is evidence for non-uniform L3 slice latency. Your dismissal is unconvincing. The details of the methodology used to arrive at those measurements are unclear (they may include averaging and second-order effects, such as prefetching and buffering effects), so I do not read much into their magnitude.

PS. For readers confused by this discussion, here is a simple picture of the gist of it. Pick your camp.

jamescox · Aug 25, 2020

Ajay said:
The L2$ controllers aren't a crossbar, they are simple p2p switches. A crossbar would be all 8 L2 controllers being connected to a fully meshed switch which is then connected to each L3 slice. That, or I've completely forgotten what I learned from the EEs while working at an enterprise network hardware company.

There is no time for the signals to go through a crossbar switch. We are talking about a situation where just wire delay is significant. A crossbar switch is made to switch any input to any output simultaneously. That isn’t a simple circuit. The latency would be to high. You don’t need a crossbar switch for direct connection since it is just 1 to many not many to many. There would be some control circuitry, buffers or queues, etc but basically with each core connected directly to all slices, you just need a 1 to 4 multiplexor (very simple) on each slice and a 4 to 1 multiplexor on each core. There is no direct connection between slices. Connections are between cores and slices. The complexity is is determining the location of the cache line and accessing it with good latency, not transferring it. That is a massive simplification, but I think the general idea is correct. Cache design has not been simple in a long time; it is just about the most important part of the chip.

Wide interconnect is very common on chip. Wide connections aren’t that big of a problem. Long interconnect is more of an issue; the parasitic resistance and capacitance slows it down and takes a lot of power. What allows them to use a 32 MB cache is probably that the cache is now small enough to allow the interconnect to be reasonably short. The direct connection probably can’t scale to much larger caches / core counts though. Intel used a ring bus, but that was not that scalable either. I don’t think they ever went more than 10 cores per ring. The higher core count chips with ring buses had 2 or 3 separate rings. The mesh network Intel uses now does provide good latency to any slice, but it only goes up to ~37.5 MB / 28 core and it seems to use a lot of power. It is sending signals long distance at high clock across the chip. AMD’s architecture only sends signals at core clock within the tiny CCX. If you need to send something farther, it is slower, but it is also a lot lower power. You also generally don’t need to do that very often. It is (obviously) much more scalable than intel’s “scalable“ xeons. EPYC processor can have 256MB of L3 cache. You could not provide access to such a large cache in a monolithic manner without blowing up the access latency and the power consumption.

Vattila · Aug 25, 2020

Tuna-Fish said:
[L3 crossbar] is not the conventional interpretation

According to WikiChip: "The CCX itself was designed such that the L3 acts as a crossbar for each of the four cores."

AMD's Zen CPU Complex, Cache, and SMU

A look at AMD's Zen CPU Complex (CCX), a fully independent and modular cluster of up to four cores that are incorporated into a full SoC to form complete products such as their Zeppelin die.

fuse.wikichip.org

Vattila · Aug 25, 2020

Thala said:
both your and Vattilas proposal are topologically the same as far as 4xL3$ is concerned - just drawn differently.

Not quite. It is a good point that both views are functionally equivalent. However, the topology is dramatically simplified by making the routing to distant L3 slices go through the local slice. In effect, it turns one of the switch connection points into an endpoint. This can be illustrated by focusing on a single switch in both schemes.

Bigos · Aug 25, 2020

The diagram below is missing half of the connections between L3 slices. What you draw as a single connection between L3 slices should actually be 2 connections, to go from slice 0 to slice 1 and from slice 1 to slice 0. Unless you assume there is a contention between "remote" L3 slices, which we should be able to see on a test (i.e. if core 0 accesses slice 1 and core 1 accesses slice 0 the throughput should be half of when both cores access their local slices only).

With the 6 between-slice connections doubled, there are the same number of connections in both diagrams, making them equivalent.

eek2121 · Aug 25, 2020

maddie said:
Don't think that's as big an issue as you claim. A long time ago, I remember hearing a GPU designer claim that they could work around latency issues fairly easily. I don't think there's a lot of branching as compared to most CPU programs.

It isn’t possible to just “work around” latency issues. There will always be additional latency with MCM based designs. For instance, Zen 2 latency triples when crossing a CCD barrier.

PCIE itself has also has high latency that GPU vendors have to contend with, though admittedly, PCIE 4.0 helps.

Note that I am not saying that MCM solutions can’t work. I firmly believe that the gaming industry will have a day of reckoning, where they will have to rethink how game engines are designed and developed. It will no longer be possible and/or appropriate for game engines to simply throw frames up as quickly as possible. Certain technologies like VRS combined with a drastically different GPU algorithm and other technologies will push game engines into more of a passive, asynchronous (possibly even callback driven? That’s an interesting idea) workflow.

I am getting a bit off-topic here, however, my core point still stands: MCM-based designs have their own hurdles to deal with. They aren’t a magical fix for everything.

eek2121 · Aug 25, 2020

trexfromouterspace said:
Completely new architecture means completely new architecture. If it didn't it wouldn't be a completely new architecture.

Technically Ryzen was a completely new architecture and it gave us a 52% improvement. *ducks*

EDIT: Can you imagine Zen 3 having 50% higher performance? That would be nuts! Note that I don’t in any way it will happen, but one can dream...

trexfromouterspace · Aug 25, 2020

eek2121 said:
Technically Ryzen was a completely new architecture and it gave us a 52% improvement. *ducks*

EDIT: Can you imagine Zen 3 having 50% higher performance? That would be nuts! Note that I don’t in any way it will happen, but one can dream...

That 52% was compared to Bulldozer though, and literally anything would have been better than Bulldozer.

Given Forrest's comments and Zen 2's 15% IPC uplift over Zen 1, I'd consider a Zen 2 to Zen 3 IPC uplift below 20% to be disappointing.

eek2121 · Aug 25, 2020

trexfromouterspace said:
That 52% was compared to Bulldozer though, and literally anything would have been better than Bulldozer.

Given Forrest's comments and Zen 2's 15% IPC uplift over Zen 1, I'd consider a Zen 2 to Zen 3 IPC uplift below 20% to be disappointing.

Yes, because it was a completely new architecture.

DrMrLordX · Aug 25, 2020

I don't think AMD is projecting 50%+ performance improvements in Zen3 except in certain FP workloads. Which means they're probably moving towards 3x or 4x256b FMACs (away from 2x256b).

rainy · Aug 25, 2020

trexfromouterspace said:
That 52% was compared to Bulldozer though, and literally anything would have been better than Bulldozer.

Not correct - AMD compared Zen (Summit Ridge) to Excavator (2015), Bulldozer was a first gen released on October 2011.
https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
https://en.wikipedia.org/wiki/Excavator_(microarchitecture)

exquisitechar · Aug 25, 2020

trexfromouterspace said:
That 52% was compared to Bulldozer though, and literally anything would have been better than Bulldozer.

Given Forrest's comments and Zen 2's 15% IPC uplift over Zen 1, I'd consider a Zen 2 to Zen 3 IPC uplift below 20% to be disappointing.

>=20% is happening. Apparently, Zen 4 is even bigger. Fun times ahead.

trexfromouterspace · Aug 25, 2020

rainy said:
Not correct - AMD compared Zen (Summit Ridge) to Excavator (2015), Bulldozer was a first gen released on October 2011.
https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
https://en.wikipedia.org/wiki/Excavator_(microarchitecture)

Excavator was still a pile of garbage though. Lipstick on a pig if you will.

Thibsie · Aug 25, 2020

trexfromouterspace said:
Excavator was still a pile of garbage though. Lipstick on a pig if you will.

I just love that expression in English.
French isn't funny as that one.

yuri69 · Aug 25, 2020

DisEnchantment said:
Everytime I go find new changes in the manuals and the kernel changes, I keep wondering about Forrest's comment about Zen3.

Zen3 features a different CCX layout implying a new L3 compared to family 17h. This fact alone makes it a different AMD architecture...

amd6502 · Aug 25, 2020

rainy said:
Not correct - AMD compared Zen (Summit Ridge) to Excavator (2015), Bulldozer was a first gen released on October 2011.
https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
https://en.wikipedia.org/wiki/Excavator_(microarchitecture)

Yup, huge difference there. Even the difference btwn BD and Piledriver is very significant (BD was really much more of prototype / 0-th gen / or pre-release rather than a real normal 1st gen product.)

XV had quite decent IPC.

Now if XV had an L3 I wonder how it would have compared to Zen1.

Or, perhaps someone can figure out how to benchmark the two with L3 disabled (or cut down to something like 512kb per core, which would even out the cache between the two).

eek2121 · Aug 25, 2020

yuri69 said:
Zen3 features a different CCX layout implying a new L3 compared to family 17h. This fact alone makes it a different AMD architecture...

We are assuming Zen 3 has the concept of a CCX at all. In an ideal world AMD wouldn’t even have an IO die. The chiplets themselves would be completely self contained.

lobz · Aug 25, 2020

Thibsie said:
I just love that expression in English.
French isn't funny as that one.

Why, what is that? Snails on a baguette?

OK that was horrible, I'm outta here.

Vattila · Aug 26, 2020

Bigos said:
[Your topology diagram] is missing half of the connections between L3 slices. What you draw as a single connection between L3 slices should actually be 2 connections, to go from slice 0 to slice 1 and from slice 1 to slice 0. [...] With the 6 between-slice connections doubled, there are the same number of connections in both diagrams, making them equivalent.

From an abstract topology standpoint the two schemes are very different. A link is seen as a single bidirectional connection. But of course, a sparser network will in general have more contention on the links.

The scaling of the capacity of the links depends on second-order design concerns, such as traffic patterns, contention, power budget and so on. The implementation of the links also has many options depending on the design targets — lanes, width, speed, signalling, protocol, to name a few.

It is not a given that a doubling of the capacity of the links is needed. For example, simultaneous L3 requests from different cores may be relatively rare. Or, they may still be sufficiently infrequent to be efficiently interleaved by buffering when they occur. (This may be what lead designer Mike Clark was alluding to when, in the Q&A after his Hot Chips presentation, he was asked how the L3 deals with simultaneous requests, to which he answered "we have buffering around it to handle that".)

NostaSeronx · Aug 26, 2020

Core A L2 sends request for read from L3 to cluster core interface A, the shared memory table says the data isn't in L3 slice A but is in L3 slice D.

Slice L3 A & Slice L3 B are in the same row => +2 for non-local L3 read
Slice L3 A & Slice L3 C are in the same column => +4 for non-local L3 read
Slice L3 A & Slice L3 D aren't in the same row or column => +6 for non-Local L3 read

So, cluster core interface D loads it to a low-latency queue and ships it off to cluster core interface A low-latency queue which is then stored in the L2.

Cluster core interfaces operate like L3 LD/ST units.

A single core loads a line => invalid to other cores.
A couple core loads a line => valid to other cores.

Speculation: Ryzen 4000 series/Zen 3

Diamond Member

Diamond Member

Golden Member

Diamond Member

Member

Senior member

Senior member

Senior member

Senior member

Senior member

Diamond Member

Diamond Member

Member

Diamond Member

Lifer

Senior member

Senior member

Member

Golden Member

Senior member

Senior member

Diamond Member

Platinum Member

Senior member

Diamond Member