Speculation: Ryzen 4000 series/Zen 3


DisEnchantment

Golden Member
Mar 3, 2017
1,622
5,891
136
Looking at this commit in the Linux kernel

...because future AMD systems will support up to 64 MCA banks per CPU.

MAX_NR_BANKS is used to allocate a number of data structures, and it is
used as a ceiling for values read from MCG_CAP[Count]. Therefore, this
change will have no functional effect on existing systems with 32 or
fewer MCA banks per CPU.

Current MCA banks in Family 17h



This is really a major architectural change. I wonder how many new blocks are now capable of supporting MCA banks.
There are going to be new and updated blocks, because new registers and status fields would need to be wired up around the new stuff.
PSP for sure will be there.

Are there going to be new decomposable Lego blocks? Right now it is a black hole as far as leaks from AMD go.

Every time I find new changes in the manuals and the kernel, I keep wondering about Forrest's comment about Zen 3.

When asked about what kind of performance gain Milan's CPU core microarchitecture, which is known as Zen 3, will deliver relative to the Zen 2 microarchitecture that Rome relies on in terms of instructions processed per CPU clock cycle (IPC), Norrod observed that -- unlike Zen 2, which was more of an evolution of the Zen microarchitecture that powers first-gen Epyc CPUs -- Zen 3 will be based on a completely new architecture.
Norrod did qualify his remarks by pointing out that Zen 2 delivered a bigger IPC gain than what's normal for an evolutionary upgrade -- AMD has said it's about 15% on average -- since it implemented some ideas that AMD originally had for Zen but had to leave on the cutting board. However, he also asserted that Zen 3 will deliver performance gains "right in line with what you would expect from an entirely new architecture."
 

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
I think given current global circumstances, the wait for the next Zen processor is excruciatingly painful because most of us are stuck at home and really can't go out. If this were any other year, time would fly faster. I too have wondered just how much weight his words carry, @DisEnchantment! I'm keeping my expectations to minimal levels as much as I can.
 

moinmoin

Diamond Member
Jun 1, 2017
4,967
7,715
136
Current MCA banks in Family 17h
How does that work with Zen 2 MCM packages? As we can see, the number of L3 blocks is 8, fitting for a single chiplet with 8 cores and their L3 slices, but not sufficient for multiple of them. Does Zen 2 have separate MCA banks per chiplet instead of per CPU, and may Zen 3 be unifying that as well?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,622
5,891
136
How does that work with Zen 2 MCM packages? As we can see, the number of L3 blocks is 8, fitting for a single chiplet with 8 cores and their L3 slices, but not sufficient for multiple of them. Does Zen 2 have separate MCA banks per chiplet instead of per CPU, and may Zen 3 be unifying that as well?
The banks are logical, and in the end they produce statuses that are reported in the MCA registers. Some are per core and some are global.
A thread on a core can only see its own registers.
But it is not a trivial change to wire up 64 banks, and the bigger question is the necessity: what is the drastic change that makes this necessary?
The previous MCA changes, which I posted before, also indicate AMD has a new LS unit and a new L3 subsystem.
So far, even BD to Zen did not change as much in the kernel and in the manuals.
 

maddie

Diamond Member
Jul 18, 2010
4,767
4,732
136
They have to solve the latency issue first.
I don't think that's as big an issue as you claim. A long time ago, I remember hearing a GPU designer claim that they could work around latency issues fairly easily. I don't think there's a lot of branching compared to most CPU programs.
 

Vattila

Senior member
Oct 22, 2004
800
1,363
136
I have seen and read that; at some point I just stopped bothering to respond.

I have learnt to never attribute accuracy to bombastic statements, especially those lacking sound argument. So thanks for arguing your case. However, please be less dismissive and more tolerant of another point of view (even if you experience irritation at my ignorance, as the case may be).

You may be a world class circuit designer for all that I know, while I am certainly not (I have just a computing science degree with a very basic education in this area). I am sorry if this further annoys you, but I am still not convinced by your arguments.

Where our interpretations of the topology differ is that you are convinced the L2 controller does the routing, while I think (note the difference) the L3 controller does. Based on the disclosed data we have, I think this makes the most sense.

These contrasting views make a dramatic difference on the complexity of the topology. In your case, it means connecting every L2 controller to every L3 controller (4 x 4 = 16), while in my case, it means connecting all the L3 controllers (4 x (4 - 1) / 2 = 6, plus 4 ordinary L2 to L3 connections, for a total of 10). That is a sizable difference for linking up the cores.

Any fully connected scheme does not scale. That you think your scheme does, even to 8 cores for "Zen 3" (8 x 8 = 64 links), twitches the eyebrow. That you think links are low cost (area, power, switch complexity), I find puzzling. The way you downplay this, while overplaying contrived counter-arguments and flat out dismissing contradictory evidence, weakens your argument rather than convincing me.

As @NostaSeronx points out, there is evidence for non-uniform L3 slice latency. Your dismissal is unconvincing. The details of the methodology used to arrive at those measurements are unclear (they may include averaging and second-order effects, such as prefetching and buffering effects), so I do not read much into their magnitude.

PS. For readers confused by this discussion, here is a simple picture of the gist of it. Pick your camp.

Zen L3 Interconnect.png
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
The L2$ controllers aren't a crossbar, they are simple p2p switches. A crossbar would be all 8 L2 controllers being connected to a fully meshed switch which is then connected to each L3 slice. That, or I've completely forgotten what I learned from the EEs while working at an enterprise network hardware company.
There is no time for the signals to go through a crossbar switch. We are talking about a situation where just wire delay is significant. A crossbar switch is made to switch any input to any output simultaneously; that isn't a simple circuit, and the latency would be too high. You don't need a crossbar switch for direct connection, since it is just one-to-many, not many-to-many.

There would be some control circuitry, buffers or queues, etc., but basically, with each core connected directly to all slices, you just need a 1-to-4 multiplexer (very simple) on each slice and a 4-to-1 multiplexer on each core. There is no direct connection between slices; connections are between cores and slices. The complexity is in determining the location of the cache line and accessing it with good latency, not in transferring it. That is a massive simplification, but I think the general idea is correct. Cache design has not been simple in a long time; it is just about the most important part of the chip.

Wide interconnect is very common on chip. Wide connections aren't that big of a problem. Long interconnect is more of an issue; the parasitic resistance and capacitance slow it down and take a lot of power. What allows them to use a 32 MB cache is probably that the cache is now small enough to allow the interconnect to be reasonably short. The direct connection probably can't scale to much larger caches / core counts, though. Intel used a ring bus, but that was not that scalable either; I don't think they ever went more than 10 cores per ring, and the higher-core-count chips with ring buses had 2 or 3 separate rings. The mesh network Intel uses now does provide good latency to any slice, but it only goes up to ~37.5 MB / 28 cores, and it seems to use a lot of power, since it is sending signals long distances at high clock across the chip. AMD's architecture only sends signals at core clock within the tiny CCX. If you need to send something farther, it is slower, but it is also a lot lower power, and you generally don't need to do that very often. It is (obviously) much more scalable than Intel's "scalable" Xeons. EPYC processors can have 256 MB of L3 cache. You could not provide access to such a large cache in a monolithic manner without blowing up the access latency and the power consumption.
 

Vattila

Senior member
Oct 22, 2004
800
1,363
136
[L3 crossbar] is not the conventional interpretation

According to WikiChip: "The CCX itself was designed such that the L3 acts as a crossbar for each of the four cores."

 

Vattila

Senior member
Oct 22, 2004
800
1,363
136
both your and Vattila's proposals are topologically the same as far as the 4x L3$ is concerned - just drawn differently.

Not quite. It is a good point that both views are functionally equivalent. However, the topology is dramatically simplified by making the routing to distant L3 slices go through the local slice. In effect, it turns one of the switch connection points into an endpoint. This can be illustrated by focusing on a single switch in both schemes.

Zen L3 Interconnect Switches.png
 

Bigos

Member
Jun 2, 2019
131
295
136
The diagram below is missing half of the connections between L3 slices. What you draw as a single connection between L3 slices should actually be 2 connections, one to go from slice 0 to slice 1 and one from slice 1 to slice 0. Unless you assume there is contention between "remote" L3 slices, which we should be able to see in a test (i.e. if core 0 accesses slice 1 and core 1 accesses slice 0, the throughput should be half of when both cores access their local slices only).

With the 6 between-slice connections doubled, there are the same number of connections in both diagrams, making them equivalent.
 

eek2121

Platinum Member
Aug 2, 2005
2,931
4,027
136
Don't think that's as big an issue as you claim. A long time ago, I remember hearing a GPU designer claim that they could work around latency issues fairly easily. I don't think there's a lot of branching as compared to most CPU programs.

It isn’t possible to just “work around” latency issues. There will always be additional latency with MCM based designs. For instance, Zen 2 latency triples when crossing a CCD barrier.

PCIe itself also has high latency that GPU vendors have to contend with, though admittedly PCIe 4.0 helps.

Note that I am not saying that MCM solutions can’t work. I firmly believe that the gaming industry will have a day of reckoning, where they will have to rethink how game engines are designed and developed. It will no longer be possible and/or appropriate for game engines to simply throw frames up as quickly as possible. Certain technologies like VRS combined with a drastically different GPU algorithm and other technologies will push game engines into more of a passive, asynchronous (possibly even callback driven? That’s an interesting idea) workflow.

I am getting a bit off-topic here, however, my core point still stands: MCM-based designs have their own hurdles to deal with. They aren’t a magical fix for everything.
 

eek2121

Platinum Member
Aug 2, 2005
2,931
4,027
136
Completely new architecture means completely new architecture. If it didn't, it wouldn't be a completely new architecture.

Technically Ryzen was a completely new architecture and it gave us a 52% improvement. *ducks*

EDIT: Can you imagine Zen 3 having 50% higher performance? That would be nuts! Note that I don't in any way think it will happen, but one can dream...
 
Feb 17, 2020
100
245
116
Technically Ryzen was a completely new architecture and it gave us a 52% improvement. *ducks*

EDIT: Can you imagine Zen 3 having 50% higher performance? That would be nuts! Note that I don't in any way think it will happen, but one can dream...

That 52% was compared to Bulldozer though, and literally anything would have been better than Bulldozer.

Given Forrest's comments and Zen 2's 15% IPC uplift over Zen 1, I'd consider a Zen 2 to Zen 3 IPC uplift below 20% to be disappointing.
 

eek2121

Platinum Member
Aug 2, 2005
2,931
4,027
136
That 52% was compared to Bulldozer though, and literally anything would have been better than Bulldozer.

Given Forrest's comments and Zen 2's 15% IPC uplift over Zen 1, I'd consider a Zen 2 to Zen 3 IPC uplift below 20% to be disappointing.

Yes, because it was a completely new architecture. ;)
 

DrMrLordX

Lifer
Apr 27, 2000
21,694
10,964
136
I don't think AMD is projecting 50%+ performance improvements in Zen3 except in certain FP workloads. Which means they're probably moving towards 3x or 4x256b FMACs (away from 2x256b).
 

yuri69

Senior member
Jul 16, 2013
395
635
136
Everytime I go find new changes in the manuals and the kernel changes, I keep wondering about Forrest's comment about Zen3.
Zen 3 features a different CCX layout, implying a new L3 compared to Family 17h. This fact alone makes it a different AMD architecture...
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Not correct - AMD compared Zen (Summit Ridge) to Excavator (2015); Bulldozer was the first gen, released in October 2011.
https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
https://en.wikipedia.org/wiki/Excavator_(microarchitecture)

Yup, huge difference there. Even the difference between BD and Piledriver is very significant (BD was really much more of a prototype / 0th-gen / pre-release than a real, normal 1st-gen product).

XV had quite decent IPC.

Now, if XV had had an L3, I wonder how it would have compared to Zen 1.

Or perhaps someone can figure out how to benchmark the two with L3 disabled (or cut down to something like 512 KB per core, which would even out the cache between the two).
 

eek2121

Platinum Member
Aug 2, 2005
2,931
4,027
136
Zen 3 features a different CCX layout, implying a new L3 compared to Family 17h. This fact alone makes it a different AMD architecture...

We are assuming Zen 3 has the concept of a CCX at all. In an ideal world, AMD wouldn't even have an IO die; the chiplets themselves would be completely self-contained.