Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Det0x

Golden Member
Sep 11, 2014
Well, since we are on the topic of cache performance and the IO die is rumored to be the same between Zen4 and Zen5, I can share some numbers for what a maxed-out memory subsystem on an 8-core, single-CCD part with V-cache (7800X3D) can do.

SR 2x16 GB A-die
1:1 memory mode @ 6666 MT/s CL26-37-32-30-62
2222 MHz FCLK
[attached screenshot]

Full screenshot with more information and stability tests completed:
[attached screenshot]

2x16 GB A-die
2:1 memory mode @ 8080 MT/s CL32-45-40-44-84
2222 MHz FCLK (2:1 mode)
[attached screenshot]

Full screenshot with more information and stability tests completed:
[attached screenshot]
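As a sanity check for measured numbers like these, the theoretical peak DRAM bandwidth follows directly from the transfer rate: each 64-bit DDR channel moves 8 bytes per transfer. A quick sketch using the two settings quoted above (these are the configured rates, not measured results, and real-world read bandwidth will land below the theoretical peak):

```python
def peak_bandwidth_gbs(mt_per_s: float, channels: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Each 64-bit DDR channel moves 8 bytes per transfer, so
    peak = transfers/s * 8 bytes * channel count.
    """
    return mt_per_s * 1e6 * 8 * channels / 1e9

# The two dual-channel configurations quoted above:
print(peak_bandwidth_gbs(6666))  # ~106.7 GB/s theoretical peak at 6666 MT/s
print(peak_bandwidth_gbs(8080))  # ~129.3 GB/s theoretical peak at 8080 MT/s
```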

*edit*
I can also share some numbers from a comparison I did a while back of a 7950X vs a 7950X3D.
Vanilla Zen4 bandwidth vs X3D Zen4 bandwidth

[attached screenshot] vs [attached screenshot]

Vanilla Zen4 latency vs X3D Zen4 latency
[attached screenshot] vs [attached screenshot]

All this extra gaming performance in Zen4 X3D comes from this little red square
 

ToTTenTranz

Member
Feb 4, 2021
[quoted screenshots: Vanilla Zen4 bandwidth vs X3D Zen4 bandwidth]

Vanilla Zen4 latency VS X3D Zen4 latency
[quoted screenshots]

All this extra gaming performance in Zen4 X3D comes from this little red square

Why does the Zen4 show higher bandwidth / lower latency below 32MB? Are those caches clocking higher?
 

naukkis

Senior member
Jun 5, 2002
"Instructions Per Cycle" means instructions per cycle. Is that so hard to memorize?

Edit, as an example, when one processor spins on a lock for 0.2 ms, and the other for 0.3 ms, which of the two processors got the higher Instructions Per Cycle count?

IPC can be calculated for a game too, including cycles stalled on locking, as said before. But the 5700 vs 5700X comparison removes that argument: both CPUs share the same 8-core CCX at similar clocks, and the 5700 actually has a slightly smaller locking penalty since its northbridge is built in, versus external on the 5700X. So pretty much all of the performance difference comes from the doubled L3 cache. And games shouldn't use spinlocks anyway, as that is a totally inefficient way to handle locking; today's CPU boost algorithms will also reduce CPU performance when spinlocks are used.
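To make the spinlock argument concrete, here is a toy model (all numbers hypothetical): IPC is just retired instructions divided by cycles, and a core spinning on a lock retires lots of cheap instructions while making zero progress, so it can report a *higher* IPC while finishing later.

```python
def ipc(instructions: float, cycles: float) -> float:
    """Instructions per cycle: retired instructions divided by core cycles."""
    return instructions / cycles

# Hypothetical frame: 10M useful instructions at 1.5 IPC on both CPUs.
useful_instr = 10_000_000
useful_cycles = useful_instr / 1.5

# CPU A additionally spins on a lock for 2M cycles, retiring ~2 cheap
# instructions per cycle while it waits (zero useful work done).
spin_cycles = 2_000_000
a_ipc = ipc(useful_instr + 2 * spin_cycles, useful_cycles + spin_cycles)
b_ipc = ipc(useful_instr, useful_cycles)

# CPU A reports the higher IPC despite needing more cycles for the frame.
print(f"A: {a_ipc:.2f} IPC over {useful_cycles + spin_cycles:.0f} cycles")
print(f"B: {b_ipc:.2f} IPC over {useful_cycles:.0f} cycles")
```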
 

naukkis

Senior member
Jun 5, 2002
By looking at CPU performance counters?
Or by looking at Frames Per Second on the display output?
That was the point of #10,817, basically.

A game executes a given number of instructions per frame if stupid things like lock spinning are excluded. And if they are included, the performance that matters is the fps, not a count of non-useful instructions executed. So yes: when comparing game performance, measure fps over any non-revealing metric.
 

StefanR5R

Elite Member
Dec 10, 2016
Instructions Per Frame are not a constant.
Prove me wrong. :-)

(Or don't. Somebody requested this loop to end already a while ago.)

Edit: Those who are interested in game performance in terms of Frames Per Second should, by all means, measure Frames Per Second. But for CPU microarchitecture analyses, like the one in the ISSCC presentation, perhaps additional steps could be taken.

Edit 2: CPU Cycles Per Second aren't a constant in the linked Techspot article either.
 

naukkis

Senior member
Jun 5, 2002
Instructions Per Frame are not a constant.
Prove me wrong. :-)

(Or don't. Somebody requested this loop to end already a while ago.)
Old single-threaded game engines were synced to the frame rate; in that case, instructions per frame is pretty much constant. Multithreaded engines run the game/physics engine asynchronously from the rendering engine, so instructions aren't totally tied to fps, but for the fps-relevant visual part they still pretty much are, at least if there are enough threads to keep the visual side from stalling.
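The "instructions per frame" framing can be sketched with a toy model (all numbers hypothetical): if a frame-synced, single-threaded engine retires a roughly fixed instruction count per frame, fps falls straight out of clock × IPC.

```python
def fps(clock_hz: float, ipc: float, instr_per_frame: float) -> float:
    """Frames per second for a single-threaded, frame-synced engine:
    retired instructions per second divided by instructions per frame."""
    return clock_hz * ipc / instr_per_frame

# Hypothetical engine: 50M instructions per frame on a 5 GHz core.
base = fps(5.0e9, 1.20, 50e6)    # baseline IPC
uplift = fps(5.0e9, 1.32, 50e6)  # same clock, +10% IPC

# With instructions/frame held constant, fps scales linearly with IPC.
print(f"{base:.0f} -> {uplift:.0f} fps (+{uplift / base - 1:.0%})")
```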
 

Ajay

Lifer
Jan 8, 2001


So we now have:

9800X, 8 cores, 170w TDP
Clock regression, ~100Mhz
IPC, ~10% compared to Zen4 <NEW>

OMG. I strongly recommend Mike Clark don't wake up anytime soon and keep sleeping until Zen6.
So we are down to throwing as much spaghetti (G-rated) as possible against the wall to see what sticks, then claiming 100% accuracy in prediction. Seriously, how bent does one have to be?
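For what the rumored numbers would actually mean: per-thread performance is roughly clock × IPC, so a ~100 MHz clock regression against a ~10% IPC uplift still nets out positive. A sketch with the rumored figures (the 5.7 GHz baseline boost is an illustrative assumption, and all inputs are unconfirmed rumor):

```python
def net_gain(old_clock_ghz: float, new_clock_ghz: float, ipc_uplift: float) -> float:
    """Relative 1T performance change: (new/old clock) * (1 + IPC uplift) - 1."""
    return (new_clock_ghz / old_clock_ghz) * (1 + ipc_uplift) - 1

# Rumor: ~100 MHz down from an assumed 5.7 GHz boost, ~10% IPC uplift.
print(f"{net_gain(5.7, 5.6, 0.10):+.1%}")  # still a net single-thread gain
```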
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
No, something must have been mixed up there.
Zen 1 -> Zen 2: circa double the FP throughput per core, circa double the throughput/Watt​
Zen 2 -> Zen 3: some throughput increase but barely any throughput/Watt increase in most cases, big benefit to special multithreaded workloads which have larger than 16 MB cache footprint​
Zen 3 -> Zen 4: notably higher throughput and throughput/Watt, additional performance increase in vectorized FP workloads​
in various Distributed Computing applications. (These are applications which are highly parallel/ almost entirely compute-bound/ power-limited workloads with FP focus. One could conclude that the manufacturing node updates are all what counts in this set of workloads. But really, microarchitecture updates <edit: and SOC updates> and node updates go hand in hand as they enable and leverage each other.)

[I don't have Zen 1/ Naples (but Broadwell-EP which has got similar throughput/Watt), nor do I have Zen 3 myself. I do have Zen 2/ Rome and Zen 4/ Genoa in machines which are configured to same core counts and similar power budgets. My conclusions relative to Zen 1 and Zen 3 rely on what I have seen from others' computers.]

Zen 5 in Distributed Computing? I trust that AMD carves out a decent perf/W update once again, despite only a minor manufacturing node update. But how much? Various hints earlier in this thread sounded promising to me. Though so far, 1T and/or iso-clock and/or integer performance characteristics have been more of a focus in this thread than nT iso-power FP.


Actually SMT does measurably improve throughput in PrimeGrid on Zen 4, desktop and server, and does improve perf/W slightly. In contrast, on Zen 2 and Zen 3, SMT usage in PrimeGrid provides no or sometimes a small host throughput advantage but always reduces perf/W. (PrimeGrid is vectorized FP with large cache footprint, but not too large on Zen 3 and 4 if the user gives hints to the OS's process scheduler. Zen 2's cache is too small in many but not all of PrimeGrid's currently active projects.)
My take, I guess, is just my own, but it's close to yours. I think two points are shared: efficiency is key for us, and AVX-512 helps a lot.
 

StefanR5R

Elite Member
Dec 10, 2016
My take I guess is just my own, but its close to yours.
Well, not owning Zen 1 and Zen 3 myself, I don't ultimately trust my own assessments of them. Though back in the day, the Zen 1-->2 step evidently was a big one in perf/host and perf/W thanks to the Glofo 14nm --> TSMC 7nm switch, but not only due to that as the Zen 2 core and SOC update was far from a straightforward shrink.

The step which lies ahead, TSMC 5nm --> 4nm, will be nothing in comparison. Yet AMD appears to widen the core a lot, presumably putting a lot of smarts into the frontend to actually be able to put this width to use, while at the same time keeping the power budget per core practically unchanged. I am really curious how that will turn out in power-limited loads.

Efficiency is key for us,
Yep, as the aggregate core count in the household reaches certain above-average levels, and many of these cores are actually used 24/7 (be it for Citizen Science or for engineering jobs etc.), small things like the electric bill, the heat load in the home, or which computer to attach to which power circuit do become more of a concern. I find myself thinking more often in terms of perf/host and perf/W than perf/core. So, while the (alas rather circular) iso-clock performance discussions here in this thread are surely interesting (vulgo: IPC), what I am looking forward to more is to eventually get to see perf/W figures.
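The perf/host vs perf/W framing above reduces to a couple of trivial ratios; a minimal sketch with entirely hypothetical numbers, just to show how the efficiency view can invert the per-core view:

```python
def perf_per_watt(throughput: float, watts: float) -> float:
    """Throughput per watt, e.g. tasks/day per W at the wall."""
    return throughput / watts

def perf_per_host(throughput_per_core: float, cores: int) -> float:
    """Aggregate host throughput, assuming homogeneous cores."""
    return throughput_per_core * cores

# Hypothetical: a wide, lower-clocked config vs a narrow, higher-clocked
# one at the same wall power. Per-core, the narrow config looks better;
# per-host and per-watt, the wide one wins.
wide = perf_per_host(0.9, 32)    # 32 slower cores
narrow = perf_per_host(1.0, 24)  # 24 faster cores
print(perf_per_watt(wide, 280), perf_per_watt(narrow, 280))
```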
 

StefanR5R

Elite Member
Dec 10, 2016
For compute nodes,
– CPUs with cores of uneven per-core performance,​
– area-optimized cores​
are not attractive. You'd want
+ CPUs with homogeneous cores,​
+ cores and SOCs which are optimized towards a certain point between the three targets performance, performance efficiency, and performance density.​
The particular location of the optimization sweet spot depends on your cost structure (e.g. whether or not there are software licensing costs involved; whether or not rack space is at a premium to you…).

Edit, that's also true for home computers, if used for computing in the narrower sense, "HPC at home" if you will. E.g. when I built my first two dual-socket computers a while back, I needed not just plain perf/dollar (which would have been much better with desktop computers) but also perf/node (due to synchronization overhead in my application, which was too high over Ethernet for my purpose) and perf/core (due to scaling difficulties in this application). If CPUs with "e cores" had been available back then, they would not have been what I needed due to the latter aspect. Edit 2: nowadays I have accumulated enough computers that "rack space" (shelf space actually) is definitely a criterion to me too. (Energy consumption more so, though.)
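The perf/node point (synchronization overhead over Ethernet) can be modeled crudely: if each iteration does a fixed amount of serialized compute plus a fixed per-step synchronization cost, then fewer, faster nodes win once the sync cost is significant. All numbers below are hypothetical:

```python
def effective_throughput(nodes: int, per_node_rate: float,
                         t_compute: float, t_sync: float) -> float:
    """Iterations/s for a sync-bound job: each step costs compute time
    (shrinking with aggregate node rate) plus a fixed synchronization cost."""
    step_time = t_compute / (nodes * per_node_rate) + t_sync
    return 1.0 / step_time

# Hypothetical job: 1.0 s of compute per step on one unit-rate node.
# Equal aggregate rate (4x), but more nodes pay more synchronization.
few_fat = effective_throughput(2, 2.0, 1.0, 0.010)    # 2 fast nodes, 10 ms sync
many_thin = effective_throughput(8, 0.5, 1.0, 0.040)  # 8 slow nodes, 40 ms sync
print(few_fat, many_thin)
```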
 

Fjodor2001

Diamond Member
Feb 6, 2010
For compute nodes,
– CPUs with cores of uneven per-core performance,​
– area-optimized cores​
are not attractive. […]
I guess it depends on what workloads you are running. Most people with DT systems do not have them mounted in racks. So space is not really a concern.

I think for a typical DT user with mixed workloads this is more important:
1. Max ST performance up to a certain number of cores, e.g. ~8C.
2. For use cases scaling above ~8C, max MT perf, max perf/watt, and the highest E-core count for the lowest price.

For 1) you want P cores, and for 2) you want E cores. Those only needing 1) can be satisfied with ~8 P cores alone.