Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Det0x

Golden Member
Sep 11, 2014
Well, since we are on the topic of cache performance and the IO die is rumored to be the same between Zen4 and Zen5, I can share some numbers for what a maxed-out memory subsystem on an 8-core, single-CCD part with V-cache (7800X3D) can do.

SR 2x16 GB A-die
1:1 memory mode @ 6666 MT/s CL26-37-32-30-62
2222 MHz FCLK
[attached screenshot]

Full screenshot with more information and stability tests completed:
[attached screenshot]

2x16 GB A-die
2:1 memory mode @ 8080 MT/s CL32-45-40-44-84
2222 MHz FCLK (2:1 mode)
[attached screenshot]

Full screenshot with more information and stability tests completed:
[attached screenshot]
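As a sanity check for measured numbers like these, the theoretical peak DRAM bandwidth follows directly from the transfer rate: each 64-bit DDR channel moves 8 bytes per transfer. A quick sketch using the two settings quoted above (these are the configured rates, not measured results, and real-world read bandwidth will land below the theoretical peak):

```python
def peak_bandwidth_gbs(mt_per_s: float, channels: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Each 64-bit DDR channel moves 8 bytes per transfer, so
    peak = transfers/s * 8 bytes * channel count.
    """
    return mt_per_s * 1e6 * 8 * channels / 1e9

# The two dual-channel configurations quoted above:
print(peak_bandwidth_gbs(6666))  # ~106.7 GB/s theoretical peak at 6666 MT/s
print(peak_bandwidth_gbs(8080))  # ~129.3 GB/s theoretical peak at 8080 MT/s
```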

*edit*
I can also share some numbers from a comparison I did a while back of a 7950X vs a 7950X3D.
Vanilla Zen4 bandwidth vs X3D Zen4 bandwidth

[attached screenshot] vs [attached screenshot]

Vanilla Zen4 latency vs X3D Zen4 latency
[attached screenshot] vs [attached screenshot]

All this extra gaming performance in Zen4 X3D comes from this little red square
 

ToTTenTranz

Member
Feb 4, 2021
[quoted screenshots: Vanilla Zen4 bandwidth vs X3D Zen4 bandwidth]

Vanilla Zen4 latency VS X3D Zen4 latency
[quoted screenshots]

All this extra gaming performance in Zen4 X3D comes from this little red square

Why does the Zen4 show higher bandwidth / lower latency below 32MB? Are those caches clocking higher?
 

naukkis

Senior member
Jun 5, 2002
"Instructions Per Cycle" means instructions per cycle. Is that so hard to memorize?

Edit, as an example, when one processor spins on a lock for 0.2 ms, and the other for 0.3 ms, which of the two processors got the higher Instructions Per Cycle count?

IPC can be calculated for a game too, including cycles stalled on locking, as said before. But the 5700 vs 5700X comparison removes that argument: both CPUs share the same 8-core CCX at similar clocks, and the 5700 actually has a slightly smaller locking penalty since its northbridge is built in, versus external on the 5700X. So pretty much all of the performance difference comes from the doubled L3 cache. And games shouldn't use spinlocks anyway, as that is a totally inefficient way to handle locking; today's CPU boost algorithms will also reduce CPU performance when spinlocks are used.
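To make the spinlock argument concrete, here is a toy model (all numbers hypothetical): IPC is just retired instructions divided by cycles, and a core spinning on a lock retires lots of cheap instructions while making zero progress, so it can report a *higher* IPC while finishing later.

```python
def ipc(instructions: float, cycles: float) -> float:
    """Instructions per cycle: retired instructions divided by core cycles."""
    return instructions / cycles

# Hypothetical frame: 10M useful instructions at 1.5 IPC on both CPUs.
useful_instr = 10_000_000
useful_cycles = useful_instr / 1.5

# CPU A additionally spins on a lock for 2M cycles, retiring ~2 cheap
# instructions per cycle while it waits (zero useful work done).
spin_cycles = 2_000_000
a_ipc = ipc(useful_instr + 2 * spin_cycles, useful_cycles + spin_cycles)
b_ipc = ipc(useful_instr, useful_cycles)

# CPU A reports the higher IPC despite needing more cycles for the frame.
print(f"A: {a_ipc:.2f} IPC over {useful_cycles + spin_cycles:.0f} cycles")
print(f"B: {b_ipc:.2f} IPC over {useful_cycles:.0f} cycles")
```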
 

naukkis

Senior member
Jun 5, 2002
By looking at CPU performance counters?
Or by looking at Frames Per Second on the display output?
That was the point of #10,817, basically.

A game executes a given number of instructions per frame if stupid things like lock spinning are excluded. And if they are included, the performance that matters is the fps, not a count of non-useful instructions executed. So yes: when comparing game performance, measure fps over any non-revealing metric.
 

StefanR5R

Elite Member
Dec 10, 2016
Instructions Per Frame are not a constant.
Prove me wrong. :-)

(Or don't. Somebody requested this loop to end already a while ago.)

Edit: Those who are interested in game performance in terms of Frames Per Second should, by all means, measure Frames Per Second. But for CPU microarchitecture analyses, like the one in the ISSCC presentation, perhaps additional steps could be taken.

Edit 2: CPU Cycles Per Second aren't a constant in the linked Techspot article either.
 

naukkis

Senior member
Jun 5, 2002
Instructions Per Frame are not a constant.
Prove me wrong. :-)

(Or don't. Somebody requested this loop to end already a while ago.)
Old single-threaded game engines were synced to the frame rate; in that case, instructions per frame is pretty much constant. Multithreaded engines run the game/physics engine asynchronously from the rendering engine, so instructions aren't totally tied to fps, but for the fps-relevant visual part they still pretty much are, at least if there are enough threads to keep the visual side from stalling.
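The "instructions per frame" framing can be sketched with a toy model (all numbers hypothetical): if a frame-synced, single-threaded engine retires a roughly fixed instruction count per frame, fps falls straight out of clock × IPC.

```python
def fps(clock_hz: float, ipc: float, instr_per_frame: float) -> float:
    """Frames per second for a single-threaded, frame-synced engine:
    retired instructions per second divided by instructions per frame."""
    return clock_hz * ipc / instr_per_frame

# Hypothetical engine: 50M instructions per frame on a 5 GHz core.
base = fps(5.0e9, 1.20, 50e6)    # baseline IPC
uplift = fps(5.0e9, 1.32, 50e6)  # same clock, +10% IPC

# With instructions/frame held constant, fps scales linearly with IPC.
print(f"{base:.0f} -> {uplift:.0f} fps (+{uplift / base - 1:.0%})")
```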
 

Ajay

Lifer
Jan 8, 2001


So we now have:

9800X, 8 cores, 170w TDP
Clock regression, ~100Mhz
IPC, ~10% compared to Zen4 <NEW>

OMG. I strongly recommend Mike Clark don't wake up anytime soon and keep sleeping until Zen6.
So we are down to throwing as much spaghetti (G-rated) as possible against the wall to see what sticks, then claiming 100% accuracy in prediction. Seriously, how bent does one have to be?
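For what the rumored numbers would actually mean: per-thread performance is roughly clock × IPC, so a ~100 MHz clock regression against a ~10% IPC uplift still nets out positive. A sketch with the rumored figures (the 5.7 GHz baseline boost is an illustrative assumption, and all inputs are unconfirmed rumor):

```python
def net_gain(old_clock_ghz: float, new_clock_ghz: float, ipc_uplift: float) -> float:
    """Relative 1T performance change: (new/old clock) * (1 + IPC uplift) - 1."""
    return (new_clock_ghz / old_clock_ghz) * (1 + ipc_uplift) - 1

# Rumor: ~100 MHz down from an assumed 5.7 GHz boost, ~10% IPC uplift.
print(f"{net_gain(5.7, 5.6, 0.10):+.1%}")  # still a net single-thread gain
```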
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
No, something must have been mixed up there.
Zen 1 -> Zen 2: circa double the FP throughput per core, circa double the throughput/Watt​
Zen 2 -> Zen 3: some throughput increase but barely any throughput/Watt increase in most cases, big benefit to special multithreaded workloads which have larger than 16 MB cache footprint​
Zen 3 -> Zen 4: notably higher throughput and throughput/Watt, additional performance increase in vectorized FP workloads​
in various Distributed Computing applications. (These are applications which are highly parallel/ almost entirely compute-bound/ power-limited workloads with FP focus. One could conclude that the manufacturing node updates are all what counts in this set of workloads. But really, microarchitecture updates <edit: and SOC updates> and node updates go hand in hand as they enable and leverage each other.)

[I don't have Zen 1/ Naples (but Broadwell-EP which has got similar throughput/Watt), nor do I have Zen 3 myself. I do have Zen 2/ Rome and Zen 4/ Genoa in machines which are configured to same core counts and similar power budgets. My conclusions relative to Zen 1 and Zen 3 rely on what I have seen from others' computers.]

Zen 5 in Distributed Computing? I trust that AMD carves out a decent perf/W update once again, despite only a minor manufacturing node update. But how much? Various hints earlier in this thread sounded promising to me. Though so far, 1T and/or iso-clock and/or integer performance characteristics have been more of a focus in this thread than nT iso-power FP.


Actually SMT does measurably improve throughput in PrimeGrid on Zen 4, desktop and server, and does improve perf/W slightly. In contrast, on Zen 2 and Zen 3, SMT usage in PrimeGrid provides no or sometimes a small host throughput advantage but always reduces perf/W. (PrimeGrid is vectorized FP with large cache footprint, but not too large on Zen 3 and 4 if the user gives hints to the OS's process scheduler. Zen 2's cache is too small in many but not all of PrimeGrid's currently active projects.)
My take, I guess, is just my own, but it's close to yours. I think two points are shared: efficiency is key for us, and AVX-512 helps a lot.
 

StefanR5R

Elite Member
Dec 10, 2016
My take I guess is just my own, but its close to yours.
Well, not owning Zen 1 and Zen 3 myself, I don't ultimately trust my own assessments of them. Though back in the day, the Zen 1-->2 step evidently was a big one in perf/host and perf/W thanks to the Glofo 14nm --> TSMC 7nm switch, but not only due to that as the Zen 2 core and SOC update was far from a straightforward shrink.

The step which lies ahead, TSMC 5nm --> 4nm, will be nothing in comparison. Yet AMD appears to widen the core a lot, presumably putting a lot of smarts into the frontend to actually be able to put this width to use, while at the same time keeping the power budget per core practically unchanged. I am really curious how that will turn out in power-limited loads.

Efficiency is key for us,
Yep, as the aggregate core count in the household reaches certain above-average levels, and many of these cores are actually used 24/7 (be it for Citizen Science or for engineering jobs etc.), small things like the electric bill, the heat load in the home, or which computer to attach to which power circuit do become more of a concern. I find myself thinking more often in terms of perf/host and perf/W than perf/core. So, while the (alas rather circular) iso-clock performance discussions here in this thread are surely interesting (vulgo: IPC), what I am looking forward to more is to eventually get to see perf/W figures.
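The perf/host vs perf/W framing above reduces to a couple of trivial ratios; a minimal sketch with entirely hypothetical numbers, just to show how the efficiency view can invert the per-core view:

```python
def perf_per_watt(throughput: float, watts: float) -> float:
    """Throughput per watt, e.g. tasks/day per W at the wall."""
    return throughput / watts

def perf_per_host(throughput_per_core: float, cores: int) -> float:
    """Aggregate host throughput, assuming homogeneous cores."""
    return throughput_per_core * cores

# Hypothetical: a wide, lower-clocked config vs a narrow, higher-clocked
# one at the same wall power. Per-core, the narrow config looks better;
# per-host and per-watt, the wide one wins.
wide = perf_per_host(0.9, 32)    # 32 slower cores
narrow = perf_per_host(1.0, 24)  # 24 faster cores
print(perf_per_watt(wide, 280), perf_per_watt(narrow, 280))
```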
 

StefanR5R

Elite Member
Dec 10, 2016
For compute nodes,
– CPUs with cores of uneven per-core performance,​
– area-optimized cores​
are not attractive. You'd want
+ CPUs with homogeneous cores,​
+ cores and SOCs which are optimized towards a certain point between the three targets performance, performance efficiency, and performance density.​
The particular location of the optimization sweet spot depends on your cost structure (e.g. whether or not there are software licensing costs involved; whether or not rack space is at a premium to you…).

Edit, that's also true for home computers, if used for computing in the narrower sense, "HPC at home" if you will. E.g. when I built my first two dual-socket computers a while back, I needed not just plain perf/dollar (which would have been much better with desktop computers) but also perf/node (due to synchronization overhead in my application, which was too high over Ethernet for my purpose) and perf/core (due to scaling difficulties in this application). If CPUs with "e cores" had been available back then, they would not have been what I needed due to the latter aspect. Edit 2: nowadays I have accumulated enough computers that "rack space" (shelf space actually) is definitely a criterion to me too. (Energy consumption more so, though.)
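The perf/node point (synchronization overhead over Ethernet) can be modeled crudely: if each iteration does a fixed amount of serialized compute plus a fixed per-step synchronization cost, then fewer, faster nodes win once the sync cost is significant. All numbers below are hypothetical:

```python
def effective_throughput(nodes: int, per_node_rate: float,
                         t_compute: float, t_sync: float) -> float:
    """Iterations/s for a sync-bound job: each step costs compute time
    (shrinking with aggregate node rate) plus a fixed synchronization cost."""
    step_time = t_compute / (nodes * per_node_rate) + t_sync
    return 1.0 / step_time

# Hypothetical job: 1.0 s of compute per step on one unit-rate node.
# Equal aggregate rate (4x), but more nodes pay more synchronization.
few_fat = effective_throughput(2, 2.0, 1.0, 0.010)    # 2 fast nodes, 10 ms sync
many_thin = effective_throughput(8, 0.5, 1.0, 0.040)  # 8 slow nodes, 40 ms sync
print(few_fat, many_thin)
```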
 

Fjodor2001

Diamond Member
Feb 6, 2010
For compute nodes,
– CPUs with cores of uneven per-core performance,​
– area-optimized cores​
are not attractive. […]
I guess it depends on what workloads you are running. Most people with DT systems do not have them mounted in racks. So space is not really a concern.

I think for a typical DT user with mixed workloads this is more important:
1. Max ST performance up to a certain number of cores, e.g. ~8C.
2. For use cases scaling above ~8C, max MT perf, max perf/watt, and the highest E-core count for the lowest price.

For 1) you want P cores, and for 2) you want E cores. Those only needing 1) can be satisfied with ~8 P cores alone.