Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 990

MS_AT

Senior member
Jul 15, 2024
822
1,664
96
It's the Nimo Mini PC Pro. It has a GameMax 12VO Flex ATX unit that appears to be a modified version of their Gold-rated 350W Flex ATX PSU. Back-of-the-napkin math, comparing it to mini PCs that use a power brick with a datasheet-provided efficiency figure, suggests it is 90%+ efficient, like GameMax claims.

It doesn't seem to me like the PSU efficiency is a major factor here, especially considering my desktop
1) is using a 1000W unit, so at 190 watts it is not meaningfully more efficient than this PSU appears to be
2) has the GPU idling at 35W because of high refresh rate displays
3) has 8 fans, LEDs, and a 4x25GbE SFP28 NIC, all adding power overhead
and it still draws the same power at the wall.

My desktop idles at 100W at the wall, so the relatively smaller idle-to-load power draw delta is pretty definitive IMO.
Sure, I didn't mean to say it's the PSU's fault ;) Anyway, after some digging, it seems notebookcheck measured 154ns memory latency: https://www.notebookcheck.net/HP-Z2...en-AI-Max-and-Radeon-RX-8060S.1069652.0.html# That's much worse than I thought. I was expecting something like 100-110ns, like Strix Point. I guess it's double the latency of your desktop.
 
  • Like
Reactions: igor_kavinski

Josh128

Golden Member
Oct 14, 2022
1,164
1,759
106
It’s interesting to note that Ars also had similar results, in fact I would say the 9700X was doing much better relative to the power consumption when compared to the Max+ 395 in Handbrake.
Note: All AMD readings include the package power.

[Attachments 128461, 128462, 128463]

However the 9700X did poorly against the Max+ 395 in Cinebench/Blender.
What's going on with Handbrake here? 32 threads equaling 16 threads while using twice the power? Getting smoked by 24 threads at the same power? The only thing that makes sense is that something is broken for Halo in this workload.
 
Jul 27, 2020
27,160
18,665
146
The only thing that makes sense is that something is broken for Halo in this workload.
The broken thing is probably latency. It's what destroyed Arrow Lake's potential. As a version 1.0 product, Strix Halo is decent. Just not the realization of all our dreams. Medusa Halo may end up fixing a lot of the issues.
 
  • Like
Reactions: fastandfurious6

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,840
3,221
146
What's going on with Handbrake here? 32 threads equaling 16 threads while using twice the power? Getting smoked by 24 threads at the same power? The only thing that makes sense is that something is broken for Halo in this workload.
In my testing, Strix Halo has some rigid scheduling rules that make every effort to group threads onto CCD0. Something that schedules, say, 1 thread per physical core and expects that outcome actually gets all 16 threads shoved onto CCD0, leaving the other CCD idle.

As far as I can figure out, there's no way around this for Strix Halo. The 9950X3D sometimes exhibits similar behavior, but changing the scheduling directive in the BIOS fixes it.

So workloads that don't saturate all 32 threads but expect to get every physical core, well, don't.

My Unreal results were significantly worse for Strix Halo initially due to this. I used config overrides to spawn enough threads that the scheduler had no choice but to saturate both CCDs.

AFAIK Handbrake is similar: it doesn't saturate every logical processor, so if the scheduler bunches as many threads as it can onto CCD0 and only then spills over onto CCD1, it will perform extremely poorly with no real good remedy.

This behavior might make sense in a laptop with extremely constrained cooling, power delivery, and a battery, but it doesn't once those factors are removed, as is the case with these mini PCs.

NOTE: all the perf figures I shared earlier were taken after I applied workarounds to get full CPU saturation, so this is not the reason why it's inexplicably slow and inefficient compared to GNR eco mode in my workloads.
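
To make the placement problem concrete, here is a toy model as a minimal sketch. The topology (2 CCDs x 8 cores x 2 SMT threads) matches Strix Halo's 16C/32T, but `pack_ccd0_first` and `one_per_core` are simplified, hypothetical policies for illustration, not the actual Windows or firmware scheduler logic.

```python
from collections import Counter

CCDS = 2            # Strix Halo: two 8-core CCDs
CORES_PER_CCD = 8
SMT = 2             # logical CPUs per physical core


def pack_ccd0_first(n_threads):
    """Assumed policy: fill every logical CPU of CCD0 before spilling onto CCD1."""
    per_ccd = CORES_PER_CCD * SMT
    placements = []
    for t in range(n_threads):
        ccd, slot = divmod(t, per_ccd)
        # Record (CCD, core within CCD); SMT siblings map to the same core.
        placements.append((ccd, slot % CORES_PER_CCD))
    return placements


def one_per_core(n_threads):
    """Assumed policy: one thread per physical core across both CCDs before SMT."""
    total_cores = CCDS * CORES_PER_CCD
    return [divmod(t % total_cores, CORES_PER_CCD) for t in range(n_threads)]


def summarize(placements):
    """Return (distinct physical cores used, threads per CCD)."""
    cores_used = len(set(placements))
    per_ccd = Counter(ccd for ccd, _ in placements)
    return cores_used, dict(per_ccd)


print(summarize(pack_ccd0_first(16)))  # (8, {0: 16})
print(summarize(one_per_core(16)))     # (16, {0: 8, 1: 8})
```

In this model a 16-thread job gets only 8 physical cores under the packing policy, which is the kind of shortfall described above for workloads that expect one thread per core.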
 
Last edited:

poke01

Diamond Member
Mar 8, 2022
4,027
5,354
106
This behavior might make sense in a laptop with extremely constrained cooling, power delivery, and a battery, but it doesn't once those factors are removed, as is the case with these mini PCs.
Surely it's up to AMD and Framework to test this and fix it. They are leaving perf on the table.
 

StefanR5R

Elite Member
Dec 10, 2016
6,621
10,468
136
I just wonder what is the reason the performance is worse than expected given apparently higher clocks.
One potential reason: If the cores stall a lot on memory accesses, they don't pull a lot of power and the firmware may therefore clock them higher.

I have seen this with other workloads (vector arithmetic; if the data fits into L3$, it is energy-intensive, produces quite a bit of heat and reduces the core clocks while power-limited; but if the data has to be read/written a lot from/to main memory, the job takes of course longer, the CPU produces less heat yet the cores clock higher). I don't know if this corresponds with what @Hail The Brain Slug saw.
 

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,840
3,221
146
One potential reason: If the cores stall a lot on memory accesses, they don't pull a lot of power and the firmware may therefore clock them higher.

I have seen this with other workloads (vector arithmetic; if the data fits into L3$, it is energy-intensive, produces quite a bit of heat and reduces the core clocks while power-limited; but if the data has to be read/written a lot from/to main memory, the job takes of course longer, the CPU produces less heat yet the cores clock higher). I don't know if this corresponds with what @Hail The Brain Slug saw.
Temps were sky high, it was the hottest workload by far. Wall draw never showed a decrease in power consumption, it was pegged at max power nonstop and drawing 190W the entire time.
 

MS_AT

Senior member
Jul 15, 2024
822
1,664
96
One potential reason: If the cores stall a lot on memory accesses, they don't pull a lot of power and the firmware may therefore clock them higher.
I discarded that option because, as Hail reinforces above, the power draw was still significant. Another option is that it was unstable and clock-stretching. Either way, something was off.
 

fastandfurious6

Senior member
Jun 1, 2024
689
871
96
Isn't it the case that if a chip is overly optimized for low power, max perf suffers?

it's super insane they managed to slap full 9950X+midrange gpu into a handheld lmao
 

yottabit

Golden Member
Jun 5, 2008
1,663
853
146
I mean, I’m not surprised a chip with V-cache in eco mode can beat out Strix Halo efficiency-wise in certain workloads. Assuming more of the hot loop code can fit into the L3 it would make sense. IMO it would be more “fair” to compare it to 9950x efficiency before concluding there is something “wrong” with Halo
 

poke01

Diamond Member
Mar 8, 2022
4,027
5,354
106
I mean, I’m not surprised a chip with V-cache in eco mode can beat out Strix Halo efficiency-wise in certain workloads. Assuming more of the hot loop code can fit into the L3 it would make sense. IMO it would be more “fair” to compare it to 9950x efficiency before concluding there is something “wrong” with Halo
I mean, it’s losing to a 9700X in the Handbrake test in the Ars review. Something is wrong, and AMD so far hasn’t commented.
 

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,840
3,221
146
I mean, I’m not surprised a chip with V-cache in eco mode can beat out Strix Halo efficiency-wise in certain workloads. Assuming more of the hot loop code can fit into the L3 it would make sense. IMO it would be more “fair” to compare it to 9950x efficiency before concluding there is something “wrong” with Halo
I retested the workloads with each CCD disabled to verify V$ gains. It was only single digit %, so nothing significant to alter the outcome of my other testing.
 

poke01

Diamond Member
Mar 8, 2022
4,027
5,354
106
While going down the HPC ARM vs x86 rabbit hole, I found this interesting note regarding Strix Halo on Windows.


Things that make performance-squeezing-out tricky on Windows

Regardless of the power plan, the second CCD remains parked by default—even when running on AC power—and it doesn’t wake up unless all 16 threads (8 cores + 8 SMT) are fully utilized. As a result, if you run a 16-threaded program, the second CCD won’t be activated. I’m not sure whether this behavior is controlled by AMD or HP, but I hope this policy will be changed later.

So, to make use of 16 threads across the two CCDs while running the COMSOL benchmark, I had to use Process Lasso to manually wake up the second CCD.

It would be best if HP provided an option to disable SMT in the BIOS, but I could not find it. Considering this laptop is intended for workstation use, I think this is more or less disappointing.



This is down to AMD to fix since it’s happening on ALL Halo machines regardless of laptop or mini PC. AMD cannot advertise this as a workstation class machine when it’s running Windows…
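
For reference, the manual workaround in the quoted note amounts to setting a CPU affinity that spans both CCDs. Below is a minimal sketch of building such a mask. It assumes logical CPUs are numbered so that SMT siblings are adjacent (0 and 1 share a core) and that CCD0 owns logical CPUs 0-15; that numbering is an assumption and should be checked on the actual machine. Both function names are hypothetical.

```python
def one_thread_per_core_cpus(ccds=2, cores_per_ccd=8, smt=2):
    """First logical CPU of every physical core (assumes adjacent SMT siblings)."""
    return [core * smt for core in range(ccds * cores_per_ccd)]


def to_bitmask(cpus):
    """Convert a list of logical CPU indices into an affinity bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


cpus = one_thread_per_core_cpus()
print(cpus)
print(hex(to_bitmask(cpus)))  # 0x55555555
```

The list could then be handed to something like `psutil.Process().cpu_affinity(cpus)`. Whether pinning alone actually unparks the second CCD on a given Halo machine is not something this sketch can confirm.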
 

StefanR5R

Elite Member
Dec 10, 2016
6,621
10,468
136
One potential reason: If the cores stall a lot on memory accesses, they don't pull a lot of power and the firmware may therefore clock them higher.
Temps were sky high, it was the hottest workload by far. Wall draw never showed a decrease in power consumption, it was pegged at max power nonstop and drawing 190W the entire time.
I should have written: If the cores stall a lot on memory accesses, they don't pull a lot of power at a given clock speed, e.g. 3.8 GHz (at which the 9950X3D-eco happened to pull the same system power). The firmware may therefore clock them higher until the power limit is reached again, or any other limit is reached (temperature, amperage, …, if not f_max), e.g. 4.6 GHz (was that a time average? It is already 90% of f_max). That just means burning power for burning power's sake while the execution units aren't actually doing much.
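
A toy version of that argument, as a rough sketch only. It assumes dynamic power scales as activity x f^3 (i.e. voltage rising linearly with frequency), calibrates the power limit so that a fully busy core hits it at 3.8 GHz, and uses a made-up activity factor of 0.56 for the memory-stalled case. The f_max of 5.1 GHz is inferred from "4.6 GHz is already 90% of f_max". None of these numbers are measurements.

```python
def clock_at_power_limit(power_limit, activity, k=1.0, f_max=5.1):
    """Highest clock (GHz) where activity * k * f**3 fits the power limit, capped at f_max."""
    f = (power_limit / (activity * k)) ** (1.0 / 3.0)
    return min(f, f_max)


# Calibrate: a fully busy core (activity = 1.0) reaches the limit at 3.8 GHz.
POWER_LIMIT = 1.0 * 3.8 ** 3

busy = clock_at_power_limit(POWER_LIMIT, activity=1.0)
stalled = clock_at_power_limit(POWER_LIMIT, activity=0.56)  # made-up activity factor

print(round(busy, 2), round(stalled, 2))
```

In this model both cases sit at the same power limit. The stalled workload simply runs at a higher clock without getting more work done, which would be consistent with high wall draw and high temperatures alongside disappointing throughput.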
 

StefanR5R

Elite Member
Dec 10, 2016
6,621
10,468
136
Regardless of the power plan, the second CCD remains parked by default
Does anybody know whether Strix Point computers are set up in the same way — that is, keep the dense CCX idle as long as all runnable software threads fit onto the logical CPUs of the classic CCX¹, IOW prefer SMT usage over dual-CCX spread usage?

________
¹) That would be 8 threads in case of Strix Point, except Ryzen AI 7 PRO 360, in which the classic CCX only has 3 cores / 6 threads.
 
Last edited:

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
8,170
3,101
146
Any updates here on the supposedly upcoming CPU with dual 3D cache CCDs?