Zen 6 Speculation Thread

Page 317

OneEng2

Senior member
Sep 19, 2022
No, that's not what this post says. This post shows "TDP scaling (Cinebench 2024 MC)", plotted by Computerbase.

Cinebench

It's something quite special. Better not to make general statements based on data like that.
Cinebench-style MT scaling is representative of:
  • 3D CPU rendering
  • Video encoding (CPU-heavy)
  • Scientific/M&E multi-thread workloads
  • Compilers
  • Anything that scales perfectly with threads and uses FP math heavily

I'll grant you that this isn't the bulk of day-to-day usage for the vast majority of x86 users, but that is another subject altogether.

The point I have been debating is that some still believe Zen 6 24c will somehow beat NVL 52c in these kinds of apps. To me, the math just doesn't make sense. It seems like NVL will win easily in these applications.
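For what it's worth, here is the back-of-envelope version of that math as a minimal Python sketch. Every scaling factor in it is an assumption picked for illustration, not a measured number:

```python
# Hypothetical MT throughput comparison; all factors are assumptions.
zen6_cores = 24
smt_uplift = 1.25       # assumed MT gain from SMT on Zen 6
nvl_p, nvl_e = 16, 32   # rumored NVL-S P/E counts (4 LPE cores ignored)
e_core_factor = 0.65    # assumed E-core throughput relative to a P-core

zen6_mt = zen6_cores * smt_uplift       # ~30 "P-core units" of throughput
nvl_mt = nvl_p + nvl_e * e_core_factor  # ~36.8 "P-core units"
print(f"Zen 6 24C/48T: {zen6_mt:.1f}  NVL-S 16P+32E: {nvl_mt:.1f}")
# Under these assumptions the 52c part leads in embarrassingly parallel
# FP workloads; nudge the SMT uplift or E-core factor and the gap moves,
# and a shared power limit compresses it further.
```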
 

poke01

Diamond Member
Mar 8, 2022
I think the confidence comes from the fact that the Arctic Wolf E-core will be amazing.


It's still an E-core; it doesn't matter if it has AVX10.2. No E-core or small core is ever a substitute for an actual P-core like Zen 6 or Coyote Cove.

So for me it's 16 P-cores + 32 Cinebench accelerators + 4 LPE cores that don't belong on desktop.

I'd rather have 24 Zen 6 cores. You're all getting caught up in the number of cores, but what matters is how good the 1T performance is. That's it.
 

adroc_thurston

Diamond Member
Jul 2, 2023
I think the confidence comes from the fact that the Arctic Wolf E-core will be amazing
Indeed.
Unfortunately, 288 DKTs on 18A struggle to compete with 192c Z5dense on N3E, so back to reality they go.
It's still an E-core; it doesn't matter if it has AVX10.2. No E-core or small core is ever a substitute for an actual P-core like Zen 6 or Coyote Cove.

So for me it's 16 P-cores + 32 Cinebench accelerators + 4 LPE cores that don't belong on desktop.
Really not the problem.
Atoms are competent all-around cores with good area and horrible power.
They just kinda suck at power-limited nT.
 

MS_AT

Senior member
Jul 15, 2024
1T is all that matters for MT? Ooookay.
Ever heard of https://en.wikipedia.org/wiki/Amdahl's_law ?
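To make the point concrete, a minimal sketch; the parallel fraction p is the assumption that decides everything:

```python
# Amdahl's law: speedup S(n) = 1 / ((1 - p) + p / n), p = parallel fraction.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.90, 0.95, 0.99):
    print(f"p={p:.2f}: 24T -> {speedup(p, 24):5.2f}x, 48T -> {speedup(p, 48):5.2f}x")
# Even at p=0.95, doubling 24T to 48T only lifts the speedup from ~11.2x
# to ~14.3x; the serial fraction eats the rest.
```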
Video encoding (CPU-heavy)
It has trouble scaling unless you split the video into parts and encode them in parallel. At least that's what I have read on x265 enthusiast forums.
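The chunked approach looks roughly like this. A hypothetical sketch only: the file names and settings are made up, and it assumes ffmpeg with libx265 on PATH:

```python
# Split at keyframes without re-encoding, encode the chunks in parallel,
# then stitch the results together. Audio is dropped for brevity.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor  # ffmpeg does the real work

subprocess.run(["ffmpeg", "-i", "input.mkv", "-an", "-c", "copy",
                "-f", "segment", "-segment_time", "300",
                "-reset_timestamps", "1", "chunk_%04d.mkv"], check=True)

def encode(chunk: str) -> str:
    # One x265 instance per chunk: a single instance scales poorly past
    # a dozen or so threads, but independent instances scale with cores.
    out = "enc_" + chunk
    subprocess.run(["ffmpeg", "-i", chunk, "-c:v", "libx265", "-crf", "20",
                    out], check=True)
    return out

with ThreadPoolExecutor(max_workers=4) as pool:  # tune to core count
    encoded = list(pool.map(encode, sorted(glob.glob("chunk_*.mkv"))))

with open("list.txt", "w") as f:
    f.writelines(f"file '{name}'\n" for name in encoded)
subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0", "-i", "list.txt",
                "-c", "copy", "output.mkv"], check=True)
```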
Compilers
Real build systems hit https://en.wikipedia.org/wiki/Amdahl's_law too. If you want proof, go over ServeTheHome's Linux compile tests over the years and how they changed the methodology. Or better yet, try building Chromium with a varying number of workers yourself.
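Measuring that yourself is simple enough; a sketch assuming a configured directory where `ninja -C build` works (substitute `make -j` for your build system of choice):

```python
# Time clean builds at increasing worker counts; the curve flattening
# well before the core count is Amdahl's law showing up in practice.
import subprocess
import time

baseline = None
for workers in (1, 2, 4, 8, 16, 32, 48):
    subprocess.run(["ninja", "-C", "build", "-t", "clean"], check=True)
    start = time.perf_counter()
    subprocess.run(["ninja", "-C", "build", f"-j{workers}"], check=True)
    elapsed = time.perf_counter() - start
    baseline = baseline or elapsed
    print(f"-j{workers:<3} {elapsed:8.1f} s   speedup {baseline / elapsed:5.2f}x")
```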
Anything that scales perfectly with threads and uses FP math heavily
Shouldn't these end up on GPUs?
 

Joe NYC

Diamond Member
Jun 26, 2021
No need to spam the thread with such basics again and again. It’s already been posted and discussed numerous times.

We're talking about a 48T scenario here. So it's the same situation for both Zen 6 24C/48T and NVL-S 48C/48T.

In PCs (Personal Computers), gains at higher thread counts are flatlined, meaning you are getting practically no gains, and the 2nd CCD (for both Zen 6 and NVL) delivers almost nothing.

I am not sure why you are making this argument in the mostly irrelevant (flat) part of the Amdahl curve. You could instead make the comparison between 1-CCD Zen 6 and NVL, where additional threads may still deliver a performance increment > 0.

What's the obsession with 48+ threads?
 


MS_AT

Senior member
Jul 15, 2024
It’s already been posted and discussed numerous times.
And so many times you have failed to understand why this is a problem ;)
We're talking about a 48T scenario here. So it's the same situation for both Zen 6 24C/48T and NVL-S 48C/48T.
It's not. When Amdahl's law hits you, you basically care that the longest subtask everyone else is waiting for runs on the fastest core. That is easier to achieve with homogeneous cores. For example, most build systems are not aware of hybrid CPUs and trust that the OS and Thread Director will do a good job. Spoiler alert: they sometimes fail miserably. I mean, people at work explicitly disable E-cores so their compiles go faster (keep in mind that's dev work, which is different from CI work). Disclaimer: we don't have Arrow Lake to test with, only Raptor and Meteor Lake. But Windows still has problems handling that, so many years after Alder Lake.
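A toy model of that failure mode, if it helps; core speeds and job costs are made-up numbers, and only the shape of the result matters:

```python
# Greedy list scheduling: longest jobs first, each placed on the core that
# finishes it earliest. Think of the 120-unit job as the one huge
# translation unit that the link step waits for.
def makespan(job_costs, core_speeds):
    busy = [0.0] * len(core_speeds)
    for cost in sorted(job_costs, reverse=True):
        i = min(range(len(busy)), key=lambda k: busy[k] + cost / core_speeds[k])
        busy[i] += cost / core_speeds[i]
    return max(busy)

jobs = [120] + [10] * 120
print(makespan(jobs, [1.0] * 24))              # homogeneous: 120.0
print(makespan(jobs, [1.0] * 8 + [0.6] * 16))  # speed-aware hybrid: 120.0
print(120 / 0.6)   # same big job mis-placed on a 0.6x core: 200.0
# With homogeneous cores there is no wrong placement; on hybrid, the
# scheduler has to get the critical path onto a fast core every time.
```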
There's a ton of use cases for CPU SIMD.
I don't deny it. But the conditions he specified are perfect for GPUs. I mean, if something scales perfectly with increasing thread count, it's probably an embarrassingly parallel workload with little communication between workers, and since it's pure number crunching, there's little need for branch prediction.
 

StefanR5R

Elite Member
Dec 10, 2016
  • 3D CPU rendering
If we pick just this one as an example MT application area and look across different renderers (and maybe different test scenes on top), we see that Cinebench does not give the full picture of perf/W for CPU vendor X vs. CPU vendor Y. Not at all.
 

DrMrLordX

Lifer
Apr 27, 2000
Indeed.
Unfortunately, 288 DKTs on 18A struggle to compete with 192c Z5dense on N3E, so back to reality they go.

The worst part about Clearwater Forest is that it isn't even available yet, while Turin-dense has been on the market for a while now.

In the meantime, Lisa Su's Thanksgiving turkey size seems to be scaling proportionally with AMD profits.
She hasn't been showing off huge rings recently, so maybe she has to compete with JHH's leather jackets using roast turkeys.
 

Fjodor2001

Diamond Member
Feb 6, 2010
In PCs (Personal Computers), gains at higher thread counts are flatlined, meaning you are getting practically no gains, and the 2nd CCD (for both Zen 6 and NVL) delivers almost nothing.

I am not sure why you are making this argument in the mostly irrelevant (flat) part of the Amdahl curve. You could instead make the comparison between 1-CCD Zen 6 and NVL, where additional threads may still deliver a performance increment > 0.

What's the obsession with 48+ threads?
The context was OneEng2's statement that NVL-S 48C/48T is likely to beat Zen 6 24C/48T in 48T MT scenarios. Both CPUs will be executing 48T and are thus at the same point on Amdahl's curve, so throwing that curve into the discussion adds nothing for that scenario.
 

Fjodor2001

Diamond Member
Feb 6, 2010
And so many times you have failed to understand why this is a problem ;)

It's not. When Amdahl's law hits you, you basically care that the longest subtask everyone else is waiting for runs on the fastest core. That is easier to achieve with homogeneous cores. For example, most build systems are not aware of hybrid CPUs and trust that the OS and Thread Director will do a good job. Spoiler alert: they sometimes fail miserably. I mean, people at work explicitly disable E-cores so their compiles go faster (keep in mind that's dev work, which is different from CI work). Disclaimer: we don't have Arrow Lake to test with, only Raptor and Meteor Lake. But Windows still has problems handling that, so many years after Alder Lake.

I don't deny it. But the conditions he specified are perfect for GPUs. I mean, if something scales perfectly with increasing thread count, it's probably an embarrassingly parallel workload with little communication between workers, and since it's pure number crunching, there's little need for branch prediction.
See my previous post above. Also, you have similar problems with fast vs. slow threads on Zen 6, due to SMT making some tasks/threads execute faster or slower.

Then we also have cases where multiple apps are executing in parallel. It does not have to be a single app using all 48T.

The scenario discussed was when all 48T are actually being used, regardless of how that is done.
 

Kryohi

Member
Nov 12, 2019
the conditions he specified are perfect for GPUs. I mean, if something scales perfectly with increasing thread count, it's probably an embarrassingly parallel workload with little communication between workers, and since it's pure number crunching, there's little need for branch prediction.
Eh, in the real world it often doesn't work like this. Maybe for big companies, but otherwise:
1. Porting software to GPGPU is a PITA, only worth it for big and very reusable stuff.
2. Abysmal FP64 performance on modern GPUs, if you need that.
3. Who says code with a lot of branches must necessarily have a lot of communication between threads?
4. Often you need more cores because you have a lot of data to work on in parallel (e.g. a 3-hour 4K video to encode vs. a 10-minute 1080p one, or a bioinformatics pipeline), not because the actual algorithms used are particularly parallelizable (see again e.g. x265).

A lot of CPU cores are useful for a lot of people, though I personally do not like P+E configurations at all, and I know for a fact that people have had trouble with them in a couple of different, widely used scientific programs.
 

MS_AT

Senior member
Jul 15, 2024
1. Porting software to GPGPU is a PITA, only worth it for big and very reusable stuff.
That is generally true, but this particular case (massively parallel, heavy math) should be easier to port than most things.
Abysmal FP64 performance on modern GPUs, if you need that.
That's the case only for consumer GPUs, and even then, iGPUs aside, I am not sure a 9950X has higher FP64 performance than a mid-class consumer GPU. Especially if you factor in the CPU's massive memory bandwidth disadvantage.
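Peak numbers are easy to estimate, for whatever peak is worth; the core counts, pipe widths, clocks, and FP64 rates below are illustrative assumptions, not vendor specs:

```python
# Peak FP64 throughput; an FMA counts as 2 FLOPs per lane per cycle.
def cpu_fp64_tflops(cores, fma_pipes, doubles_per_pipe, ghz):
    return cores * fma_pipes * doubles_per_pipe * 2 * ghz / 1000

def gpu_fp64_tflops(fp32_tflops, fp64_rate):
    return fp32_tflops * fp64_rate

# A 16-core CPU, two 512-bit FMA pipes per core, ~4.5 GHz all-core:
print(cpu_fp64_tflops(16, 2, 8, 4.5))  # ~2.3 TFLOPS peak
# A consumer GPU with ~40 FP32 TFLOPS at the common 1/64 FP64 rate:
print(gpu_fp64_tflops(40.0, 1 / 64))   # ~0.6 TFLOPS peak
# Paper FLOPs can favor the CPU; the GPU's several-fold memory bandwidth
# advantage is what decides many real FP64 kernels.
```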
3. Who says code with a lot of branches must necessarily have a lot of communication between threads?
I don't know. If you reread my message, it said that code doing lots of number crunching will not have a lot of branches. FFT kernels, matmul kernels: the only branches come from loops, or you have done something wrong.
Often you need more cores because you have a lot of data to work on in parallel (e.g. a 3-hour 4K video to encode vs. a 10-minute 1080p one, or a bioinformatics pipeline), not because the actual algorithms used are particularly parallelizable (see again e.g. x265).
Sorry, but I am not sure where this came from. Anyway, I was not saying that people shouldn't get more cores if their workflow demands it, just that in a lot of MT cases 1T perf still matters.

I know for a fact that people have had trouble with them in a couple of different, widely used scientific programs.
Add virtualization software to the list.
 

Fjodor2001

Diamond Member
Feb 6, 2010
Is anything known about whether there will be an NPU on Zen 6 DT?

NVL-S is expected to have NPU6 @ 74 TOPS INT8. I assume one of the reasons is to comply with the Microsoft Copilot+ PC requirement of 40+ TOPS. So will Zen 6 DT follow the same path, or be declared non-compliant with that requirement?