Question Zen 6 Speculation Thread


Doug S

Diamond Member
Feb 8, 2020
3,362
5,901
136
In some cases, SMT can reach close to 2x performance gains. This can happen in I/O bound situations, e.g. hard disk or networking limitations.
And in the case of today's many cloud services, more threads can help as well. There you often do not need maximum performance per thread, but rather more cores and threads and maximum energy efficiency.

SMT doesn't come into play for I/O AT ALL, it only covers very short gaps when the processor itself is waiting. When the OS is waiting on interrupt-driven I/O, the OS scheduler handles that. You can't afford to have an SMT thread paused for microseconds, let alone milliseconds (in the case of HDDs), on a core wasting resources. The scheduler will want to use those core resources for other tasks that are actually ready to run, so the waiting thread will be context-switched out and queued to run again when an interrupt signals I/O completion.
 

basix

Member
Oct 4, 2024
163
324
96
SMT doesn't come into play for I/O AT ALL, it only covers very short gaps when the processor itself is waiting. When the OS is waiting on interrupt-driven I/O, the OS scheduler handles that. You can't afford to have an SMT thread paused for microseconds, let alone milliseconds (in the case of HDDs), on a core wasting resources. The scheduler will want to use those core resources for other tasks that are actually ready to run, so the waiting thread will be context-switched out and queued to run again when an interrupt signals I/O completion.
Well, I have programmed such stuff myself, and parallel I/O requests can speed up your application significantly. Instead of pushing I/O through 1 core or 6 cores, I pushed it through 12 parallel processes. Your SSD likes parallelism. Bandwidth increases and the averaged "fetch to use" latency gets reduced. For example, reading files with measurement data. With parallelism, averaged networking latency decreases as well, which makes a huge impact if VPNs and security stuff are slowing you down - e.g. in home office. If you just have lightweight I/O it does not matter too much whether the parallelism comes from SMT or just more processes, but my CPU back then was pushed to 100% load. The system was not idling between I/O requests. I often had to reduce the thread count to N - 2 so that the system stayed somewhat responsive for anything else.

The speedup I saw was nearly linear with the number of parallel processes. So going from 6 cores to 12 SMT threads was a ~1.9x speedup. But when I went to 24 parallel processes (2 per SMT thread), reading from disk or network got only a little faster. More cores and SMT threads would have helped.
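To make it concrete, here is a minimal sketch of the kind of parallel reads I mean (generic C++17; the file names and the one-task-per-file layout are illustrative assumptions, not my actual code):

```cpp
// Minimal sketch: issue many file reads in parallel so the SSD sees a deep
// queue of outstanding requests instead of one at a time. File names are hypothetical.
#include <cstddef>
#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Read one measurement file into memory (stand-in for the real parsing).
static std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

int main() {
    // In practice this was thousands of files, not three.
    const std::vector<std::string> paths = {"run0.dat", "run1.dat", "run2.dat"};

    // One async task per file; with enough files in flight, 12 SMT threads
    // can keep 12 requests outstanding at once.
    std::vector<std::future<std::string>> jobs;
    for (const auto& p : paths)
        jobs.push_back(std::async(std::launch::async, read_file, p));

    std::size_t total = 0;
    for (auto& j : jobs)
        total += j.get().size();   // wait for each read and consume the result
    return total > 0 ? 0 : 1;
}
```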
 
Last edited:

Geddagod

Golden Member
Dec 28, 2021
1,440
1,551
106
Their answer was Unified Core for this, and that's end of 2028 at best though.
Separate archs should still be better.
The fact that Intel's roadmap is to have a unified core strategy in the future is a VERY strong indicator that Intel engineering also believes that dissimilar architectures for P and E cores have too many downsides.
What are the other downsides besides cost, though?
AMD is rumored to be diverging with Zen 7 IIRC, and Qualcomm and Apple also have separate P and E-core archs.
I would imagine an architecture specifically designed to hit a certain frequency at a certain voltage given an area constraint is going to be better than an architecture designed for something else and then scaled down using physical design means. Ig the real question is whether the PPA differences between the two approaches are worth the extra design effort.
 

adroc_thurston

Diamond Member
Jul 2, 2023
6,174
8,696
106
Separate archs should still be better.
Which is why Intel is moving away from that?
AMD is rumored to be diverging with Zen 7 IIRC
No, they're hacked down variants of Zen7 IP.
No different from AMD neutering Zen2 FPU for Sony.
No, they're half-chops of the mainline big.
Apple also have separate P and E-core archs
The only one left yes.
Besides ARM. But ARM plays in many, many more markets.
 

Geddagod

Golden Member
Dec 28, 2021
1,440
1,551
106
Which is why Intel is moving away from that?
They don't have the money anymore. Also consolidating talent is a good reason too ig.
No, they're hacked down variants of Zen7 IP.
No different from AMD neutering Zen2 FPU for Sony.
AMD's cut-down FPU for Sony was still pretty academically interesting, at least. Hopefully it's a similar case for Zen 7.
No, they're half-chops of the mainline big.
They are architecturally very different, according to Geekerwan at least.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,162
568
126
AMD is rumored to be diverging with Zen 7 IIRC, and Qualcomm and Apple also have separate P and E-core archs.
Correct. Basically everyone is doing some version of big.LITTLE nowadays. Even AMD with the Zen5 vs 5C cores, and soon they’ll have LP cores with Zen6 as well.

Apple, Qualcomm, Arm, Mediatek, Intel, Samsung, etc. are doing the same. Yes, some are Arm derivatives, but they still intentionally chose to use several core types in big.LITTLE style, instead of only selecting one Arm core type as base.

It's quite obvious, since it's a better architectural solution. You want to have separate core types performing their best for their frequency range. A few fast cores operating at the max frequency for good ST (or low thread count) perf, and then the rest of the cores operating at lower frequency for MT usage with better perf/watt and also often better perf/$.
 
  • Like
Reactions: Vattila

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,113
16,023
136
Correct. Basically everyone is doing some version of big.LITTLE nowadays. Even AMD with the Zen5 vs 5C cores, and soon they’ll have LP cores with Zen6 as well.

Apple, Qualcomm, Arm, Mediatek, Intel, Samsung, etc. are doing the same. Yes, some are Arm derivatives, but they still intentionally chose to use several core types in big.LITTLE style, instead of only selecting one Arm core type as base.

It's quite obvious, since it's a better architectural solution. You want to have separate core types performing their best for their frequency range. A few fast cores operating at the max frequency for good ST (or low thread count) perf, and then the rest of the cores operating at lower frequency for MT usage with better perf/watt and also often better perf/$.
Personally, I doubt that AMD will go big.LITTLE. The dense cores seem to work for them in server. They have no place in client IMO.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,371
2,996
136
Client desktop, it's basically a wash. Client mobile? That's a different beast. For mobile, you absolutely have to deal with two big differences: idle power draw and sustained heavy MT thermals. To handle those, you need cores optimized for both cases. And, unsurprisingly, that's what we're seeing with AMD. Desktop, save for the odd mobile-derived APU, is homogeneous big cores. Mobile is either power-restricted desktop parts (for now) or their implementation of big.LITTLE.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,371
2,996
136
As for AMD's core strategy, I'm kind of surprised that they haven't adopted the route where there are 4 full-spec 512-bit AVX-512 cores for max clock and a sea of 256-bit C cores. Like a more extreme Strix Point.
 

DrMrLordX

Lifer
Apr 27, 2000
22,743
12,742
136
It is a Q1 '26 product, so it will only have like 3-6 months vs Turin Dense before Venice Dense launches.

Since this is a Zen6 thread and not a Clearwater Forest thread, I'll just agree to disagree and instead reiterate that Intel won't be accomplishing much if they're able to compete well against Zen5 a mere 6 months (or less) from Zen6's launch.

Venice D should do a comfortable tap dance on CWF.... but CWF should eclipse Turin D in many applications. I am not quite sure about AVX512 enabled apps though because I am not certain how effective (if at all) CWF will be at executing AVX.

See above: the only reason you are not looking at Venice-dense in this scenario is if someone is cutting you a very good deal on Clearwater Forest and/or vendor lock-in (which is much less of a thing nowadays). Or maybe Venice-dense is sold out and you're stuck with the scraps.
 

MS_AT

Senior member
Jul 15, 2024
779
1,584
96
I'm kind of surprised that they haven't adopted the route where there are 4 full spec 512 bit AVX512 cores for max clock and a sea of 256bit C cores. Like a more extreme Strix Point.
Well, but the most successful mobile CPU is not using a sea-of-cores approach. You have at most 4 E cores in Apple M chips IIRC.

I’d argue it might be more effective to equip the small cores with AVX-512, while simplifying their branch handling logic. Meanwhile, the fast cores could trade full AVX512 implementation for enhanced branch prediction and control logic, making them better suited for handling complex, branch-heavy scalar code and dispatching tasks to the smaller cores. This aligns with what I’ve read in some research papers—though not tied to any specific architecture—suggesting a hybrid model: a few powerful cores for control-heavy workloads, and many simpler cores for throughput-oriented tasks.

Placing AVX-512 on the small cores could also help reduce front-end pressure, as more work can be done per instruction, which means the decoder could be smaller. Since these cores aren't designed to clock high, the frequency drop from AVX-512's current draw would be less impactful compared to big cores. This assumes a full AVX-512 implementation on small cores, hoping that savings from reduced branch prediction units, register files, and buffers could keep the area small enough to let you spam them.

But the Frankenstein above would fit mostly in DC rather than in consumer space. Not to mention it gives me vibes of Knights Landing merged with Lion Cove. Yeah, I think I got carried away ;)
 

StefanR5R

Elite Member
Dec 10, 2016
6,589
10,375
136
Frankenstein
All of AMD's presently available CPUs in which physically differing cores are combined ( = Phoenix 2 and Strix Point) are aligned with one singular observation:

A power-constrained CPU can clock high when few software threads run, and has to stick with a moderate if not low clock speed when more software threads run.

Hence, AMD put cores with different f_max into these two CPUs, enabling some area savings in the cores which have lower f_max. (Also, lower f_max goes with a lower amount of last-level cache per core, which saves area too.) Otherwise, all of the cores within the CPU are the same. Operating system kernels have dealt with this setup well ever since the introduction of Intel® Turbo Boost Max Technology 3.0.

This has *nothing* to do with big.LITTLE in telephones.

Now, back to the topic: First there was the rumor that Strix Halo would introduce another class of AMD cores: Low-power cores, which sit in an extra core complex on the I/O die. (This did not materialize; whether these cores do not exist in the product or do exist but are fused off remains unknown until somebody outside AMD creates high-resolution I/O die shots.) Now the rumor is that Medusa Point (?) will receive such low-power cores. The raison d'être of these cores is
  • to run background stuff in connected standby mode, while the regular core complexes are in very low power mode (similar to Meteor Lake's LPE core complex),
  • *maybe* (or maybe not) to run some less CPU-intensive stuff during interactive use (similar to Lunar Lake's E core complex). Well, probably not, IMO.
Beyond adding special cores for this new (to AMD) specific function, I do not see AMD suddenly adopting Dr. Frankenstein's medical ethos.

Edit,
a few powerful cores for control-heavy workloads, and many simpler cores for throughput-oriented tasks [...] would fit mostly in DC, rather than in consumer space.
Not even into DC generally, but more into HPC. That's the one place where scientists tell the OS how to schedule software threads. ;-)
 
Last edited:

MS_AT

Senior member
Jul 15, 2024
779
1,584
96
This has *nothing* to do with big.LITTLE in telephones.
Indeed, I hope I did not give the impression that it does. I got triggered by the sea of cores, where the academics suggest using a few powerful cores and a sea of simpler cores, possibly with accelerators attached.
That's the one place where scientists tell the OS how to schedule software threads
Yup, I put the wrong abbreviation at the end, but I had HPC in mind. After all, this proposed CPU is conceptually closer to Grace Hopper or MI300A than a plain Epyc [you could treat the small cores as accelerators, even though it would still be one CPU instead of going off-package]. Btw, software optimization guides from AMD and Intel also suggest game developers be mindful of thread scheduling ;)
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,162
568
126
Because compilers and thread schedulers expect that parallelized programs will be executed on homogeneous hardware.
Not if they have been adapted to execute on CPUs with heterogeneous cores, which basically all major ones have. E.g. the Windows, Linux, Apple iOS/macOS, and Android OS schedulers have all been adapted for this.
 

StefanR5R

Elite Member
Dec 10, 2016
6,589
10,375
136
Indeed, I hope I did not give a suggestion it does.
When I added the big.LITTLE reference, I still had the discussion line of #5,235 in mind but neglected to quote it.
Btw, software optimization guides from AMD and Intel also suggest to game developers to be mindful of thread scheduling;)
Do the guides also have suggestions for developers on how to talk with their managers about the virtues of performance optimization? :-)
 

StefanR5R

Elite Member
Dec 10, 2016
6,589
10,375
136
somehow i don't have confidence in this OS for scheduling
It's not just the OS. Also think about applications with, for example, an OpenMP thread pool. How are those threads supposed to know that a few of them need to work on bigger data subdomains than the other threads, if the performance optimum is that all threads finish together? (And how much bigger, precisely?) And is it even possible for the particular application to divide the data into subdomains of uneven sizes?
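To illustrate what I mean (a generic sketch, not tied to any particular application): with OpenMP's default static schedule, every thread gets an equal slice of iterations, so threads that land on slower cores become the long pole. schedule(dynamic) sidesteps the "how much bigger" question by letting threads pull chunks as they free up, at the cost of scheduling overhead:

```cpp
// Generic sketch of static vs. dynamic OpenMP scheduling; the work itself
// is a trivial sum, just to keep the example self-contained.
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    std::vector<double> data(n, 1.0);

    double sum1 = 0.0;
    // Static: iterations are split evenly up front, regardless of whether a
    // thread landed on a fast or a slow core.
    #pragma omp parallel for schedule(static) reduction(+:sum1)
    for (int i = 0; i < n; ++i) sum1 += data[i];

    double sum2 = 0.0;
    // Dynamic: threads grab 4096-iteration chunks as they become free, so
    // faster cores simply process more chunks; nobody has to know core speeds.
    #pragma omp parallel for schedule(dynamic, 4096) reduction(+:sum2)
    for (int i = 0; i < n; ++i) sum2 += data[i];

    std::printf("static=%f dynamic=%f\n", sum1, sum2);
    return 0;
}
```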

One poster here thinks that all parallel computing works like he has seen it in Cinebench.

Anyway. More relevant to Zen 6 Speculation seems to me the introduction of a low-power island of cores, which is not about performance but about battery life during near-idle scenarios. Totally different case, but also anything but trivial for OS schedulers [edit: and their power management drivers]. Energy-aware scheduling is still a work in progress, AFAIK.
 
Last edited:

Fjodor2001

Diamond Member
Feb 6, 2010
4,162
568
126
It's not just the OS. Also think about applications with, for example, OpenMP threadpool. How are those threads supposed to know that a few of them need to work on bigger data subdomains than the other threads, if the performance optimum is that all threads finish together? (And how much bigger precisely?) And is it even possible to the particular application to divide the data into subdomains of uneven sizes?
Such problems are not specific to heterogeneous CPUs.

You'll have similar problems e.g. for cores that are running at different turbo frequencies, even if you're using a homogeneous CPU. And often there is no way to precisely know beforehand how to divide all the work among the CPU cores at the start anyway, because it's hard to deterministically know how long all the sub-tasks included in the work will take. So it's good to solve these issues regardless of whether you're using a homogeneous or heterogeneous CPU.

Depending on the work that needs to be done, there are different solutions to this. You can e.g. divide the work into chunks, and once a thread is done with its current chunk, it dynamically gets assigned another chunk to start working on. Some cores will complete more chunks than others, depending on how fast the cores are and how much work turns out to be needed for each of the various chunks in practice. As a practical example, some cores may compile more source code files than others before the final end time when all work is done.
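As a minimal sketch of that chunking idea (generic C++; the per-chunk busy-work below is just a placeholder for, say, compiling one source file):

```cpp
// Minimal sketch of dynamic chunk assignment: a shared atomic counter hands
// out the next chunk to whichever worker finishes first, so faster cores
// simply end up completing more chunks.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_chunks = 256;
    std::atomic<int> next{0};
    std::vector<long long> result(num_chunks, 0);

    auto worker = [&]() {
        for (;;) {
            const int c = next.fetch_add(1);  // claim the next unprocessed chunk
            if (c >= num_chunks) break;       // no work left
            long long acc = 0;                // placeholder for the real per-chunk work
            for (int i = 0; i < 100000; ++i) acc += static_cast<long long>(c) * i;
            result[c] = acc;
        }
    };

    std::vector<std::thread> pool;
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned t = 0; t < n; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::printf("completed %d chunks on %u threads\n", num_chunks, n);
    return 0;
}
```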
 
Last edited:

OneEng2

Senior member
Sep 19, 2022
730
978
106
This is a Zen 6 thread, NOT an Intel one.
Agree; however, the discussion of architectural comparison and which paths might work out better seems to be pertinent to the subject.
Separate archs should still be better.
I personally don't think so.
What are the other down sides other than cost though?
AMD is rumored to be diverging with Zen 7 IIRC, and Qualcomm and Apple also have separate P and E-core archs.
I would imagine an architecture specially designed to hit a certain frequency at a certain voltage given an area constraint is going to be better than an architecture designed for something else, and then being scaled down using physical design means. Ig the real question is if the PPA differences between the two approaches are worth the extra design effort.
First, I believe that the most important high core count market is DC or workstation. In both situations, having fully functional P cores with all the bells and whistles seems to perform better overall and at better efficiency (at least from what I have seen so far).

In the desktop ... same discussion.

In Laptop .... now we have some interesting things to talk about.

One could argue that the idea of putting an LP Zen 6c core on-die with the IOD is a great way to lower power and extend battery life while using the same P core design; however, this approach DOES fall prey to BIOS and OS scheduling weaknesses.

If a high demand thread is placed on a P core that resides in a low clock, low cache area, it will perform badly. With this understanding, the question really comes down to PPA for the two approaches.... and perhaps the engineering resources needed to support and maintain 2 different architectures vs one.
Correct. Basically everyone is doing some version of big.LITTLE nowadays. Even AMD with the Zen5 vs 5C cores, and soon they’ll have LP cores with Zen6 as well.
I would say that until AMD moves a Zen 6-ish core to the IOD, it is largely true that the architecture is nearly immune to OS scheduling inefficiencies, unlike Intel.

Once AMD puts a Zen 6c on the IOD (Zen 6 LP), I think this distinction disappears.
Personally, I doubt that AMD will go big.little. The dense cores seem to work for them in server. They no place in client IMO.
I think that low power cores are vital in laptop designs.
Client desktop, it's basically a wash. Client mobile? That's a different beast. For mobile, you absolutely have to deal with two big differences: idle power draw and sustained heavy MT thermals. To handle those, you need cores optimized for both cases. And, unsurprisingly, that what we're seeing with AMD. Desktop, save for the odd mobile derived APU, is homogeneous big cores. Mobile is either power restricted desktop parts (for now) and their implementation of big.LITTLE.
100% agree!
Because compilers and thread schedulers expect that parallelized programs will be executed on homogeneous hardware.
I think this is changing. With lots of integrated GPU units, NPU units, LP cores, P cores, etc, I think it is very likely that OS scheduling is in for a big overhaul to handle these things much more effectively in the future.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,371
2,996
136
I am coming to believe that tasks like laptop processor power management are a niche where AI will really excel. We need to address the power draw of actually running the AI itself, but having something that's trained and laser-focused on managing scheduling and processor power states, with even a modicum of intelligence, would go a VERY long way toward optimizing all of this. I know that AMD has, in the past, touted their use of "AI" when it comes to managing the internals of their processors, but my opinion is that, while it seems to be keeping them competitive in most cases, it's far too limited in capability at this point to tackle what we're talking about.
 
  • Like
Reactions: Tlh97 and 511

StefanR5R

Elite Member
Dec 10, 2016
6,589
10,375
136
I am coming to believe that tasks like laptop processor power management are a niche where AI will really excel. We need to address the power draw of actually running the AI itself, but having something that's trained and laser-focused on managing scheduling and processor power states, with even a modicum of intelligence, would go a VERY long way toward optimizing all of this. I know that AMD has, in the past, touted their use of "AI" when it comes to managing the internals of their processors, but my opinion is that, while it seems to be keeping them competitive in most cases, it's far too limited in capability at this point to tackle what we're talking about.
Yep, we have had "AI" in CPUs for quite a while now: branch predictors and prefetchers do not merely rely on hardwired heuristics, they actually employ machine learning.

Power management and thread scheduling are still basically a bunch of predetermined heuristics, AFAIK. They might perhaps benefit from some added ML, indeed.
 
  • Like
Reactions: Tlh97 and marees