Question Zen 6 Speculation Thread

Page 211

BorisTheBlade82

Senior member
May 1, 2020
702
1,116
136
Not if they have been adapted to execute on CPUs with heterogeneous cores, which basically all major ones have. E.g. the Windows, Linux, Apple iOS/macOS, and Android schedulers have all been adapted for this.
Sorry, but I need to disagree. If that worked like a charm as you say, how come Intel fused off AVX512 from Alder Lake onwards in order to present compilers and the OS with a homogeneous ISA?
Even if compilers were able to handle this, everything would have to be recompiled, which in itself is a stretch.
Secondly, it is not just recompiling: the code would need to be refactored to ask the scheduler, via flags, for certain ISA capabilities on a thread-by-thread basis at runtime.
 

OneEng2

Senior member
Sep 19, 2022
719
961
106
Sorry, but I need to disagree. If that worked like a charm as you say, how come Intel fused off AVX512 from Alder Lake onwards in order to present compilers and the OS with a homogeneous ISA?
Even if compilers were able to handle this, everything would have to be recompiled, which in itself is a stretch.
Secondly, it is not just recompiling: the code would need to be refactored to ask the scheduler, via flags, for certain ISA capabilities on a thread-by-thread basis at runtime.
That is what I was thinking as well. Today you can ask for a task to be carried out in a thread and assign that thread to a core through "SetThreadAffinityMask", and you would then need to use "GetLogicalProcessorInformationEx" to determine P or E core status, but as of today I am not aware of any way to target an LPE core. And all of this requires the business logic of EACH program to be re-written to take advantage of the core architecture. Not ONLY is this a giant PITA, it may well be impossible to create logic that is well optimized for many processors under differing conditions.

This is why I think it belongs in the OS scheduler.

Still, I don't think these things are very mature at present.... making it easier for a uniform architecture to thrive vs a hybrid architecture.
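
For reference, here is roughly what the hybrid-aware version of this looks like with the newer CPU-set APIs instead of raw affinity masks. GetSystemCpuSetInformation, its EfficiencyClass field and SetThreadSelectedCpuSets are documented Win32 calls; treating the highest efficiency class as "the P cores" is an assumption, and as noted above nothing here singles out an LPE core, only relative classes. A rough sketch, not production code:

```cpp
// Rough sketch (Win32, Windows 10+): enumerate CPU sets, treat the highest
// EfficiencyClass as the "P" cores, and ask the scheduler to prefer them for
// the current thread. EfficiencyClass only reports relative classes, so a
// hypothetical LPE tier would just show up as another (lower) class value.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    ULONG len = 0;
    GetSystemCpuSetInformation(nullptr, 0, &len, GetCurrentProcess(), 0);
    std::vector<char> buf(len);
    auto* info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data());
    if (!GetSystemCpuSetInformation(info, len, &len, GetCurrentProcess(), 0))
        return 1;

    BYTE topClass = 0;
    std::vector<ULONG> preferredIds;
    for (ULONG off = 0; off < len;) {
        auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buf.data() + off);
        if (e->Type == CpuSetInformation) {
            if (e->CpuSet.EfficiencyClass > topClass) {   // found a faster class
                topClass = e->CpuSet.EfficiencyClass;
                preferredIds.clear();
            }
            if (e->CpuSet.EfficiencyClass == topClass)
                preferredIds.push_back(e->CpuSet.Id);
        }
        off += e->Size;
    }

    // "Soft" affinity: the scheduler prefers these cores but may still move
    // the thread elsewhere under load, unlike SetThreadAffinityMask.
    SetThreadSelectedCpuSets(GetCurrentThread(), preferredIds.data(),
                             static_cast<ULONG>(preferredIds.size()));
    std::printf("Preferring %zu cores of efficiency class %u\n",
                preferredIds.size(), static_cast<unsigned>(topClass));
    return 0;
}
```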
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,110
16,021
136
Sorry, but I need to disagree. If that worked like a charm as you say, how come Intel fused off AVX512 from Alder Lake onwards in order to present compilers and the OS with a homogeneous ISA?
Alder Lake took too much power and they hit their max. I know personally, as I had a 12700K.

Sorry, I had to reply, I know this is Zen 6. Back to topic please!
 
Last edited:

Doug S

Diamond Member
Feb 8, 2020
3,351
5,870
136
That is what I was thinking as well. Today you can ask for a task to be carried out in a thread and assign that thread to a core through "SetThreadAffinityMask", and you would then need to use "GetLogicalProcessorInformationEx" to determine P or E core status, but as of today I am not aware of any way to target an LPE core. And all of this requires the business logic of EACH program to be re-written to take advantage of the core architecture. Not ONLY is this a giant PITA, it may well be impossible to create logic that is well optimized for many processors under differing conditions.

This is why I think it belongs in the OS scheduler.

Still, I don't think these things are very mature at present.... making it easier for a uniform architecture to thrive vs a hybrid architecture.

Even if you control it from the standpoint of your code, the OS itself may use AVX512 in some places. For example, memcpy() might use it for longer copies. Obviously it checks for support before using that code path, but what happens if your process is switched out between the time it checks for AVX512 support and the time it executes the last AVX512 instruction? If it gets rescheduled on an E core which doesn't support AVX512 then your process is killed due to an illegal instruction fault.

Now, sure, there are ways to mitigate that. Maybe when an E core hits an AVX512 instruction, instead of faulting with "illegal instruction" it raises a new fault, "unsupported instruction". Then the OS can be written so that if it sees that, it will reschedule the process to run on a P core which is able to execute that instruction and any other AVX512 instructions that may follow. It might suck for performance though, as it could be a while (relatively speaking) before a P core is available to continue execution on.

This isn't the only corner case it would have to handle, and if Microsoft told Intel "we don't want to deal with that kind of stuff, you need to make all cores able to execute the same instructions" Intel wouldn't have much choice.
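
To make the race described above concrete, this is the usual check-then-use dispatch pattern in user code (glibc actually selects its memcpy variant once at load time via IFUNC, but the window is the same idea). The GCC/Clang builtin and the intrinsics are real; the copy routine itself is just a toy sketch:

```cpp
// Toy dispatcher showing the check-then-use window: the feature test runs on
// whatever core the thread is on *now*, while the AVX-512 loop may run later,
// possibly on a different core if the scheduler migrates the thread in between.
#include <immintrin.h>
#include <cstring>
#include <cstddef>

__attribute__((target("avx512f")))
static void copy_avx512(char* dst, const char* src, size_t n) {
    size_t i = 0;
    for (; i + 64 <= n; i += 64) {                    // 64-byte ZMM chunks
        __m512i v = _mm512_loadu_si512(src + i);
        _mm512_storeu_si512(dst + i, v);
    }
    std::memcpy(dst + i, src + i, n - i);             // tail
}

void fast_copy(char* dst, const char* src, size_t n) {
    if (__builtin_cpu_supports("avx512f"))            // check here...
        copy_avx512(dst, src, n);                     // ...execute (maybe) elsewhere
    else
        std::memcpy(dst, src, n);                     // baseline path
}
```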
 

LightningZ71

Platinum Member
Mar 10, 2017
2,370
2,990
136
Keep in mind, the various core types don't all have to execute every instruction particularly well. AMD could make a processor with a few full-fat P cores, Zen 5 mobile-style half-width AVX512 C cores, and a few heavily microcoded, quarter-width, minimum-function-spec LPE cores that abandon all pretense of performance in favor of the absolute minimum power draw and maximum density achievable. As long as every core can properly execute the full advertised instruction gamut, the OS really doesn't care how long it takes.
 
  • Like
Reactions: Kryohi

OneEng2

Senior member
Sep 19, 2022
719
961
106
I wasn't thinking about the "can't execute" problem, but more of the "shouldn't go there" situation because it would be slow.

Both are valid problems though.... and as your example shows, a big problem.

AMD's approach of using the same core architecture gets around the "can't execute" problem, but doesn't get around the "shouldn't go there" problem.

Without knowledge of which threads should be executed on which level of performance cores, things get messed up and slow down.

This is particularly true of the LPE cores. This is an architectural direction I believe both AMD Zen 6 and Intel Nova Lake are pursuing.
 

fastandfurious6

Senior member
Jun 1, 2024
615
774
96
LPE cores make a lot of sense, E cores not so much

the ecosystem is now mature enough to streamline essential OS and core software loops (i.e. browsers) and throw them into LPE cores..... PEG THEM into those tiny little caches and dedicated little cores... For Those Who Do Not Benefit From P Cores

power/battery consumption will drop to nothing and P cores will be more free for what actually reaps the benefits for snappiness


but E cores... E cores are a mess.... stuff gets shuffled ALL the time between core clusters and it becomes a sluggish mess when there's too much stuff running
 

LightningZ71

Platinum Member
Mar 10, 2017
2,370
2,990
136
Perhaps it is possible for Windows to take topology hints from chipset drivers that would push storage I/O operations to the LPE cores that are located on the I/O die? It makes no sense to saturate the CCD link with disk I/O traffic that isn't necessarily needed by any of those cores. If the file system is doing housekeeping, or an AV scanner or auditing process is banging away at the SSD, why drag that stuff into the CCD? Even the LPE cores will be plenty fast enough to keep up with disk IO without pushing too high in the VF curve. Same thing with network I/O.
 

fastandfurious6

Senior member
Jun 1, 2024
615
774
96
Perhaps it is possible for Windows to take topology hints from chipset drivers that would push storage I/O operations to the LPE cores that are located on the I/O die? It makes no sense to saturate the CCD link with disk I/O traffic that isn't necessarily needed by any of those cores. If the file system is doing housekeeping, or an AV scanner or auditing process is banging away at the SSD, why drag that stuff into the CCD? Even the LPE cores will be plenty fast enough to keep up with disk IO without pushing too high in the VF curve. Same thing with network I/O.

100%

even the highest-end PCs get sluggish when there's a lot of I/O going on

no? that would happen with C cores as well because the performance gap is too large

come on, tell me how much is big I paying you 😭🤣 jk
 
  • Like
Reactions: Tlh97 and Gideon

BorisTheBlade82

Senior member
May 1, 2020
702
1,116
136
Perhaps it is possible for Windows to take topology hints from chipset drivers that would push storage I/O operations to the LPE cores that are located on the I/O die? It makes no sense to saturate the CCD link with disk I/O traffic that isn't necessarily needed by any of those cores. If the file system is doing housekeeping, or an AV scanner or auditing process is banging away at the SSD, why drag that stuff into the CCD? Even the LPE cores will be plenty fast enough to keep up with disk IO without pushing too high in the VF curve. Same thing with network I/O.
Idk. This would need a very deep understanding by the scheduler of what a thread actually does, at runtime.
IMHO, nowadays schedulers more or less only look at the assigned priorities and some monitoring data like current processing demand.
I personally think that every form of categorization, allow-listing, or flagging by developers would be a failure, as a lot of developers will simply stamp their applications as being the most important ones in the universe 😄
 
  • Haha
Reactions: Thibsie

fastandfurious6

Senior member
Jun 1, 2024
615
774
96
should be very easy to flag core OS functions.........

beyond that what's needed is a solid framework and visibility for scheduling and priority

in an ideal world that could even be visible and tweakable by users... but the state of real software code is a total nightmare, which is why scheduling is still so opaque
 

StefanR5R

Elite Member
Dec 10, 2016
6,580
10,355
136
Perhaps it is possible for Windows to take topology hints from chipset drivers that would push storage I/O operations to the LPE cores that are located on the I/O die?
This would need a very deep understanding by the scheduler of what a thread actually does, at runtime.
I/O wait times of a thread should already be known to the kernel.
Might not work as intended though if the respective application has got separate I/O initiator and data processing threads.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,370
2,990
136
Shouldn't really matter. I/O requests are made to the OS anyway in most situations. The OS should have perfect understanding of what is a disk I/O request and what isn't.

To a certain extent, this is similar to what Apple was doing with the T1/T2 coprocessor on their PCs and laptops, functions that have now been migrated into the M series processors...
 

Doug S

Diamond Member
Feb 8, 2020
3,351
5,870
136
I wasn't thinking about the "can't execute" problem, but more of the "shouldn't go there" situation because it would be slow.

Both are valid problems though.... and as your example shows, a big problem.

AMD's approach of using the same core architecture gets around the "can't execute" problem, but doesn't get around the "shouldn't go there" problem.

Without knowledge of which threads should be executed on which level of performance cores, things get messed up and slow down.

This is particularly true of the LPE cores. This is an architectural direction I believe both AMD Zen 6 and Intel Nova Lake are pursuing.

You could mostly solve the "shouldn't go there" problem by having the CPU flags lie - a core might support sluggish AVX512 but when checked the CPU flags will claim that it doesn't support AVX512. That way it'll select a more effective non-AVX512 code path, but for the corner cases where a process is switched from a P core with good AVX512 performance to an E/LPE core with wimpy AVX512 performance in the middle of an AVX512 sequence it'll still complete, just more slowly than it otherwise would have.

There are alternate architectural decisions to consider as well. Look at what Apple did with SME2. From the time they created it as an Apple-only "AMX" instruction group, it has been not a per-core unit but a per-cluster one. They considered it important to have, but not important enough that every core needs to have its own full sized AMX unit. From the standpoint of the instruction stream there is no difference, and indeed there would be no way to know other than if you tried to have two P cores execute AMX instructions at once (not sure what happens, I assume the second core will generate some sort of exception treated similarly to when a process is waiting on I/O, and the scheduler handles requeueing that process to run when the AMX unit is free)

There's no reason small cores couldn't support AVX512 at the same speed as big cores while remaining area efficient - if your cluster of small cores all shared one (or more) full sized AVX512 execution resources. Then when they encounter AVX512 instructions they can run them at full speed, but small cores aren't likely to be scheduled for the kind of heavy number crunching tasks that grind away at long sequences of AVX512 instructions (at least not if your scheduler is doing its job correctly) so sharing should work fine unless you're trying to "win" at Cinebench.
 
  • Like
Reactions: Tlh97 and OneEng2

Fjodor2001

Diamond Member
Feb 6, 2010
4,153
563
126
Sorry, but I need to disagree. If that worked like a charm as you say, how come Intel fused off AVX512 from Alder Lake onwards in order to present compilers and the OS with a homogeneous ISA?
Even if compilers were able to handle this, everything would have to be recompiled, which in itself is a stretch.
Secondly, it is not just recompiling: the code would need to be refactored to ask the scheduler, via flags, for certain ISA capabilities on a thread-by-thread basis at runtime.
There are solutions to these problems, otherwise all the CPU manufacturers I mentioned previously would not have chosen to go with big.LITTLE style CPUs. Apparently they concluded that the benefits are greater than the drawbacks.
 
  • Like
Reactions: 511

Fjodor2001

Diamond Member
Feb 6, 2010
4,153
563
126
Even if you control it from the standpoint of your code, the OS itself may use AVX512 in some places. For example, memcpy() might use it for longer copies. Obviously it checks for support before using that code path, but what happens if your process is switched out between the time it checks for AVX512 support and the time it executes the last AVX512 instruction? If it gets rescheduled on an E core which doesn't support AVX512 then your process is killed due to an illegal instruction fault.
The kernel could hold a lock preventing rescheduling of that thread while it’s executing instruction types that are not available on other core types. Or only allow rescheduling to other cores of the same type while executing such instructions. The execution of one or a few AVX512 instructions takes very little time, so it should not be a problem to wait for it to complete before rescheduling.

Also, this kind of low-level stuff is usually done in libraries (or the OS). So applications will get most of it ”for free”, assuming they use those libraries and the libraries in turn have been recompiled to make use of e.g. AVX512 when available. I.e. it is not necessary to recompile all applications, even if doing so may of course bring some additional benefit, since then the code in the applications can also make use of such instructions directly.

By the way, this is also similar to e.g. realtime locks (mutexes/semaphores/…) that also are used to prevent rescheduling at undesirable times in the execution path.

It’s really all about typical and classical SW Engineering problems, for which there are solutions available.
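
The "for free via libraries" part normally works through load-time dispatch. Below is a sketch of the GNU IFUNC mechanism glibc uses for memcpy and friends (GCC/Clang on Linux); the memfill symbol and both implementations are made-up placeholders, the point being that the resolver runs once when the library is loaded and binds the symbol to the best implementation for the CPU it probes:

```cpp
// GNU IFUNC sketch: the dynamic loader calls the resolver once, before main(),
// and binds the exported symbol to whichever implementation it returns. Apps
// linked against the library pick up the AVX-512 path without recompiling.
// "memfill" and both implementations are illustrative placeholders.
#include <cstddef>
#include <cstring>

extern "C" {

static void* memfill_baseline(void* dst, int c, size_t n) {
    return std::memset(dst, c, n);            // plain path
}

static void* memfill_avx512(void* dst, int c, size_t n) {
    // A real library would put a hand-tuned AVX-512 loop here.
    return std::memset(dst, c, n);
}

// Resolver: runs at load time, on whatever core the loader happens to use.
static void* (*resolve_memfill(void))(void*, int, size_t) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? memfill_avx512 : memfill_baseline;
}

// The symbol applications actually call.
void* memfill(void* dst, int c, size_t n) __attribute__((ifunc("resolve_memfill")));

}  // extern "C"
```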
 

MS_AT

Senior member
Jul 15, 2024
769
1,554
96
The kernel could hold a lock preventing rescheduling of that thread while it’s executing instruction types that are not available on other core types. Or only allow rescheduling to other cores of the same type while executing such instructions. The execution of one or a few AVX512 instructions takes very little time, so it should not be a problem to wait for it to complete before rescheduling.
Do you expect the OS to inspect the instruction streams of apps and do some sort of disassembly to understand that a forbidden instruction is about to run? ;) Remember that OS ticks are milliseconds apart. It would need to inspect the code millions of cycles ahead and then run its own branch predictor. That doesn't sound feasible, really.

You have two choices. Either the app marks itself as requiring a specific instruction set on load and the OS sets an affinity mask for those apps to fit the instruction requirements, which sounds nice until you realise there is a sea of legacy code for which this trick won't work. That leaves you with the second choice: trap the illegal instruction and reschedule, which takes lots and lots of cycles.

Scanning the whole app at load time looking for forbidden instructions also won't solve the problem sensibly, as it will wrongly flag apps that compile in support for multiple instruction sets and do dynamic dispatch at runtime.
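
For what it's worth, the second choice can at least be prototyped from userspace without touching the kernel: catch the fault and re-pin the thread before retrying. A rough Linux sketch, assuming we somehow know which CPU numbers implement the instruction (CAPABLE_CPUS below is invented for illustration); a real solution would sit in the kernel's fault handler and would not pay a signal round-trip per occurrence:

```cpp
// Userspace analogue of "trap and reschedule": on SIGILL, narrow the thread's
// affinity to cores assumed to implement the instruction and return, so the
// kernel migrates the thread and the faulting instruction is retried there.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <signal.h>
#include <cstdlib>

static const int CAPABLE_CPUS[] = {0, 1, 2, 3};   // hypothetical "big core" IDs

static void on_sigill(int, siginfo_t*, void*) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    for (int cpu : CAPABLE_CPUS)
        CPU_SET(cpu, &allowed);
    // Re-pin the current thread (0 = calling thread). On return from the
    // handler the instruction re-executes, now on a capable core.
    if (sched_setaffinity(0, sizeof(allowed), &allowed) != 0)
        std::_Exit(1);                            // nowhere safe to run it
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGILL, &sa, nullptr);

    // ... run code that may use instructions only some cores implement ...
    return 0;
}
```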
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,153
563
126
Do you expect the OS to inspect the instruction streams of apps and do some sort of disassembly to understand that a forbidden instruction is about to run? ;) Remember that OS ticks are milliseconds apart. It would need to inspect the code millions of cycles ahead and then run its own branch predictor. That doesn't sound feasible, really.
You can put locks preventing rescheduling and checks at suitable intervals. Or have an exception handler be called automatically when an illegal instruction is executed (e.g. AVX512 on a core that does not support it), and then deal with it there.

There are several different possible solutions and I’m not sure which one is best. But obviously it has been solved, since all major OSes support big.LITTLE style CPUs. I’d recommend digging into the Linux kernel source code, which is public, if you want to find out the details of how it has been solved there:


But having said that, with regards to the specific big.LITTLE issue of having different instructions per core type, isn’t the solution simply to avoid having that in the first place? IIRC, for Arm the cores have the same ISA on both the big and the LITTLE cores, so the ”AVX512 only exists on one of the core types” kind of problem you described should not even occur.
 
Last edited:

MS_AT

Senior member
Jul 15, 2024
769
1,554
96
You can put locks preventing rescheduling and checks at suitable intervals.
If you mean to disable interrupts around problematic code to ensure the kernel will not tick and try to reschedule while you are executing problematic code, then good luck to you ;) How would you decide the granularity of the critical section? After every instruction? After every 10 instructions?
Or have an exception handler be called automatically when an illegal instruction is executed (e.g. AVX512 on a core that does not support it), and then deal with it there.
I did mention this option but it's costly. Syscalls are expensive.
But obviously it has been solved, since all major OSes support big.LITTLE style CPUs.
Yes, that's why all big.LITTLE CPUs support uniform ISAs ;) Unless you can point at one example where that is not the case? :)

Also, if you are familiar with the Linux kernel code, then it would be polite to point to the specific parts you are referring to, as it is an enormous repository and not everyone knows how to navigate it. Otherwise you give off the impression that you are just throwing links around to appear smart.

But having said that, with regards to the specific big.LITTLE issue of having different instructions per core type, isn’t the solution simply to avoid having that in the first place? IIRC, for Arm the cores have the same ISA on both the big and the LITTLE cores, so the ”AVX512 only exists on one of the core types” kind of problem you described should not even occur.
Wasn't that one of the reasons Intel gave up on AVX512 on Alder Lake? You could never have AVX512 and E cores enabled at the same time, iirc.
 

OneEng2

Senior member
Sep 19, 2022
719
961
106
Idk. This would need a very deep understanding by the scheduler of what a thread actually does, at runtime.
IMHO, nowadays schedulers more or less only look at the assigned priorities and some monitoring data like current processing demand.
I personally think that every form of categorization, allow-listing, or flagging by developers would be a failure, as a lot of developers will simply stamp their applications as being the most important ones in the universe 😄
I agree, but it might be possible ..... in the future if ...
should be very easy to flag core OS functions.........

beyond that what's needed is a solid framework and visibility for scheduling and priority

in an ideal world that could even be visible and tweakable by users... but the state of real software code is a total nightmare, which is why scheduling is still so opaque
You beat me to it.

Today, we simply tell the OS (in code) how "important" the code is. If we really want to get technical, we can tell the OS which specific core we would "prefer" to run it on. That's it.

What is needed is for threads to be able to set properties that stipulate enough information for an OS to determine on ANY CPU which cores are best used for which thread based on everything else that is going on AND what the properties of that thread are AND what the properties of the cores are.

Even then, this looks like a pretty complex task for the OS. I certainly believe this is where we are headed though.
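
For reference, the "importance" knobs that exist on Windows today boil down to priority, an ideal-processor hint, and the EcoQoS power-throttling hint that hybrid-aware schedulers already use to steer threads toward E cores. A minimal sketch using documented Win32 calls; whether a future scheduler would treat E and LPE cores differently on the same hint is pure speculation:

```cpp
// The per-thread hints a program can give Windows today: classic priority, a
// preferred ("ideal") core, and the EcoQoS efficiency hint. None of these lets
// the thread describe *what* it does; the scheduler still has to guess.
#include <windows.h>

void mark_as_background_worker() {
    HANDLE t = GetCurrentThread();

    // "How important is this thread"
    SetThreadPriority(t, THREAD_PRIORITY_BELOW_NORMAL);

    // "Which core we would prefer" (a hint, not a guarantee)
    SetThreadIdealProcessor(t, 0);

    // EcoQoS: favor efficiency over speed; on hybrid CPUs this is the hint
    // current schedulers use to park a thread on E (and presumably LPE) cores.
    THREAD_POWER_THROTTLING_STATE eco = {};
    eco.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    eco.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    eco.StateMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    SetThreadInformation(t, ThreadPowerThrottling, &eco, sizeof(eco));
}
```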
 

Doug S

Diamond Member
Feb 8, 2020
3,351
5,870
136
The kernel could hold a lock preventing rescheduling of that thread while it’s executing instruction types that are not available on other core types.

And if you're running some sort of number crunching task that's basically executing AVX512 instructions in a loop (along with loads & stores) for hours? That would be a great way to mount DoS attacks - just insert a pointless AVX512 instruction in your code every other cycle and your process is always active. Do that on all cores and you grind the whole machine to a halt...