Discussion Intel current and future Lakes & Rapids thread


DrMrLordX

Lifer
Apr 27, 2000
21,633
10,845
136
In ideal conditions they will likely surpass Cezanne, but in a strange way: either workloads with a strong emphasis on ST perf, or workloads with a strong emphasis on MT perf (10+ threads, great MT scaling). Anything in between will likely run better or more consistently on an 8+0 chip.

The idea of 2 big cores + 4-8 small cores is not bad as a power-efficient, all-purpose gaming CPU, since gaming only needs around 2 strong cores. It is going to lose vs Renoir in MT workloads, but if the heat and power are lower it is still a very interesting idea.
It may also give the iGPU more TDP budget.

To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores. For gaming that's a big no-no, and in certain not-so-"embarrassingly parallel" workloads where memory latency and intercore latency are an issue (such as audio production) that's also a bad approach.

I'm of the opinion that Alder Lake will struggle with latency whenever one application attempts to utilize both Golden Cove and Gracemont cores, unless the application has almost zero intercore/intercache traffic.
 

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores. For gaming that's a big no-no, and in certain not-so-"embarrassingly parallel" workloads where memory latency and intercore latency are an issue (such as audio production) that's also a bad approach.

I'm of the opinion that Alder Lake will struggle with latency whenever one application attempts to utilize both Golden Cove and Gracemont cores, unless the application has almost zero intercore/intercache traffic.

Would this inter-core latency be worse than the chiplet-to-chiplet or chiplet-to-controller (can't remember the proper term at the moment) latency for Zen 3? The reason I'm asking is that I had the same concerns regarding audio production. A few days ago I watched a review where the reviewer played back almost twice the number of tracks before glitching on a 5950X as on a 9900K. It was like 29 vs 57 tracks or something like that. He was running a very tight 32 ms buffer as well. I would expect Zen 3 to have the compute to do this, but I didn't think it would perform at 2x the 9900K due to its non-monolithic design.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores. For gaming that's a big no-no, and in certain not-so-"embarrassingly parallel" workloads where memory latency and intercore latency are an issue (such as audio production) that's also a bad approach.

I'm of the opinion that Alder Lake will struggle with latency whenever one application attempts to utilize both Golden Cove and Gracemont cores, unless the application has almost zero intercore/intercache traffic.

Alder Lake is monolithic.
 
Feb 17, 2020
100
245
116
To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores.

The Gracemont and Golden Cove cores are on the same die.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Would this inter-core latency be worse than the chiplet-to-chiplet or chiplet-to-controller (can't remember the proper term at the moment) latency for Zen 3? The reason I'm asking is that I had the same concerns regarding audio production. A few days ago I watched a review where the reviewer played back almost twice the number of tracks before glitching on a 5950X as on a 9900K. It was like 29 vs 57 tracks or something like that. He was running a very tight 32 ms buffer as well. I would expect Zen 3 to have the compute to do this, but I didn't think it would perform at 2x the 9900K due to its non-monolithic design.

Why do you think that inter-core latency is an issue at all for audio applications? Audible latencies are orders of magnitude above inter-core latencies... 32 ms is just a very long time span in this regard.
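To put rough numbers on it (the ~100 ns figure below is just an illustrative assumption for a cross-core round trip, not a measurement):

```python
# Compare an assumed ~100 ns inter-core round trip against a 32 ms audio buffer.
buffer_s = 32e-3   # 32 ms audio buffer
hop_s = 100e-9     # assumed inter-core round-trip latency (~100 ns)

hops_per_buffer = buffer_s / hop_s
print(f"{hops_per_buffer:,.0f} inter-core hops fit inside one 32 ms buffer")
```

Even if the real hop cost were several times higher, it would stay five-plus orders of magnitude below the buffer length.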
 

DrMrLordX

Lifer
Apr 27, 2000
21,633
10,845
136
Would this inter-core latency be worse than the chiplet-to-chiplet or chiplet-to-controller (can't remember the proper term at the moment) latency for Zen 3? The reason I'm asking is that I had the same concerns regarding audio production. A few days ago I watched a review where the reviewer played back almost twice the number of tracks before glitching on a 5950X as on a 9900K. It was like 29 vs 57 tracks or something like that. He was running a very tight 32 ms buffer as well. I would expect Zen 3 to have the compute to do this, but I didn't think it would perform at 2x the 9900K due to its non-monolithic design.

We won't know until the silicon is tested.

Alder Lake is monolithic.

Interesting, I thought it was 2-3 chiplets. Regardless, Summit Ridge and Pinnacle Ridge were also monolithic.

The Gracemont and Golden Cove cores are on the same die.

They aren't on the same bus segments, are they?

Why do you think that inter-core latency is an issue at all for audio applications?

We've had more than a few posters come through the CPU forum making the claim.
 
  • Like
Reactions: Tlh97

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
Why do you think that inter-core latency is an issue at all for audio applications? Audible latencies are orders of magnitude above inter-core latencies... 32 ms is just a very long time span in this regard.

Because I had read that previous Zen (before Zen 3) had issues with low buffer settings in multitrack audio. But as you mentioned, it does seem kind of silly when you consider the orders-of-magnitude difference between ns and ms! I believe that's 6 orders of magnitude, which of course is huge. A processor operating at 4 GHz can do quite a bit of computing in 32 ms. Shoot, when I used to program my old Atari 800 in assembly, in the time between one scan line finishing its sweep across the screen and the next line starting to draw, I could swap out color registers and make it look like the machine could display more than 4 colors at a time. As long as those colors were in the same horizontal scan line ;)
 
  • Like
Reactions: Tlh97 and NTMBK

dullard

Elite Member
May 21, 2001
25,066
3,415
126
About that thread scheduling complexity for Big/Little...

If you have a certain number of big/little cores it *seems* (I'm not an expert, admittedly) like the goal is to run all the various cores with as little context switching as possible. If a core working on threads is switching like mad then it should be allocated less (this was edited; I incorrectly posted "more" initially, sorry for the confusion) compute-intensive threads. If the scheduler has an idea of the compute power of each core, it should be able to optimize the core-to-thread assignments by finding the combination that results in the least context switching. Yeah, I know, easier said than accomplished, but this seems like an ideal problem for machine learning, as your rig would "learn" the best way to operate based on your usage patterns.
I'll start by saying I don't know exactly how Microsoft and Intel plan to do the scheduling. But, I can certainly speculate some possible simple and useful methods. Maybe none of these will happen, but I can certainly see great uses that don't require much effort at all with thread scheduling (some ideas actually make thread scheduling easier):

1) Windows gets its own little core(s). Now everything can be always-on. Now, the GUI thread is always instantly responsive (no more visible lag, mouse lag, or keyboard lag when the CPU is doing hard work).
2) Security gets its own little core(s). Virus / threat scanners can run 100% of the time on 100% of the files without affecting your main programs. Imagine the computer constantly performing facial recognition to verify that you are the correct person to use that account without any performance penalty on the other programs.
3) File system gets its own little core(s). Even better/faster data access. Full and proper encryption on everything without a performance hit.
4) Everything runs on the little cores for ultra-low power until you need to bring out the big guns.
5) AVX tasks are offloaded to the little cores so the big cores don't have to run at a slower speed (no more AVX offset).
6) AVX-512 threads are on the big cores and the other threads are on the little cores, so AVX-512 can be much more utilized. No more excuse to not include AVX-512 due to possible performance penalties on the rest of the threads (similar to #5, no more AVX frequency offset).
7) The thread schedule could be built around the priority system that programmers already give. https://docs.microsoft.com/en-us/dotnet/api/system.threading.threadpriority?view=netframework-4.7.2&f1url=?appId=Dev16IDEF1&l=EN-US&k=k(System.Threading.ThreadPriority);k(TargetFrameworkMoniker-.NETFramework,Version%3Dv4.7.2);k(DevLang-csharp)&rd=true
8) Threads using the big cores are now balanced. Workloads across, say, 8 cores can be split equally since Windows won't be sharing those cores. Right now you often have cores sitting idle, waiting for other cores to finish their task, because programmers can't know ahead of time what might be going on in the background. Thus, programmers never know how to best split up the workload. Moving the other tasks to the little cores makes the big cores better utilized, since they can each do exactly their full share of work in the same amount of time.
9) New/newer computer features run on the little cores. Something like the Apple touch bar on their laptops or Cortana always running, etc.
10) Higher and longer turbo frequencies on the big cores, because the tasks that would normally be heating them up are on other parts of the CPU.
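A minimal sketch of how idea #7 could look, assuming the scheduler simply maps the priorities programmers already declare onto core classes (the names mirror System.Threading.ThreadPriority; the big/little split point is a made-up assumption):

```python
# Toy sketch of idea #7: route threads to big or little cores based on the
# priority the programmer already declared. The priority names mirror
# System.Threading.ThreadPriority; the split point is a made-up assumption.
PRIORITY_RANK = {"Lowest": 0, "BelowNormal": 1, "Normal": 2,
                 "AboveNormal": 3, "Highest": 4}

def assign_core_class(priority: str) -> str:
    """Below-normal priorities go to little cores, the rest to big cores."""
    return "little" if PRIORITY_RANK[priority] < PRIORITY_RANK["Normal"] else "big"

print(assign_core_class("Lowest"))   # little
print(assign_core_class("Highest"))  # big
```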
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
6,203
11,909
136
To both posters: are we not forgetting intercore latency?
I did not forget about the probable latency problems, but they don't really apply to the 15 W TDP SKUs. Latency will be a subject for gaming-oriented SKUs.

The Gracemont and Golden Cove cores are on the same die.
They are on the same die, but the small cores are grouped in a cluster of 4 cores which is then connected to the ring bus (8 small cores -> 2 clusters). There is a real possibility this will lead to a noticeable increase in latency when looking across clusters. The measurements on Lakefield certainly indicate there is a latency penalty to pay when jumping from a big core to a small core.

We'll have to see how this scales on desktop chips and whether Intel made changes to the interconnect.
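If the clustered layout rumor holds, the latency structure could be modeled roughly like this (all nanosecond figures are placeholders purely to illustrate the extra cluster-boundary penalty, not Lakefield measurements):

```python
# Toy model of the rumored Alder Lake topology: 8 big cores on the ring, plus
# 8 little cores grouped into 2 clusters of 4. Latency numbers are placeholders.
BIG = [f"P{i}" for i in range(8)]
LITTLE_CLUSTERS = [[f"E{c}{i}" for i in range(4)] for c in range(2)]

def cluster_of(core: str):
    for idx, cluster in enumerate(LITTLE_CLUSTERS):
        if core in cluster:
            return idx
    return None  # big cores sit directly on the ring

def hop_latency_ns(a: str, b: str) -> int:
    ca, cb = cluster_of(a), cluster_of(b)
    if ca is not None and ca == cb:
        return 30                                   # inside one little-core cluster
    latency = 40                                    # baseline ring hop (placeholder)
    latency += 15 * sum(c is not None for c in (ca, cb))  # penalty per cluster boundary crossed
    return latency

print(hop_latency_ns("P0", "P1"))     # 40: big to big, ring only
print(hop_latency_ns("E00", "E01"))   # 30: within one cluster
print(hop_latency_ns("P0", "E10"))    # 55: one cluster boundary
print(hop_latency_ns("E00", "E10"))   # 70: two cluster boundaries
```

The point of the sketch is only that cross-cluster traffic pays twice the boundary cost of big-to-little traffic, which is the kind of asymmetry the Lakefield measurements hinted at.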
 
  • Like
Reactions: Tlh97

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
I'll start by saying I don't know exactly how Microsoft and Intel plan to do the scheduling. But, I can certainly speculate some possible simple and useful methods. Maybe none of these will happen, but I can certainly see great uses that don't require much effort at all with thread scheduling (some ideas actually make thread scheduling easier):

1) Windows gets its own little core(s). Now everything can be always-on. Now, the GUI thread is always instantly responsive (no more visible lag, mouse lag, or keyboard lag when the CPU is doing hard work).
2) Security gets its own little core(s). Virus / threat scanners can run 100% of the time on 100% of the files without affecting your main programs. Imagine the computer constantly performing facial recognition to verify that you are the correct person to use that account without any performance penalty on the other programs.
3) File system gets its own little core(s). Even better/faster data access. Full and proper encryption on everything without a performance hit.
4) Everything runs on the little cores for ultra-low power until you need to bring out the big guns.
5) AVX tasks are offloaded to the little cores so the big cores don't have to run at a slower speed (no more AVX offset).
6) AVX-512 threads are on the big cores and the other threads are on the little cores, so AVX-512 can be much more utilized. No more excuse to not include AVX-512 due to possible performance penalties on the rest of the threads (similar to #5, no more AVX frequency offset).
7) The thread schedule could be built around the priority system that programmers already give. https://docs.microsoft.com/en-us/dotnet/api/system.threading.threadpriority?view=netframework-4.7.2&f1url=?appId=Dev16IDEF1&l=EN-US&k=k(System.Threading.ThreadPriority);k(TargetFrameworkMoniker-.NETFramework,Version%3Dv4.7.2);k(DevLang-csharp)&rd=true
8) Threads using the big cores are now balanced. Workloads across, say, 8 cores can be split equally since Windows won't be sharing those cores. Right now you often have cores sitting idle, waiting for other cores to finish their task, because programmers can't know ahead of time what might be going on in the background. Thus, programmers never know how to best split up the workload. Moving the other tasks to the little cores makes the big cores better utilized, since they can each do exactly their full share of work in the same amount of time.
9) New/newer computer features run on the little cores. Something like the Apple touch bar on their laptops or Cortana always running, etc.
10) Higher and longer turbo frequencies on the big cores, because the tasks that would normally be heating them up are on other parts of the CPU.

I must admit that I have the same instinct to dedicate cores to certain functions. But I also wonder if a more complex, finer-grained multithreading scheduler would ultimately lead to a better-performing system. Things that determine system responsiveness, like mouse movements, keyboard input, scrolling, etc., could be assigned higher priority than more background-oriented compute tasks. As I wrote above, a really smart scheduler would optimize the system by keeping high-priority threads at the top of the execution queue while simultaneously assigning threads to cores to minimize context switching and maximize CPU usage.

It's certainly not an easy problem, but I think a well-written scheduler wouldn't even need to know the compute strength of each core. It would optimize based on certain system variables and available system resources. A similar parallel in programming might be current SSD controllers, which have become massively complex and smart over the past 10 years.
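As a toy illustration of that priority-queue idea (the priorities and thread names are invented; a real scheduler tracks far more state than this):

```python
# Keep interactive (high-priority) threads at the front of the run queue while
# batch work fills the remaining cores. Purely illustrative.
import heapq

def dispatch(threads, n_cores):
    """Pick which runnable threads get the cores this quantum.
    threads is a list of (priority, name); lower number = more urgent."""
    return [name for _, name in heapq.nsmallest(n_cores, threads)]

runnable = [(0, "ui-input"), (2, "render"), (5, "batch-encode"), (5, "av-scan")]
print(dispatch(runnable, 2))  # interactive work wins: ['ui-input', 'render']
```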

Exciting times.
 
  • Like
Reactions: Tlh97

andermans

Member
Sep 11, 2020
151
153
76
I'd also be concerned that reserving cores is a bad idea for a general-purpose computer. Depending on the workload, a reserved core is either going to be too weak or sit idle.

I think for a stable workload a good scheduler will be able to find the threads that need high-performance cores, but for interactivity that may take too long, and that is AFAIU where, on platforms like Android, the application can give some hints. Curious how well that hinting works in practice and whether we might see something similar on Windows.
 

firewolfsm

Golden Member
Oct 16, 2005
1,848
29
91
Optimizing homogeneous multicore performance would likely require moving threads from little to big cores periodically if keeping time is important to the program. That would entail a performance hit whether or not the scheduler is smart enough to do it (either cores will be waiting, or there will be a lot of cross-core communication holding things up). Heterogeneous multicore might be easier, as threads which don't saturate the big cores can just be moved down. In this case the big.LITTLE approach will benefit multicore performance, since the little cores will essentially free up TDP for the big cores to clock higher.
I'm sure they'll come up with super clever algorithms.
 
  • Like
Reactions: Tlh97 and Hulk

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
Heterogeneous multicore might be easier, as threads which don't saturate the big cores can just be moved down. In this case the big.LITTLE approach will benefit multicore performance, since the little cores will essentially free up TDP for the big cores to clock higher.
I'm sure they'll come up with super clever algorithms.

Bubble sort for a scheduler!
 

coercitiv

Diamond Member
Jan 24, 2014
6,203
11,909
136
Since everybody is focused on the scheduler, here's what Intel showed us when they launched Lakefield, which is a 1+4 hybrid. It seems they are differentiating between foreground and background tasks as an additional way of determining whether to engage the big cores or not.

[Attached image: Intel slide on Lakefield thread scheduling]
 
Last edited:
  • Like
Reactions: Elfear

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Optimizing homogeneous multicore performance would likely require moving threads from little to big cores periodically if keeping time is important to the program. That would entail a performance hit whether or not the scheduler is smart enough to do it (either cores will be waiting, or there will be a lot of cross-core communication holding things up). Heterogeneous multicore might be easier, as threads which don't saturate the big cores can just be moved down. In this case the big.LITTLE approach will benefit multicore performance, since the little cores will essentially free up TDP for the big cores to clock higher.
I'm sure they'll come up with super clever algorithms.

I'll start by saying I don't know exactly how Microsoft and Intel plan to do the scheduling. But, I can certainly speculate some possible simple and useful methods. Maybe none of these will happen, but I can certainly see great uses that don't require much effort at all with thread scheduling (some ideas actually make thread scheduling easier):

Not sure why you are speculating. The heterogeneous Windows scheduler is already implemented and is used on every device which features a heterogeneous core configuration. This has been the case for every ARM device since 2018.
The main algorithm continuously monitors whether a thread saturates its core and potentially moves threads up or down accordingly; that's it in a nutshell. Keep in mind that this is just one metric used for scheduling. The same metric is also used to determine core frequencies.
In any case, you can see precisely how the algorithm works if you do a few experiments with existing big.LITTLE devices. (Do not use Lakefield as a reference, as it has, contrary to the ARM devices, other HW-related scheduling restrictions.)
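In pseudo-Python, that nutshell rule might look like this (the utilization thresholds are invented for illustration; Windows' actual values and heuristics aren't public):

```python
# Nutshell version of the rule described above: sample each thread's core
# utilization over a window and migrate it up or down. The 0.9/0.3 thresholds
# are invented for illustration, not Windows' actual values.
def next_core_class(core_class: str, utilization: float) -> str:
    """utilization is the thread's core usage in [0, 1] over the last window."""
    if core_class == "little" and utilization > 0.9:
        return "big"      # thread saturates a little core: move it up
    if core_class == "big" and utilization < 0.3:
        return "little"   # thread barely loads a big core: move it down
    return core_class     # otherwise leave it where it is

print(next_core_class("little", 0.95))  # big
print(next_core_class("big", 0.10))     # little
print(next_core_class("big", 0.60))     # big
```

The same utilization sample can then feed the frequency governor, which is presumably why the two decisions are coupled.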
 
Last edited:

dullard

Elite Member
May 21, 2001
25,066
3,415
126
Not sure why you are speculating. The heterogeneous Windows scheduler is already implemented and is used on every device which features a heterogeneous core configuration. This has been the case for every ARM device since 2018.
I am speculating because I am not sure that a hastily done niche use of Windows on ARM represents the entire future of Windows on x86.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I am speculating because I am not sure that a hastily done niche use of Windows on ARM represents the entire future of Windows on x86.

Lol. There was a problem to solve, namely heterogeneous core scheduling, and it has been implemented. Besides, ARM big.LITTLE was the main driving factor behind significant research in the field of heterogeneous scheduling over the last 10 years. You can hardly find any research paper in this field that doesn't either reference ARM big.LITTLE or do its prototype implementation on an ARM system; this includes research done by Microsoft itself.
So spare me your "hastily done" rant.

For people who are interested, there is a 2018 article from Microsoft about the evolution of the Windows kernel, which also touches on how ARM64 support drove the inclusion of heterogeneous scheduling into the Windows kernel. The Windows kernel is indeed architecture-independent, as mentioned in the article as well.

One Kernel
 
Last edited:
  • Like
Reactions: Tlh97 and Hulk

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
Lol. There was a problem to solve, namely heterogeneous core scheduling, and it has been implemented. Besides, ARM big.LITTLE was the main driving factor behind significant research in the field of heterogeneous scheduling over the last 10 years. You can hardly find any research paper in this field that doesn't either reference ARM big.LITTLE or do its prototype implementation on an ARM system; this includes research done by Microsoft itself.
So spare me your "hastily done" rant.

For people who are interested, there is a 2018 article from Microsoft about the evolution of the Windows kernel, which also touches on how ARM64 support drove the inclusion of heterogeneous scheduling into the Windows kernel. The Windows kernel is indeed architecture-independent, as mentioned in the article as well.

One Kernel

Good info. Thanks. When a thread moves "up" if saturated, I assume that means up in compute priority?
 

cortexa99

Senior member
Jul 2, 2018
319
505
136
Price:

Some sellers have started offering the 11700/11700F slightly cheaper than the 5600X in my country. IMO the price/perf ratio is very good (*just for the CPU; now all we need is a beefy mobo, which might not be cheap*).
 
Feb 17, 2020
100
245
116
They are on the same die, but the small cores are grouped in a cluster of 4 cores which is then connected to the ring bus (8 small cores -> 2 clusters). There is a real possibility this will lead to a noticeable increase in latency when looking across clusters. The measurements on Lakefield certainly indicate there is a latency penalty to pay when jumping from a big core to a small core.

I agree that there will likely be a latency penalty, but the original post claimed they were on separate dies, which is obviously false.
 

dullard

Elite Member
May 21, 2001
25,066
3,415
126
Interesting how you essentially just linked to item #7 on my list. From your article:
(As an aside, you can also programmatically mark your thread as unimportant which will make it run on the LITTLE core.)
You are brushing aside the vast differences in upcoming CPUs, as if past ARM chips had to cover every possible situation (such as a mix of AVX capabilities on different cores of the same chip). You also ignored that Microsoft later decided to toss some of that scheduling out and just go with reserved cores and reserved threads for the operating system on the Xbox Series X (item #1 on my list).
 
Last edited:
  • Like
Reactions: Tlh97