Discussion Intel current and future Lakes & Rapids thread


DrMrLordX

Lifer
Apr 27, 2000
21,633
10,845
136
In ideal conditions they will likely surpass Cezanne, but in a strange way: either workloads with a strong emphasis on ST perf, or workloads with a strong emphasis on MT perf (10+ threads, great MT scaling). Anything in between will likely run better or more consistently on an 8+0 chip.

The idea of 2 big cores + 4-8 small cores is not bad as a power-efficient, all-purpose gaming CPU, since gaming only needs around 2 strong cores. It is going to lose vs Renoir in MT workloads, but if the heat and power are lower it is still a very interesting idea.
It may also give the iGPU more TDP budget.

To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores. For gaming that's a big no-no, and in certain not-so-"embarrassingly parallel" workloads where memory latency and intercore latency are an issue (such as audio production) that's also a bad approach.

I'm of the opinion that Alder Lake will struggle with latency whenever one application attempts to utilize both Golden Cove and Gracemont cores, unless the application has almost zero intercore/intercache traffic.
 

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores. For gaming that's a big no-no, and in certain not-so-"embarrassingly parallel" workloads where memory latency and intercore latency are an issue (such as audio production) that's also a bad approach.

I'm of the opinion that Alder Lake will struggle with latency whenever one application attempts to utilize both Golden Cove and Gracemont cores, unless the application has almost zero intercore/intercache traffic.

Would this inter-core latency be worse than the chiplet-to-chiplet or chiplet-to-controller (can't remember the proper term at the moment) latency for Zen 3? The reason I'm asking is that I had the same concerns regarding audio production. A few days ago I watched a review where the reviewer played back almost twice the number of tracks before glitching on a 5950X as on a 9900K. It was like 29 vs 57 tracks or something like that. He was running a very tight 32 ms buffer as well. I would expect Zen 3 to have the compute to do this, but I didn't think it would perform at 2x the 9900K due to its non-monolithic design.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores. For gaming that's a big no-no, and in certain not-so-"embarrassingly parallel" workloads where memory latency and intercore latency are an issue (such as audio production) that's also a bad approach.

I'm of the opinion that Alder Lake will struggle with latency whenever one application attempts to utilize both Golden Cove and Gracemont cores, unless the application has almost zero intercore/intercache traffic.

Alder Lake is monolithic.
 
Feb 17, 2020
100
245
116
To both posters: are we not forgetting intercore latency? Alder Lake will not feature cores sharing the same ring bus as equal partners, but instead separate dice likely connected via EMIB, with unknown latency penalties for any intercore communication between Golden Cove and Gracemont cores.

The Gracemont and Golden Cove cores are on the same die.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Would this inter-core latency be worse than the chiplet-to-chiplet or chiplet-to-controller (can't remember the proper term at the moment) latency for Zen 3? The reason I'm asking is that I had the same concerns regarding audio production. A few days ago I watched a review where the reviewer played back almost twice the number of tracks before glitching on a 5950X as on a 9900K. It was like 29 vs 57 tracks or something like that. He was running a very tight 32 ms buffer as well. I would expect Zen 3 to have the compute to do this, but I didn't think it would perform at 2x the 9900K due to its non-monolithic design.

Why do you think that inter-core latency is an issue at all for audio applications? Audible latencies are orders of magnitude above inter-core latencies... 32 ms is just a very long time span in this regard.
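To put rough numbers on it (the ~100 ns figure below is just an illustrative assumption for a cross-core round trip, not a measurement):

```python
# Compare an assumed ~100 ns inter-core round trip against a 32 ms audio buffer.
buffer_s = 32e-3   # 32 ms audio buffer
hop_s = 100e-9     # assumed inter-core round-trip latency (~100 ns)

hops_per_buffer = buffer_s / hop_s
print(f"{hops_per_buffer:,.0f} inter-core hops fit inside one 32 ms buffer")
```

Even if the real hop cost were several times higher, it would stay five-plus orders of magnitude below the buffer length.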
 

DrMrLordX

Lifer
Apr 27, 2000
21,633
10,845
136
Would this inter-core latency be worse than the chiplet-to-chiplet or chiplet-to-controller (can't remember the proper term at the moment) latency for Zen 3? The reason I'm asking is that I had the same concerns regarding audio production. A few days ago I watched a review where the reviewer played back almost twice the number of tracks before glitching on a 5950X as on a 9900K. It was like 29 vs 57 tracks or something like that. He was running a very tight 32 ms buffer as well. I would expect Zen 3 to have the compute to do this, but I didn't think it would perform at 2x the 9900K due to its non-monolithic design.

We won't know until the silicon is tested.

Alder Lake is monolithic.

Interesting, I thought it was 2-3 chiplets. Regardless, Summit Ridge and Pinnacle Ridge were also monolithic.

The Gracemont and Golden Cove cores are on the same die.

They aren't on the same bus segments, are they?

Why do you think that inter-core latency is an issue at all for audio applications?

We've had more than a few posters come through the CPU forum making the claim.
 
  • Like
Reactions: Tlh97

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
Why do you think that inter-core latency is an issue at all for audio applications? Audible latencies are orders of magnitude above inter-core latencies... 32 ms is just a very long time span in this regard.

Because I had read that previous Zen (before Zen 3) had issues with low buffer settings in multitrack audio. But as you mentioned, it does seem kind of silly when you consider the orders-of-magnitude difference between ns and ms! I believe that's 6 orders of magnitude, which of course is huge. A processor operating at 4 GHz can do quite a bit of computing in 32 ms. Shoot, when I used to program my old Atari 800 in assembly, in the time between one scan line finishing its sweep across the screen and the next line starting to draw, I could swap out color registers and make it look like the machine could display more than 4 colors at a time. As long as those colors were in the same horizontal scan line ;)
 
  • Like
Reactions: Tlh97 and NTMBK

dullard

Elite Member
May 21, 2001
25,066
3,415
126
About that thread scheduling complexity for Big/Little...

If you have a certain number of big/little cores it *seems* (I'm not an expert, admittedly) like the goal is to run all the various cores with as little context switching as possible. If a core working on threads is switching like mad then it should be allocated less (this was edited; I incorrectly posted "more" initially, sorry for the confusion) compute-intensive threads. If the scheduler has an idea of the compute power of each core, it should be able to optimize the core-to-thread assignments by finding the combination that results in the least context switching. Yeah, I know, easier said than accomplished, but this seems like an ideal problem for machine learning, as your rig would "learn" the best way to operate based on your usage patterns.
I'll start by saying I don't know exactly how Microsoft and Intel plan to do the scheduling. But, I can certainly speculate some possible simple and useful methods. Maybe none of these will happen, but I can certainly see great uses that don't require much effort at all with thread scheduling (some ideas actually make thread scheduling easier):

1) Windows gets its own little core(s). Now everything can be always-on. Now, the GUI thread is always instantly responsive (no more visible lag, mouse lag, or keyboard lag when the CPU is doing hard work).
2) Security gets its own little core(s). Virus / threat scanners can run 100% of the time on 100% of the files without affecting your main programs. Imagine the computer constantly performing facial recognition to verify that you are the correct person to use that account without any performance penalty on the other programs.
3) File system gets its own little core(s). Even better/faster data access. Full and proper encryption on everything without a performance hit.
4) Everything runs on the little cores for ultra-low power until you need to bring out the big guns.
5) AVX tasks are offloaded to the little cores so the big cores don't have to run at a slower speed (no more AVX offset).
6) AVX-512 threads are on the big cores and the other threads are on the little cores, so AVX-512 can be much more utilized. No more excuse to not include AVX-512 due to possible performance penalties on the rest of the threads (similar to #5, no more AVX frequency offset).
7) The thread schedule could be built around the priority system that programmers already give. https://docs.microsoft.com/en-us/dotnet/api/system.threading.threadpriority?view=netframework-4.7.2&f1url=?appId=Dev16IDEF1&l=EN-US&k=k(System.Threading.ThreadPriority);k(TargetFrameworkMoniker-.NETFramework,Version%3Dv4.7.2);k(DevLang-csharp)&rd=true
8) Threads using the big cores are now balanced. Workloads across, say, 8 cores can be split equally since Windows won't be sharing those cores. Right now you often have cores sitting idle, waiting for other cores to finish their task, because programmers can't know ahead of time what might be going on in the background. Thus, programmers never know how to best split up the workload. Moving the other tasks to the little cores makes the big cores better utilized, since they can each do exactly their full share of work in the same amount of time.
9) New/newer computer features run on the little cores. Something like the Apple touch bar on their laptops or Cortana always running, etc.
10) Higher and longer turbo frequencies on the big cores, because the tasks that would normally be heating them up are on other parts of the CPU.
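A minimal sketch of how idea #7 could look, assuming the scheduler simply maps the priorities programmers already declare onto core classes (the names mirror System.Threading.ThreadPriority; the big/little split point is a made-up assumption):

```python
# Toy sketch of idea #7: route threads to big or little cores based on the
# priority the programmer already declared. The priority names mirror
# System.Threading.ThreadPriority; the split point is a made-up assumption.
PRIORITY_RANK = {"Lowest": 0, "BelowNormal": 1, "Normal": 2,
                 "AboveNormal": 3, "Highest": 4}

def assign_core_class(priority: str) -> str:
    """Below-normal priorities go to little cores, the rest to big cores."""
    return "little" if PRIORITY_RANK[priority] < PRIORITY_RANK["Normal"] else "big"

print(assign_core_class("Lowest"))   # little
print(assign_core_class("Highest"))  # big
```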
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
6,203
11,909
136
To both posters: are we not forgetting intercore latency?
I did not forget about the probable latency problems, but they don't really apply to the 15 W TDP SKUs. Latency will be a subject for gaming-oriented SKUs.

The Gracemont and Golden Cove cores are on the same die.
They are on the same die, but the small cores are grouped in a cluster of 4 cores which is then connected to the ring bus (8 small cores -> 2 clusters). There is a real possibility this will lead to a noticeable increase in latency when looking across clusters. The measurements on Lakefield certainly indicate there is a latency penalty to pay when jumping from a big core to a small core.

We'll have to see how this scales on desktop chips and whether Intel made changes to the interconnect.
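If the clustered layout rumor holds, the latency structure could be modeled roughly like this (all nanosecond figures are placeholders purely to illustrate the extra cluster-boundary penalty, not Lakefield measurements):

```python
# Toy model of the rumored Alder Lake topology: 8 big cores on the ring, plus
# 8 little cores grouped into 2 clusters of 4. Latency numbers are placeholders.
BIG = [f"P{i}" for i in range(8)]
LITTLE_CLUSTERS = [[f"E{c}{i}" for i in range(4)] for c in range(2)]

def cluster_of(core: str):
    for idx, cluster in enumerate(LITTLE_CLUSTERS):
        if core in cluster:
            return idx
    return None  # big cores sit directly on the ring

def hop_latency_ns(a: str, b: str) -> int:
    ca, cb = cluster_of(a), cluster_of(b)
    if ca is not None and ca == cb:
        return 30                                   # inside one little-core cluster
    latency = 40                                    # baseline ring hop (placeholder)
    latency += 15 * sum(c is not None for c in (ca, cb))  # penalty per cluster boundary crossed
    return latency

print(hop_latency_ns("P0", "P1"))     # 40: big to big, ring only
print(hop_latency_ns("E00", "E01"))   # 30: within one cluster
print(hop_latency_ns("P0", "E10"))    # 55: one cluster boundary
print(hop_latency_ns("E00", "E10"))   # 70: two cluster boundaries
```

The point of the sketch is only that cross-cluster traffic pays twice the boundary cost of big-to-little traffic, which is the kind of asymmetry the Lakefield measurements hinted at.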
 
  • Like
Reactions: Tlh97

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
I'll start by saying I don't know exactly how Microsoft and Intel plan to do the scheduling. But, I can certainly speculate some possible simple and useful methods. Maybe none of these will happen, but I can certainly see great uses that don't require much effort at all with thread scheduling (some ideas actually make thread scheduling easier):

1) Windows gets its own little core(s). Now everything can be always-on. Now, the GUI thread is always instantly responsive (no more visible lag, mouse lag, or keyboard lag when the CPU is doing hard work).
2) Security gets its own little core(s). Virus / threat scanners can run 100% of the time on 100% of the files without affecting your main programs. Imagine the computer constantly performing facial recognition to verify that you are the correct person to use that account without any performance penalty on the other programs.
3) File system gets its own little core(s). Even better/faster data access. Full and proper encryption on everything without a performance hit.
4) Everything runs on the little cores for ultra-low power until you need to bring out the big guns.
5) AVX tasks are offloaded to the little cores so the big cores don't have to run at a slower speed (no more AVX offset).
6) AVX-512 threads are on the big cores and the other threads are on the little cores, so AVX-512 can be much more utilized. No more excuse to not include AVX-512 due to possible performance penalties on the rest of the threads (similar to #5, no more AVX frequency offset).
7) The thread schedule could be built around the priority system that programmers already give. https://docs.microsoft.com/en-us/dotnet/api/system.threading.threadpriority?view=netframework-4.7.2&f1url=?appId=Dev16IDEF1&l=EN-US&k=k(System.Threading.ThreadPriority);k(TargetFrameworkMoniker-.NETFramework,Version%3Dv4.7.2);k(DevLang-csharp)&rd=true
8) Threads using the big cores are now balanced. Workloads across, say, 8 cores can be split equally since Windows won't be sharing those cores. Right now you often have cores sitting idle, waiting for other cores to finish their task, because programmers can't know ahead of time what might be going on in the background. Thus, programmers never know how to best split up the workload. Moving the other tasks to the little cores makes the big cores better utilized, since they can each do exactly their full share of work in the same amount of time.
9) New/newer computer features run on the little cores. Something like the Apple touch bar on their laptops or Cortana always running, etc.
10) Higher and longer turbo frequencies on the big cores, because the tasks that would normally be heating them up are on other parts of the CPU.

I must admit that I have the same instinct to dedicate cores to certain functions. But I also wonder if a more complex, finer-grained multithreading scheduler would ultimately lead to a better-performing system. Things that determine system responsiveness, like mouse movements, keyboard input, scrolling, etc., could be assigned higher priority than more background-oriented compute tasks. As I wrote above, a really smart scheduler would optimize the system by keeping high-priority threads at the top of the execution queue while simultaneously assigning threads to cores to minimize context switching and maximize CPU usage.

It's certainly not an easy problem, but I think a well-written scheduler wouldn't even need to know the compute strength of each core. It would optimize based on certain system variables and available system resources. A similar parallel in programming might be current SSD controllers, which have become massively complex and smart over the past 10 years.
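As a toy illustration of that priority-queue idea (the priorities and thread names are invented; a real scheduler tracks far more state than this):

```python
# Keep interactive (high-priority) threads at the front of the run queue while
# batch work fills the remaining cores. Purely illustrative.
import heapq

def dispatch(threads, n_cores):
    """Pick which runnable threads get the cores this quantum.
    threads is a list of (priority, name); lower number = more urgent."""
    return [name for _, name in heapq.nsmallest(n_cores, threads)]

runnable = [(0, "ui-input"), (2, "render"), (5, "batch-encode"), (5, "av-scan")]
print(dispatch(runnable, 2))  # interactive work wins: ['ui-input', 'render']
```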

Exciting times.
 
  • Like
Reactions: Tlh97

andermans

Member
Sep 11, 2020
151
153
76
I'd also be concerned that reserving cores is a bad idea for a general-purpose computer. Depending on the workload, a reserved core is either going to be too weak or sit idle.

I think for a stable workload a good scheduler will be able to find the threads that need high-performance cores, but for interactivity that may take too long, and that is AFAIU where, on platforms like Android, the application can give some hints. Curious how well that hinting works in practice and whether we might see something similar on Windows.
 

firewolfsm

Golden Member
Oct 16, 2005
1,848
29
91
Optimizing homogeneous multicore performance would likely require moving threads from little to big cores periodically if keeping time is important to the program. That would entail a performance hit whether or not the scheduler is smart enough to do it (either cores will be waiting, or there will be a lot of cross-core communication holding things up). Heterogeneous multicore might be easier, as threads which don't saturate the big cores can just be moved down. In this case the big.LITTLE approach will benefit multicore performance, since the little cores will essentially free up TDP for the big cores to clock higher.
I'm sure they'll come up with super clever algorithms.
 
  • Like
Reactions: Tlh97 and Hulk

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
Heterogeneous multicore might be easier, as threads which don't saturate the big cores can just be moved down. In this case the big.LITTLE approach will benefit multicore performance, since the little cores will essentially free up TDP for the big cores to clock higher.
I'm sure they'll come up with super clever algorithms.

Bubble sort for a scheduler!
 

coercitiv

Diamond Member
Jan 24, 2014
6,203
11,909
136
Since everybody is focused on the scheduler, here's what Intel showed us when they launched Lakefield, which is a 1+4 hybrid. It seems they are differentiating between foreground and background tasks as an additional way of determining whether to engage the big cores or not.

[Attached image: Intel slide on Lakefield thread scheduling]
 
Last edited:
  • Like
Reactions: Elfear

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Optimizing homogeneous multicore performance would likely require moving threads from little to big cores periodically if keeping time is important to the program. That would entail a performance hit whether or not the scheduler is smart enough to do it (either cores will be waiting, or there will be a lot of cross-core communication holding things up). Heterogeneous multicore might be easier, as threads which don't saturate the big cores can just be moved down. In this case the big.LITTLE approach will benefit multicore performance, since the little cores will essentially free up TDP for the big cores to clock higher.
I'm sure they'll come up with super clever algorithms.

I'll start by saying I don't know exactly how Microsoft and Intel plan to do the scheduling. But, I can certainly speculate some possible simple and useful methods. Maybe none of these will happen, but I can certainly see great uses that don't require much effort at all with thread scheduling (some ideas actually make thread scheduling easier):

Not sure why you are speculating. The heterogeneous Windows scheduler is already implemented and is used on every device which features a heterogeneous core configuration. This has been the case for every ARM device since 2018.
The main algorithm continuously monitors whether a thread saturates its core and potentially moves threads up or down accordingly; that's it in a nutshell. Keep in mind that this is just one metric used for scheduling. The same metric is also used to determine core frequencies.
In any case, you can see precisely how the algorithm works if you do a few experiments with existing big.LITTLE devices. (Do not use Lakefield as a reference, as it has, contrary to the ARM devices, other HW-related scheduling restrictions.)
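In pseudo-Python, that nutshell rule might look like this (the utilization thresholds are invented for illustration; Windows' actual values and heuristics aren't public):

```python
# Nutshell version of the rule described above: sample each thread's core
# utilization over a window and migrate it up or down. The 0.9/0.3 thresholds
# are invented for illustration, not Windows' actual values.
def next_core_class(core_class: str, utilization: float) -> str:
    """utilization is the thread's core usage in [0, 1] over the last window."""
    if core_class == "little" and utilization > 0.9:
        return "big"      # thread saturates a little core: move it up
    if core_class == "big" and utilization < 0.3:
        return "little"   # thread barely loads a big core: move it down
    return core_class     # otherwise leave it where it is

print(next_core_class("little", 0.95))  # big
print(next_core_class("big", 0.10))     # little
print(next_core_class("big", 0.60))     # big
```

The same utilization sample can then feed the frequency governor, which is presumably why the two decisions are coupled.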
 
Last edited:

dullard

Elite Member
May 21, 2001
25,066
3,415
126
Not sure why you are speculating. The heterogeneous Windows scheduler is already implemented and is used on every device which features a heterogeneous core configuration. This has been the case for every ARM device since 2018.
I am speculating because I am not sure that a hastily done niche use of Windows on ARM represents the entire future of Windows on x86.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I am speculating because I am not sure that a hastily done niche use of Windows on ARM represents the entire future of Windows on x86.

Lol. There was a problem to solve, namely heterogeneous core scheduling, and it has been implemented. Besides, ARM big.LITTLE was the main driving factor behind significant research in the field of heterogeneous scheduling over the last 10 years. You can hardly find any research paper in this field that doesn't either reference ARM big.LITTLE or do its prototype implementation on an ARM system; this includes research done by Microsoft itself.
So spare me your "hastily done" rant.

For people who are interested, there is a 2018 article from Microsoft about the evolution of the Windows kernel, which also touches on how ARM64 support drove the inclusion of heterogeneous scheduling into the Windows kernel. The Windows kernel is indeed architecture-independent, as mentioned in the article as well.

One Kernel
 
Last edited:
  • Like
Reactions: Tlh97 and Hulk

Hulk

Diamond Member
Oct 9, 1999
4,225
2,015
136
Lol. There was a problem to solve, namely heterogeneous core scheduling, and it has been implemented. Besides, ARM big.LITTLE was the main driving factor behind significant research in the field of heterogeneous scheduling over the last 10 years. You can hardly find any research paper in this field that doesn't either reference ARM big.LITTLE or do its prototype implementation on an ARM system; this includes research done by Microsoft itself.
So spare me your "hastily done" rant.

For people who are interested, there is a 2018 article from Microsoft about the evolution of the Windows kernel, which also touches on how ARM64 support drove the inclusion of heterogeneous scheduling into the Windows kernel. The Windows kernel is indeed architecture-independent, as mentioned in the article as well.

One Kernel

Good info. Thanks. When a thread moves "up" if saturated, I assume that means up in compute priority?
 

cortexa99

Senior member
Jul 2, 2018
319
505
136
Price:

Some sellers have started offering the 11700/11700F slightly cheaper than the 5600X in my country. IMO the price/perf ratio is very good (*just for the CPU; now all we need is a beefy mobo, which might not be cheap*).
 
Feb 17, 2020
100
245
116
They are on the same die, but the small cores are grouped in a cluster of 4 cores which is then connected to the ring bus (8 small cores -> 2 clusters). There is a real possibility this will lead to a noticeable increase in latency when looking across clusters. The measurements on Lakefield certainly indicate there is a latency penalty to pay when jumping from a big core to a small core.

I agree that there will likely be a latency penalty, but the original post claimed they were on separate dies, which is obviously false.
 

dullard

Elite Member
May 21, 2001
25,066
3,415
126
Interesting how you essentially just linked to item #7 on my list. From your article:
(As an aside, you can also programmatically mark your thread as unimportant which will make it run on the LITTLE core.)
You are brushing aside the vast differences in upcoming CPUs, as if past ARM chips had to cover every possible situation (such as a mix of AVX capabilities on different cores of the same chip). You also ignored that Microsoft later decided to toss some of that scheduling out and just go with reserved cores and reserved threads for the operating system on the Xbox Series X (item #1 on my list).
 
Last edited:
  • Like
Reactions: Tlh97