Question Alder Lake - Official Thread

Hulk

Diamond Member
Oct 9, 1999
4,191
1,975
136
Intel sells the E-cores in the way shown in these pictures.
Compared to a CPU without E-cores, it allows your foreground app to keep running at full speed while giving the background app a specified amount of compute that's always available to it, depending on the number of E-cores.

The way you would keep your foreground app running at full speed on a CPU without E-cores would be to run the foreground app at real-time priority and the background app at idle priority, meaning that the background app might never get any CPU cycles.

This is NOT mitigating potential responsiveness problems caused by the presence of E-cores; this is your foreground app getting the use of the full CPU (compared to a CPU without E-cores), meaning there are no slowdowns.
[Images: M8cB7Sr.jpg, bIwC9im.jpg (Intel slides)]

These slides are interesting, and as usual it's a rabbit hole when you start to analyze them because so much of the testing procedure is left out, specifically the length of the test scripts.

Let's look at the parallel processing example in the 2nd slide. Let's also assume that the video encoding task was selected so that the 8 E cores finish it more or less when the 8 P cores finish the two serial tasks. This of course is a best-case scenario for the TD and requires no "smarts" at all: foreground on the P's, background on the E's. Perfect for the current behavior of Alder Lake on Windows 11. The graph shows ADL 47% faster than Rocket Lake, which is expected, as ADL has stronger big cores than Rocket Lake plus 8 additional E cores. No surprise there.

It's also pretty easy to understand why, when the tasks are completed in parallel, Rocket is a little closer to ADL. These apps aren't utilizing the E's fully, so most of the gains are probably coming from the P's and the stronger memory subsystem, i.e., the apps are not multithreaded enough to effectively use 8+ cores.

Now imagine that the video encoding portion of the parallel processing task was twice as long. Under current ADL behavior, after finishing the foreground task, if the user didn't switch to the background task the P's would remain idle until the E's finished. There is a video encoding test length where Rocket Lake would actually win, because this eventually comes down to 8 E's vs. 8 Cypress Cove cores.
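
To put rough numbers on that break-even point, here is a minimal Python sketch. It is a toy model, not a measurement: the throughput rates (`p_rate`, `e_rate`, `rkl_rate`) and the work units are made-up assumptions, and it assumes Rocket Lake spreads both jobs across one pool of cores while ADL keeps the background encode strictly on the E cores.

```python
# Toy model of the break-even point described above. All rates and work
# amounts are hypothetical placeholders, not benchmark results.

def adl_partitioned_time(fg_work, bg_work, p_rate, e_rate):
    """Finish time when foreground stays on the P cores and background on the E cores."""
    return max(fg_work / p_rate, bg_work / e_rate)


def rkl_shared_time(fg_work, bg_work, rkl_rate):
    """Finish time when a single pool of identical cores works through both jobs."""
    return (fg_work + bg_work) / rkl_rate


def break_even_bg_work(fg_work, rkl_rate, e_rate):
    """Background work at which both finish together; beyond it the shared pool wins.

    Assumes p_rate >= rkl_rate, so the E cores are the bottleneck at that point.
    Returns None if the E cores alone match or beat the shared pool.
    """
    if rkl_rate <= e_rate:
        return None
    return fg_work * e_rate / (rkl_rate - e_rate)


if __name__ == "__main__":
    fg, p, e, rkl = 100.0, 1.3, 0.6, 1.0  # assumed values
    print("break-even background work:", break_even_bg_work(fg, rkl, e))
    for bg in (100.0, 300.0):
        print(bg, adl_partitioned_time(fg, bg, p, e), rkl_shared_time(fg, bg, rkl))
```

With a short encode the partitioned ADL setup finishes first in this model; once the encode grows past the break-even amount, the idle P cores cost more than the E cores add.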

Another bad scenario for ADL is if in the first serial processing test after starting the tasks the user puts a web browser in the foreground. All of these tasks now go to the E's and Rocket Lake wins.

This would also happen in the parallel processing example if a web browser was the foreground app after starting the workload. Rocket would win against the 8 E's as the P's sat idle, or nearly so as I have seen on my computer when doing these same tasks.

Another bad situation for ADL/Intel is having three tasks running simultaneously. Now the P's could be "computing" chrome while the E's are compressing video and processing RAW photo files. A very inefficient use of compute that I run into all the time.

Is ADL superior to RKL? Yes, of course. But there is some user intervention necessary to extract that superiority. Intel kind of manipulated this test, it looks that way to me anyway, in order to put ADL in a better light than RKL.

The fix seems so simple I must be wrong in my thinking. Simply keep the P's assigned to the foreground task but ALSO allow them to work on background tasks while keeping the foreground task as the priority. So if you are editing a photo and need the P's for 5 seconds to process a filter on a photo they would immediately move to the foreground app, process the filter, and then move back to the background apps until the foreground app needed them again.
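
Here is a rough Python sketch of that idea, comparing "P cores only ever serve the foreground" against "P cores help with background work whenever the foreground is idle". This is only a tick-by-tick toy simulation under my own assumptions: the function name, the rates, and the busy/idle pattern are all hypothetical, and it says nothing about how Windows or the Thread Director actually schedule.

```python
# Toy simulation of the proposed policy. Rates and the foreground busy/idle
# pattern are hypothetical; this is not how any real scheduler is implemented.

def background_finish_tick(fg_busy, bg_work, p_rate, e_rate, lend_idle_p):
    """Return the tick at which the background job finishes, or None if it never does.

    fg_busy     -- sequence of booleans, True when the foreground app needs the P cores
    lend_idle_p -- if True, idle P cores work on the background job
    """
    remaining = bg_work
    for tick, busy in enumerate(fg_busy, start=1):
        rate = e_rate                  # E cores always work on the background job
        if lend_idle_p and not busy:
            rate += p_rate             # P cores help only while the foreground is idle
        remaining -= rate
        if remaining <= 0:
            return tick
    return None


# Foreground needs the P cores for 5 ticks (e.g. applying a filter), then idles for 15.
pattern = ([True] * 5 + [False] * 15) * 5

strict = background_finish_tick(pattern, bg_work=240, p_rate=8, e_rate=4, lend_idle_p=False)
lending = background_finish_tick(pattern, bg_work=240, p_rate=8, e_rate=4, lend_idle_p=True)
print("strict partition:", strict, "ticks; lending idle P cores:", lending, "ticks")
```

In both cases the foreground gets the P cores on every tick it asks for them, so its responsiveness is the same in this model; only the background finish time changes.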

As I've written before, I think this issue, even without the adjustment I mentioned above, is mitigated greatly with Raptor Lake and 16 E cores. That is basically a 3950X working on background tasks. I could live (happily) with that. But with the 12700K and only one Gracemont cluster this behavior is silly.
 

DrMrLordX

Lifer
Apr 27, 2000
21,583
10,785
136
The wide variety of tests ensures that everyone can look up what matters to them

Yes, but how many of the people simply pasting the geometric mean are actually doing that, or encouraging anyone else to do that?

Phoronix have their test suite and that's it, whatever takeaways you make from that test suite should factor in the tests in that test suite. That's it.

Exactly, though the geometric mean posted at the end of the test suite is pretty useless unless your workload mimics the test suite almost exactly.
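
A quick illustration of that point in Python. The scores below are invented relative results (CPU A divided by CPU B) and the weights are a hypothetical personal workload mix; none of this is Phoronix data. It only shows how a suite-wide geometric mean and a workload-weighted one can point in opposite directions.

```python
import math
from statistics import geometric_mean

# Hypothetical relative scores (CPU A / CPU B) per benchmark; not real data.
suite = {"compile": 1.30, "render": 1.25, "encode": 0.90, "game": 0.95}

# Hypothetical share of time one user spends in each workload.
my_weights = {"game": 0.8, "encode": 0.2}


def weighted_geomean(ratios, weights):
    """Geometric mean of the ratios, weighted by how much each workload matters."""
    total = sum(weights.values())
    log_sum = sum(w * math.log(ratios[name]) for name, w in weights.items())
    return math.exp(log_sum / total)


print("suite geometric mean:  ", round(geometric_mean(suite.values()), 3))
print("my weighted geo. mean: ", round(weighted_geomean(suite, my_weights), 3))
```

With these made-up numbers the suite average favors CPU A, while the user who mostly games and sometimes encodes would be slightly better off with CPU B.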
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
Exactly, though the geometric mean posted at the end of the test suite is pretty useless unless your workload mimics the test suite almost exactly.

Pretty much the same with all CPU review benchmark suites. The average is kind of useless to anyone, because no one matches their use case to a suite of benchmarks.

Which is why I only look at the things I actually do that load the CPU significantly. Which is Gaming and Video Encoding (very distant second, used to be higher when I recorded/encoded a lot of OTA TV).

The other home computing stuff (Web, Office Suite, Media consumption), doesn't even really tax a 10 year old 4 thread CPU.
 

TheELF

Diamond Member
Dec 22, 2012
3,967
720
126
These slides are interesting, and as usual it's a rabbit hole when you start to analyze them because so much of the testing procedure is left out, specifically the length of the test scripts.
You have to realize that this is not being done for best performance or for best efficiency.
This is purely an Apple thing of providing the best user experience: it's about the noob streamer not losing FPS or dropping frames on the recording without having to do anything special, and it's about the noob content creator not having to wait any longer importing and exporting pics while converting their video.
Not feeling your system ever being bogged down is far more important than actual performance and if actual performance is still pretty high it's a win win.

The fix seems so simple I must be wrong in my thinking. Simply keep the P's assigned to the foreground task but ALSO allow them to work on background tasks while keeping the foreground task as the priority. So if you are editing a photo and need the P's for 5 seconds to process a filter on a photo they would immediately move to the foreground app, process the filter, and then move back to the background apps until the foreground app needed them again.
I'm sure it doesn't work well all the time yet, but the theory is that TD will actively allow all cores to work on a task or boot a low importance thread from a P core to run something new that is more important.
So basically what you are saying and more, I just have the suspicion that all apps and threads have to be running on the same priority for this to work correctly but I might be wrong.

Here is the difference between the previous pics which were targeted to noobs and what TD actually is supposed to do.
 

Hulk

Diamond Member
Oct 9, 1999
4,191
1,975
136
You have to realize that this is not being done for best performance or for best efficiency.
This is purely an Apple thing of providing the best user experience: it's about the noob streamer not losing FPS or dropping frames on the recording without having to do anything special, and it's about the noob content creator not having to wait any longer importing and exporting pics while converting their video.
Not feeling your system ever being bogged down is far more important than actual performance and if actual performance is still pretty high it's a win win.


I'm sure it doesn't work well all the time yet, but the theory is that TD will actively allow all cores to work on a task or boot a low importance thread from a P core to run something new that is more important.
So basically what you are saying and more, I just have the suspicion that all apps and threads have to be running on the same priority for this to work correctly but I might be wrong.

Here is the difference between the previous pics which were targeted to noobs and what TD actually is supposed to do.

I remember watching that video when it came out and being excited to see it in action. Unfortunately in practice it's not working the way she explains it. P cores will "spin" and do nothing while you browse the web even if the E cores are compressing video in Handbrake in the background.

As far as I can tell NONE of that stuff they are talking about with the TD is happening. I'm going to have a little fun with this now. It's more like, "We at Intel have decided to put the foreground application on the P cores because we need the highest possible benchmark scores. We then cleverly shove all of the background tasks on the E cores. Kind of like sweeping dirt under the rug but in a modern way. See, we're Intel and we make old new again!"

The most ironic part of this is that the time and money spent on that video could probably have been used to actually fix the thread director instead of advertising it.
 

dullard

Elite Member
May 21, 2001
24,998
3,327
126
I remember watching that video when it came out and being excited to see it in action. Unfortunately in practice it's not working the way she explains it. P cores will "spin" and do nothing while you browse the web even if the E cores are compressing video in Handbrake in the background.

As far as I can tell NONE of that stuff they are talking about with the TD is happening. I'm going to have a little fun with this now. It's more like, "We at Intel have decided to put the foreground application on the P cores because we need the highest possible benchmark scores. We then cleverly shove all of the background tasks on the E cores. Kind of like sweeping dirt under the rug but in a modern way. See, we're Intel and we make old new again!"

The most ironic part of this is that the time and money spent on that video could probably have been used to actually fix the thread director instead of advertising it.
No matter how much work Intel puts into it, it won't be enough. That is because Windows can just override the Thread Director on a whim. The right solution is a very long-term one: recompile the software to properly place code on the right cores. By the time that happens, Intel will have gotten away from possibly the worst combination, the one you have (8 P and 4 E cores while trying to do something in both the background and the foreground).
 

Hulk

Diamond Member
Oct 9, 1999
4,191
1,975
136
No matter how much work Intel puts into it, it won't be enough. That is because Windows can just override the Thread Director on a whim. The right solution is a very long-term one: recompile the software to properly place code on the right cores. By the time that happens, Intel will have gotten away from possibly the worst combination, the one you have (8 P and 4 E cores while trying to do something in both the background and the foreground).

Your prediction is grim and unfortunately most likely correct.
 

TheELF

Diamond Member
Dec 22, 2012
3,967
720
126
As far as I can tell NONE of that stuff they are talking about with the TD is happening. I'm going to have a little fun with this now. It's more like, "We at Intel have decided to put the foreground application on the P cores because we need the highest possible benchmark scores. We then cleverly shove all of the background tasks on the E cores. Kind of like sweeping dirt under the rug but in a modern way. See, we're Intel and we make old new again!"
But that would mean that the TD is working, at least in this workflow of running a heavy workload together with background stuff.
 

Hulk

Diamond Member
Oct 9, 1999
4,191
1,975
136
But that would mean that the TD is working, at least in this workflow of running a heavy workload together with background stuff.

I'm not following? Can you clarify?

I think the TD is working within an application but not among multiple workloads. Meaning if one application is running the TD seems to get it right in terms of assigning P's and E's to get the most performance. They had to get this right or the launch would have been disastrous from a benchmarking point of view.

The current logic starts to fail when multiple applications are running. The current logic is P's on foreground, E's on background, when it should be that plus the condition that when the foreground is idle, the P's move to the background to utilize all compute.
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
https://chipsandcheese.com/2022/01/28/alder-lakes-power-efficiency-a-complicated-picture/

A bunch of cool plots. Really great analysis overall.
In summary:
  • Out of the box, the 12700K prioritizes absolute performance over power efficiency. “Race to sleep” is complete bullshit, at least until you get down to very low power levels.
  • Golden Cove is very efficient below 4 GHz, especially with a vectorized workload
  • Even though it’s paired with E-Cores, Golden Cove still scales well to very low power levels.
  • Gracemont is very efficient with integer workloads in the low 3 GHz range.
  • 256-bit instructions give Gracemont a hard time. With libx264, it needs to go below 3 GHz before it really shines in terms of energy efficiency
  • When run at sane clocks, both Alder Lake architectures show significant efficiency gains compared to Skylake
In terms of energy efficiency at similar clocks, Zen 2 cores are excellent. Golden Cove has to drop below 2 GHz to finish the encode job with the same energy budget as desktop Zen 2. Gracemont can do better, but also has to clock below 2 GHz. Again, we see desktop Zen 2 cores failing to gain efficiency at lower clock speeds. Their energy efficiency peaks when boost is turned off, and going lower actually makes the cores pull more total power. Renoir is much better at scaling down to low power. At least in the near future, AMD can probably get by without maintaining separate E-Core and P-Core architectures. They’re already covering both bases by changing L3 size and optimizing the same architecture for different power and performance targets.
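
For anyone wanting to reason about these plots: the metric underneath them is energy per job, which is just average power multiplied by time to finish. A minimal Python sketch, assuming invented sample points (the clocks, watts, and seconds below are placeholders, not figures from the article):

```python
# Energy-per-job sketch. Every sample here is a made-up placeholder, not data
# from the Chips and Cheese article.

# (clock in GHz, average package power in W, time to finish the encode in s)
samples = [
    (1.0, 15, 600),
    (2.0, 20, 300),
    (3.0, 40, 200),
    (4.0, 90, 150),
    (4.9, 180, 125),
]


def energy_joules(power_w, time_s):
    """Energy used to finish the job: average power times elapsed time."""
    return power_w * time_s


def most_efficient(points):
    """Return the sample that finishes the job with the least energy."""
    return min(points, key=lambda s: energy_joules(s[1], s[2]))


for clock, power, seconds in samples:
    print(f"{clock:.1f} GHz: {energy_joules(power, seconds)} J")
print("most efficient clock:", most_efficient(samples)[0], "GHz")
```

In this made-up data the fastest setting is the least efficient, which is the "race to sleep" objection, and the slowest setting is also worse than the middle because a fixed power floor is paid for longer.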
 

sierpp

Junior Member
Aug 13, 2019
3
2
51
Something is fishy in this article. It looks like a sponsored one.
They compare a new and shiny Intel architecture to an almost 3-year-old AMD Zen 2. Time isn't standing still and AMD won't wait for Intel to catch up.
I bet that comparison with Zen 3 doesn't look so good from efficiency perspective. At least they didn't include FX 9590 in the graphs ;-)
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
Something is fishy in this article. It looks like a sponsored one.
They compare a new and shiny Intel architecture to an almost 3-year-old AMD Zen 2. Time isn't standing still and AMD won't wait for Intel to catch up.
I bet that comparison with Zen 3 doesn't look so good from efficiency perspective. At least they didn't include FX 9590 in the graphs ;-)
Huh? The comparisons to Zen 2 are already extremely favourable, idk what on earth you're talking about. Especially when Renoir gets added into the mix there, the power efficiency advantage is clear.

The only reason why Zen 3 wasn't tested is that it was easier for Clam to test on Zen 2. That's it.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
Sorry for my ignorance but easier how? He didn't have Zen 3 on hand or is there some other reason?
Yeah I don't think Clam does have one on hand. I know Cheese has a 5950X, but I don't think Clam does - I'm pretty sure the 3950X used for the review is Clam's desktop.
 

diediealldie

Member
May 9, 2020
77
68
61
Alder Lake P die annotation (i7 H laptop)


Alder Lake S die annotation (desktop i9)

I'm quite curious how Raptor Lake's Gracemont cores will be placed. There's a large void area between the Ring Agent and the GPU due to the Gracemont cluster's width. I think that area inefficiency should be fixed somehow to make 16 GM cores (4 clusters) efficient. Intel managed to put something there for ADL-P at least, though.