Question: E Cores - Useful or useless? What does the data tell us?

Hulk

Diamond Member
Oct 9, 1999
4,214
2,006
136
Here are some benches from my 13900K with E cores and P cores set to 4.3GHz.

Cinebench R23
MT

8P with Hyperthreading - 17,525 (94W package)
16E (1 P @0.8GHz) - 18,157 (120W package)

ST
P - 1614
E - 1169

Handbrake bench from our forums
8P with Hyperthreading - 230.47 sec/7.84fps
16E (1 P @0.8GHz) - 219.45 sec/8.23fps

CPUmark99 - Yes old and outdated but the result is curious
P - 683
E - 722
Perhaps something to do with Gracemont's 17 vs. Raptor Cove's 12 execution ports on this old single-threaded integer benchmark?

For highly threaded applications like Handbrake (x265) and Cinebench rendering, 16 E's have about the same throughput as 8 P's at the same frequency. The P's are more energy efficient and can of course clock higher. The E's are more area efficient.

CB ST shows Raptor to have 38% better IPC than Gracemont. Of course Raptor loses its HT capability here.

CB MT shows Raptor having 93% better per-core throughput than Gracemont, with Raptor's HT in operation.
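Both ratios fall straight out of the scores posted above; a quick sanity check in Python (this ignores the single P-core left at 0.8GHz in the 16E runs):

```python
# Cinebench R23 scores from the 4.3GHz iso-frequency runs above.
st_p, st_e = 1614, 1169    # single-thread scores
mt_p, mt_e = 17525, 18157  # multi-thread scores
p_cores, e_cores = 8, 16   # 8P (with HT) vs. 16E

# ST ratio: Raptor Cove's per-clock advantage on one thread.
st_ratio = st_p / st_e  # ~1.38 -> "38% better IPC"

# MT ratio: per-core throughput, with HT active on the P side.
mt_ratio = (mt_p / p_cores) / (mt_e / e_cores)  # ~1.93 -> "93% better"

print(f"ST: +{(st_ratio - 1) * 100:.0f}%, MT per-core: +{(mt_ratio - 1) * 100:.0f}%")
# -> ST: +38%, MT per-core: +93%
```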

As for compute/area, let's assume 8.08mm^2 for one P core including L2 and 10.28mm^2 for one 4-core E cluster with L2.
Using CB R23 MT at iso-frequency we find that Raptor Cove generates 271 Cinebench MT points per square mm while Gracemont generates 442 Cinebench MT points per square mm. At iso-frequency Gracemont provides 63% more compute with Cinebench for a given area. Of course some of this advantage is reduced when Raptor Cove is run at full frequency, but the case for Gracemont's area efficiency is significant.
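The arithmetic behind those density figures is straightforward; here it is in Python, treating the per-core areas above as assumptions (8.08 mm^2 per P-core and 10.28 mm^2 per 4-core E cluster, both including L2):

```python
# MT scores from the 4.3GHz iso-frequency runs above.
mt_p, mt_e = 17525, 18157

# Assumed die areas: 8 P-cores vs. four 4-core E clusters (16 E-cores).
p_area = 8 * 8.08   # mm^2
e_area = 4 * 10.28  # mm^2

p_density = mt_p / p_area  # ~271 points/mm^2
e_density = mt_e / e_area  # ~442 points/mm^2
advantage = e_density / p_density - 1  # ~0.63 -> "63% more compute per area"

print(f"P: {p_density:.0f}/mm^2, E: {e_density:.0f}/mm^2, E advantage: {advantage:.0%}")
# -> P: 271/mm^2, E: 442/mm^2, E advantage: 63%
```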

In terms of E and P "balancing", the 13900K is tilted a bit toward the P's. When you take the higher clocks of the P's into account, for well-threaded applications an 8+20 (or perhaps 24) part would provide a pretty even balance of E and P compute. Not that that metric means anything, I just find it interesting.
 
  • Like
Reactions: lightmanek
Jul 27, 2020
16,161
10,238
106
My main issue with the E-cores is that they seem hacked on. A better design would be to have a P+E cluster of one P-core and maybe 2 or even 4 E-cores, all sharing the same larger and combined caches. This would minimize the overhead in migrating threads between P and E cores. It may end up being die space efficient and even power efficient.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
My main issue with the E-cores is that they seem hacked on. A better design would be to have a P+E cluster of one P-core and maybe 2 or even 4 E-cores, all sharing the same larger and combined caches. This would minimize the overhead in migrating threads between P and E cores. It may end up being die space efficient and even power efficient.
Hacked on, how? They share the same L3, just as P-core to P-core.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,267
136
At iso-frequency Gracemont provides 63% more compute with Cinebench for a given area. Of course some of this advantage is reduced when Raptor Cove is run at full frequency, but the case for Gracemont's area efficiency is significant.

And further reduced if you reduce Raptor Cove size a bit to account for the disabled AVX 512 unit. And if you were to enable AVX 512 on the big cores, then the area efficiency comparison would look radically different for AVX 512 workloads.

That said, I think it's hard to tell based on AL/RL whether e-cores make sense for Intel going forward. Right now the coves are bigger than they ought to be, while the monts are far less energy efficient than they should be. If Intel's e-cores ever become energy efficient relative to their big cores like in the ARM world, then the hybrid architecture becomes much more compelling.
 
  • Like
Reactions: Tlh97 and dorion

Hulk

Diamond Member
Oct 9, 1999
4,214
2,006
136
And further reduced if you reduce Raptor Cove size a bit to account for the disabled AVX 512 unit. And if you were to enable AVX 512 on the big cores, then the area efficiency comparison would look radically different for AVX 512 workloads.

That said, I think it's hard to tell based on AL/RL whether e-cores make sense for Intel going forward. Right now the coves are bigger than they ought to be, while the monts are far less energy efficient than they should be. If Intel's e-cores ever become energy efficient relative to their big cores like in the ARM world, then the hybrid architecture becomes much more compelling.

Interesting thoughts. I believe the E cores will become more relevant as more software becomes better multithreaded and less bottlenecked by single core throughput. Outside of DC and rendering it's hard to find applications that will load up all of the P's and E's in a 13900 series CPU. Even Handbrake won't saturate all of them.
 

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
Here are some benches from my 13900K with E cores and P cores set to 4.3GHz.

Cinebench R23
MT

8P with Hyperthreading - 17,525 (94W package)
16E (1 P @0.8GHz) - 18,157 (120W package)

ST
P - 1614
E - 1169

Handbrake bench from our forums
8P with Hyperthreading - 230.47 sec/7.84fps
16E (1 P @0.8GHz) - 219.45 sec/8.23fps
You can set the thread count in MT, and while I don't know if Handbrake allows it, x264 in general has an option to set the thread count. That way you can eliminate the P-core influence on the performance results (it won't help with the power numbers). Using Process Lasso or similar to bind the task specifically to the E-cores or P-cores would also help.
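As an aside, one way to do that binding without a third-party tool is to compute the affinity mask yourself. A minimal sketch, assuming the usual 13900K enumeration where logical CPUs 0-15 are the 8 HT P-cores and 16-31 are the 16 E-cores (verify against your own topology first):

```python
# Build an affinity mask covering only the E-cores, assuming logical
# CPUs 0-15 are the P-cores (8 cores x 2 HT threads) and 16-31 the E-cores.
E_CORE_CPUS = range(16, 32)
mask = sum(1 << cpu for cpu in E_CORE_CPUS)

print(hex(mask))  # -> 0xffff0000

# On Windows this mask can be passed to `start /affinity`, e.g.:
#   start /affinity FFFF0000 HandBrakeCLI.exe ...
# On Linux, taskset accepts the same hex mask:
#   taskset 0xFFFF0000 HandBrakeCLI ...
```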

Also, the addition of the e-cores isn't to make the desktop CPU better for server workloads...
It's to give the user the same experience they are used to from their previous 8-core CPUs while they are doing up to another previous-generation CPU's worth of work in the background.
 

Mopetar

Diamond Member
Jan 31, 2011
7,831
5,980
136
I suspect that with some tweaking to the frequency there's a point where the E-cores would be more energy efficient as well as more area efficient. I think that Intel could release a chip for people who do have workloads that can utilize a lot of cores that is mainly (or even entirely) E-cores and it would be useful for that market segment.
 

Roland00Address

Platinum Member
Dec 17, 2008
2,196
260
126
Here are some benches from my 13900K with E cores and P cores set to 4.3GHz.
To my understanding, the E cores have an ideal voltage and frequency range, and past a certain point you are barely getting any additional performance while throwing voltage at them and massively increasing power consumption.

Note this is not just an Intel thing; it also applies to Apple Silicon, where there are parts of the curve that dictate when it's best to use the E and P cores.

What I am saying is that we need to remember desktop vs. laptop silicon, and that Intel really wants the E cores in the 3.2 to 3.8GHz range, not the 4 to 4.3GHz range. Sure, the chip can hit 4 and 4.3GHz, but it changes the power consumption and when to use each core from an efficiency standpoint.
 
  • Like
Reactions: Hulk

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Wouldn't it be better if each P+E cluster shared L2 and these clusters shared L3?

Right now the P's are separate and relatively far from the E's.
I see a couple of issues with this.

1) Communication between cores is arbitrated through the ring, not the module, so you wouldn't change anything about how they communicate.

2) You'd need a way to combine the L2 caches, and they look very different for Atom vs Core.

3) You'd ossify the ratio between the two cores, which probably wouldn't be ideal for different market segments.

And all this, for what? The cores aren't distant now. They're right next to each other on the same ring. Even if you could reduce communication delays between certain subsets, what exactly would that get you?
 
  • Like
Reactions: ashFTW

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
To my understanding, the E cores have an ideal voltage and frequency range, and past a certain point you are barely getting any additional performance while throwing voltage at them and massively increasing power consumption.

Note this is not just an Intel thing; it also applies to Apple Silicon, where there are parts of the curve that dictate when it's best to use the E and P cores.

What I am saying is that we need to remember desktop vs. laptop silicon, and that Intel really wants the E cores in the 3.2 to 3.8GHz range, not the 4 to 4.3GHz range. Sure, the chip can hit 4 and 4.3GHz, but it changes the power consumption and when to use each core from an efficiency standpoint.
Yeah, I think testing at iso-frequency isn't really a meaningful way to compare the two in terms of PPA. I'd pick a constant voltage, say 0.8V or 1.0V, and use that for both. That would let the V/F curve differences show up in the data.
 
  • Like
Reactions: moinmoin

Hulk

Diamond Member
Oct 9, 1999
4,214
2,006
136
You can set the thread count in MT, and while I don't know if Handbrake allows it, x264 in general has an option to set the thread count. That way you can eliminate the P-core influence on the performance results (it won't help with the power numbers). Using Process Lasso or similar to bind the task specifically to the E-cores or P-cores would also help.

Also, the addition of the e-cores isn't to make the desktop CPU better for server workloads...
It's to give the user the same experience they are used to from their previous 8-core CPUs while they are doing up to another previous-generation CPU's worth of work in the background.

The E's are slippery. I have worked with Process Lasso and sometimes they slip through. Anandtech noticed the same problem with Handbrake E scores when they first published the Alder Lake review.

x264 will use fewer threads than x265. Setting threads in Handbrake is pointless, as in my experience it never changes anything.
 
  • Like
Reactions: scineram

Hulk

Diamond Member
Oct 9, 1999
4,214
2,006
136
Yeah, I think testing at iso-frequency isn't really a meaningful way to compare the two in terms of PPA. I'd pick a constant voltage, say 0.8V or 1.0V, and use that for both. That would let the V/F curve differences show up in the data.

Great idea. Let's see it!
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
The E's are slippery. I have worked with Process Lasso and sometimes they slip through. Anandtech noticed the same problem with Handbrake E scores when they first published the Alder Lake review.

x264 will use fewer threads than x265. Setting threads in Handbrake is pointless, as in my experience it never changes anything.
Now that we have ADL-N, maybe that will help clean up the data somewhat. It's a shame they blocked disabling of all the P cores in the other chips.
 
  • Like
Reactions: dorion

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Wouldn't it be better if each P+E cluster shared L2 and these clusters shared L3?

One of those ideas that looks great on paper, but is not so great in practice.
Why do we have cache levels with the latencies and sizes they have now? Being able to clock to 6GHz and still run 2MB of L2 at 15-cycle latency is an important part of that equation.
Creating a 6MB L2 cluster with 1P and 4E cores would tank latency big time: not only does the size go up, the complexity of caching goes up as well due to sharing between cores, making sure a toxic workload on one unit does not ruin performance for all five, etc.
Apple can do it with great latency in cycles because they run at half the frequency and don't have to deal with 24 cores either.
 
Jul 27, 2020
16,161
10,238
106
Creating a 6MB L2 cluster with 1P and 4E cores would tank latency big time: not only does the size go up, the complexity of caching goes up as well due to sharing between cores
Is it impossible to create a segmented cache where part of the cache is lower latency, catering to the P-core's higher frequency while the other part is more relaxed for the E-core's frequency deficit?
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Is it impossible to create a segmented cache where part of the cache is lower latency, catering to the P-core's higher frequency while the other part is more relaxed for the E-core's frequency deficit?

So even more complexity? And what exactly is the gain of such a cache, if the size is limited while all the drawbacks still apply?

They went with the tried and true ring solution. I think it has been discussed on these very forums that Alder Lake was let down not by the engineers but by marketing, who asked for a mobile CPU that was not really meant for desktop. So everything from the memory subsystem to power delivery was designed with that in mind.
And then marketing came back asking for performance, which required pushing the E cores to clocks that are not achievable with FIVR-generated power, and that started a vicious cycle of power consumption.

It does not take a rocket scientist to see that Intel will fix these deficiencies by moving the E cores to a different power plane, and will continue beefing up the core. People will soon forget the original E cores and welcome the new ones.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
One of those ideas that looks great on paper, but is not so great in practice.
Why do we have cache levels with the latencies and sizes they have now? Being able to clock to 6GHz and still run 2MB of L2 at 15-cycle latency is an important part of that equation.
Creating a 6MB L2 cluster with 1P and 4E cores would tank latency big time: not only does the size go up, the complexity of caching goes up as well due to sharing between cores, making sure a toxic workload on one unit does not ruin performance for all five, etc.
Apple can do it with great latency in cycles because they run at half the frequency and don't have to deal with 24 cores either.
Eh, I don't think the cache size would be that big of an issue by itself. Add another, say, 2-3 cycles. Wouldn't be the end of the world.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Eh, I don't think the cache size would be that big of an issue by itself. Add another, say, 2-3 cycles. Wouldn't be the end of the world.

Quite a different architecture though: a P-core has a 64-byte-per-cycle interface to itself, while the four E-cores share 64 bytes, so when all 4 are active that is 16 bytes per cycle each. So the latency floor would not start from the current P-core in RPL, but rather from the ADL E-core cluster (both 2MB sized): already in the 20+ cycle region to start with.
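A back-of-envelope illustration of what that sharing costs in peak bandwidth, assuming the interface widths above and a 4.3GHz clock (peak figures only; sustained bandwidth will differ):

```python
# Rough peak L2 bandwidth per core, using the interface widths quoted above:
# a P-core has a 64 B/cycle interface to itself, while a 4-core E cluster
# shares a single 64 B/cycle interface.
def per_core_l2_bw(width_bytes, sharing_cores, freq_ghz):
    """Peak per-core L2 bandwidth in GB/s (bytes/cycle * cycles/ns)."""
    return width_bytes / sharing_cores * freq_ghz

print(f"P-core alone:         {per_core_l2_bw(64, 1, 4.3):.1f} GB/s")  # 275.2
print(f"E-core, cluster busy: {per_core_l2_bw(64, 4, 4.3):.1f} GB/s")  # 68.8
```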

I think an important next step for Intel's memory subsystem will come with chiplets. I just hope it won't be a mesh style mess.
 

Hulk

Diamond Member
Oct 9, 1999
4,214
2,006
136
Quite a different architecture though: a P-core has a 64-byte-per-cycle interface to itself, while the four E-cores share 64 bytes, so when all 4 are active that is 16 bytes per cycle each. So the latency floor would not start from the current P-core in RPL, but rather from the ADL E-core cluster (both 2MB sized): already in the 20+ cycle region to start with.

I think an important next step for Intel's memory subsystem will come with chiplets. I just hope it won't be a mesh style mess.

I'm not sure chiplets are needed for the desktop at this point in time. Intel has already shown that they can do 8+16 monolithic on Intel 7. If they move this arrangement to Intel 4 there will be additional transistors available to improve both the P's and E's. Unless of course they can achieve the same result with chiplets at lower manufacturing cost;)
 
Jul 27, 2020
16,161
10,238
106
I'm annoyed that they are not experimenting more. Where is the 2P+32E part? It should easily be possible on Intel 7 and it's gonna sell too. Heck, introduce dual socket HEDT and people could have 80 threads under $3000 with 4P+32E CPUs. Why are big companies so dumb?
 

Kocicak

Senior member
Jan 17, 2019
982
973
136
I'm annoyed that they are not experimenting more. Where is the 2P+32E part?

That CPU would badly fail in gaming benchmarks. Even reviews that don't pay much attention to gaming dedicate a huge portion to it.

Excellent multithreading performance would not compensate for that dip in gaming performance.

Experimenting is what you can do if you have some money to burn, I am afraid Intel is not in that position now.
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
I'm annoyed that they are not experimenting more. Where is the 2P+32E part? It should easily be possible on Intel 7 and it's gonna sell too. Heck, introduce dual socket HEDT and people could have 80 threads under $3000 with 4P+32E CPUs. Why are big companies so dumb?

That would be a very niche use case.

Personally I don't really get the whole P and E cores thing, especially the need for a gazillion e-cores. Shouldn't 2-4 very low power cores (much lower power than they are now, of course with lower performance) be enough for background tasks? What is actually meant by background tasks? Windows updates? Checking email/other apps and notifications? All of these are likely IO (network) limited.

I'm not a fan of Apple, but IMHO they got it right for consumers: very beefy, high single-threaded performance cores, just fewer of them in total. 32 E cores won't help applications load faster, but a single very beefy core does. So I would see it exactly the opposite: like 8+ P cores and maybe 4 E cores.
But it all really depends greatly on the OS and schedulers as well. What actually is the definition of a background task? As far as I understand, all user actions go through P-cores.
 
Jul 27, 2020
16,161
10,238
106
In the context of browsing, more E-cores would mean more worker threads of background tabs running on the E-cores, while the tab in focus feels snappier because the P-cores don't have to deal with the less important worker threads.
 

SPBHM

Diamond Member
Sep 12, 2012
5,056
409
126
4.3GHz seems very high for E cores and very low for P cores, so the power efficiency is going to reflect that. The E cores are probably happier closer to 3.5GHz or so in terms of delivering the best performance/W balance.
 
  • Like
Reactions: ZGR