Discussion Intel current and future Lakes & Rapids thread

Page 598

nicalandia

Diamond Member
Jan 10, 2019
posted on the wrong thread
 

Attachments

  • 1641653564220.png (20.9 KB)
  • 1641653616618.png (35.5 KB)

IntelUser2000

Elite Member
Oct 14, 2003
Windows is not a real time OS.

Yup you are right.

However, unlike the mobile operating systems, it has a tick timer that fires relatively frequently: about 15.6ms by default, so pretty much "real-time". There's also preemptive multitasking, which allows you to have multiple running applications open.

So you need to align your power management features to account for all scenarios, complicated further by the fact that there is application support dating back to the '80s.

An x86 power management core might be the way to go: it would be a lot less dependent on drivers to get working, since it has binary compatibility with the main cores.

Although I don't know if they need Gracemont for that, even a low-powered one. They used a downclocked Silvermont core for their last manufactured LTE modem. Perhaps the performance is needed to reduce context-switching latencies when it needs to wake the main cores again?

Things like hopping between cores randomly also have to do with trying to extract a few extra percent of performance. Yes, you get a few percent extra versus keeping a single-threaded application pinned to one core. Of course, they don't need to do that anymore with Turbo and everything.
 

IntelUser2000

Elite Member
Oct 14, 2003
0.95V @ 3.3GHz with 1 E-core running Linpack: package power is 9W
1.25V @ 4GHz with 1 E-core running Linpack: package power is 20W

Just curious. What's the core power running 1 E-core on Linpack?

According to the test done by Chips and Cheese.

Tremont 2.9GHz:
-2.9W core, 5.34W package ST
-The first additional core adds 2.1W to core power, but the next adds only 1.4W, meaning lower clocks. The last adds a further 1.8W; the 3 extra cores add 5.44W in total, for an average of 1.81W per core.

Skylake 6600K:
-15.531W core, ~22W package ST
-The first additional core adds 8W, the next 6.5W, and the last almost 6W, for an average of 7.13W per core.

Gracemont:
-17.651W core, 26.5W package ST
-Every additional core adds 5W, for an average of 5W per core, indicating no clock reduction (at least after the second core).

Chips and Cheese notes the package adds 6W, but when a single core is active the adder is 9W. I don't know if that's an anomaly.

The interesting thing is that despite being "core" power, Gracemont is higher power than Skylake at 1C while the addition per core is lower, meaning even the core power figure carries a significant overhead.

So 5W may be a realistic core power for the default Gracemont at 3.9GHz, not 10W as some suggested. For the N5095 Tremont it seems to be about 2W. In this particular benchmark Gracemont is twice as fast, so the perf/watt is not much lower.

But we must consider that Gracemont might be way out of its ideal frequency range. At 3.3GHz with a 25% reduction in voltage, assuming nothing else changes, we end up with 2.7W while still being 60% faster in this particular benchmark.

-Per clock, Gracemont is 48% faster here, and at 2.9GHz it might end up at 2.38W. That's pretty close to Tremont's 2.1W, which would be a dramatic perf/watt advantage.
-Skylake seems to use 40-60% more power for otherwise the same performance. It's probably 60%, since the first core addition resulted in an additional 8W for Skylake.
-The Golden Cove core is 17-19W, by the way. Intel's claim of Gracemont having 2x perf/clock looks correct.

I have to say, based on this it's quite a bit more efficient a core than Tremont. A proper implementation in a laptop (tablet?!) may result in similar core power at equal frequencies, meaning the 30% perf/clock increase would translate directly into perf/watt.
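The extrapolations above rest on the classic dynamic-power relation P ~ C * V^2 * f. A quick sketch of that arithmetic (the 5W per-core base comes from the scaling data in this post; the 25% voltage reduction is the post's assumption, not a measurement):

```python
# Classic dynamic-power scaling: P ~ C * V^2 * f.
# 5 W per Gracemont core at 3.9 GHz is taken from the thread's scaling data;
# the 25% voltage drop at 3.3 GHz is an assumption, not a measurement.

def scale_power(p_watts: float, v_ratio: float, f_ratio: float) -> float:
    """Estimate dynamic power after scaling voltage and frequency."""
    return p_watts * v_ratio**2 * f_ratio

base = 5.0                                   # W per core at 3.9 GHz
v_only = scale_power(base, 0.75, 1.0)        # quadratic voltage term alone
both = scale_power(base, 0.75, 3.3 / 3.9)    # voltage and frequency together

print(f"voltage term only: ~{v_only:.1f} W")  # ~2.8 W, near the 2.7 W above
print(f"with frequency too: ~{both:.1f} W")   # ~2.4 W
```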
 

coercitiv

Diamond Member
Jan 24, 2014
But we must consider that Gracemont might be way out of its ideal frequency range. At 3.3GHz with a 25% reduction in voltage, assuming nothing else changes, we end up with 2.7W while still being 60% faster in this particular benchmark.
It is out of its ideal frequency range; the V/f plot we have for Gracemont shows good scaling up until 3GHz, after which it moves to another slope to scale up to 4GHz.
  • From 2GHz to 3GHz the voltage delta is ~130mV, or 17%.
  • From 3GHz to 4GHz the voltage delta is ~350mV, or 37%.
However, keep in mind the same applies to Golden Cove: based on its V/f curve, this core works efficiently up until ~3.6GHz.
  • From 2.6GHz to 3.6GHz the voltage delta is ~100mV, or 12%.
  • From 3.6GHz to 4.6GHz the voltage delta is ~250mV, or 27%.
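Those two slopes can be sketched as a piecewise-linear V/f estimate. Only the ~130mV and ~350mV per-GHz deltas come from the plot; the 0.75V base at 2GHz is a placeholder assumption:

```python
# Piecewise-linear V/f sketch of the Gracemont curve described above.
# The 0.75 V base at 2 GHz is a placeholder assumption; only the ~130 mV
# and ~350 mV per-GHz deltas come from the plot.

def gracemont_voltage(freq_ghz: float, v_base: float = 0.75) -> float:
    """Shallow slope up to 3 GHz, much steeper slope from 3 to 4 GHz."""
    if freq_ghz <= 3.0:
        return v_base + 0.130 * (freq_ghz - 2.0)
    return v_base + 0.130 + 0.350 * (freq_ghz - 3.0)

for f in (2.0, 3.0, 3.5, 4.0):
    print(f"{f:.1f} GHz -> ~{gracemont_voltage(f):.3f} V")
```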
Here's the 12700K running 3.6GHz P-core / 3GHz E-core / 3GHz bus; it scores just under 18K in CB23 while staying under 75W. The CPU is not undervolted, but the motherboard AC/DC Loadline parameters are set to favorable values (on Auto my board overvolts the CPU like crazy). RAM is overclocked, but I can't be bothered to undo that as well, not for CB testing anyway.
CB23-opt-clocks.png
 

TESKATLIPOKA

Platinum Member
May 1, 2020
The packages in the CNET photos were almost certainly the MTL-M (U9) 2+8+2 configuration, but MTL-P (P28/H45) will most likely use a 6+8 CPU tile.

During the Intel Accelerated event, Intel showed off a test wafer of Meteor Lake compute tiles that measure 4.8 mm x 7.9 mm. The Meteor Lake test chips that CNET photographed during their Fab 42 tour contain a top tile that also measures 4.8 mm x 7.9 mm, which strikes me as being somewhat beyond coincidental. Not locating the SoC tile in between the CPU and GPU tiles seems like a bold strategy, as it would make interconnect routing a nightmare. So I think @wild_cracks and @Locuza_ might need to reassess.
I highly doubt Meteor Lake will have only 2C8c and 6C8c CPU tiles when this kind of configuration is already present in Alder Lake. Let's not forget Raptor Lake will be released before Meteor Lake, too. I think it's highly likely that they will increase either P-cores or E-cores, or both.
 

mikk

Diamond Member
May 15, 2012
I'm not expecting a core count increase for Meteor Lake. If the Reddit leak is accurate, even Arrow Lake stays at 8 P-cores, although with 32 E-cores.

Will feature an updated compute tile with 8/32 config for the high end enthusiast products

It clearly refers to a desktop tile, mobile Arrow Lake may not get this.

It is said that the mobile version of Arrow Lake would feature 6 big cores and 8 little cores.


Maybe 6+8 refers to ARL-P 28W and not the higher end H series. I think 6+16 is the next logical step for ARL-H.
 

Hulk

Diamond Member
Oct 9, 1999
It is out of its ideal frequency range; the V/f plot we have for Gracemont shows good scaling up until 3GHz, after which it moves to another slope to scale up to 4GHz.
  • From 2GHz to 3GHz the voltage delta is ~130mV, or 17%.
  • From 3GHz to 4GHz the voltage delta is ~350mV, or 37%.
However, keep in mind the same applies to Golden Cove: based on its V/f curve, this core works efficiently up until ~3.6GHz.
  • From 2.6GHz to 3.6GHz the voltage delta is ~100mV, or 12%.
  • From 3.6GHz to 4.6GHz the voltage delta is ~250mV, or 27%.
Here's the 12700K running 3.6GHz P-core / 3GHz E-core / 3GHz bus; it scores just under 18K in CB23 while staying under 75W. The CPU is not undervolted, but the motherboard AC/DC Loadline parameters are set to favorable values (on Auto my board overvolts the CPU like crazy). RAM is overclocked, but I can't be bothered to undo that as well, not for CB testing anyway.

Do we know how those V/f plots compare to P/f under load plots?
 

Exist50

Platinum Member
Aug 18, 2016
I'm not expecting a core count increase for Meteor Lake. If the Reddit leak is accurate, even Arrow Lake stays at 8 P-cores, although with 32 E-cores.

It clearly refers to a desktop tile, mobile Arrow Lake may not get this.

Maybe 6+8 refers to ARL-P 28W and not the higher end H series. I think 6+16 is the next logical step for ARL-H.
I'm curious what people would think of 4+16 for the P die.
 

uzzi38

Platinum Member
Oct 16, 2019
I'm curious what people would think of 4+16 for the P die.
If the idea is still to use that die for gaming laptops, then I don't think it's a good idea. Games are starting to scale past 4c w/HT now; it's essentially the absolute minimum spec right now, and even then some games see significant performance loss with 4c8t vs 6c12t.

For thin and light laptops, it's absolutely fine. Or rather, probably better than 6+8 would be. Just not for gaming laptops.
 

Exist50

Platinum Member
Aug 18, 2016
If the idea is still to use that die for gaming laptops then I don't think it's a good idea. Games are starting to scale past 4c w/HT now, it's essentially the absolute minimum spec needed right now, and even then some games see significant performance loss with 4c8t vs 6c12t.

For thin and light laptops, it's absolutely fine. Or rather, probably better than 6+8 would be. Just not for gaming laptops.
In a gaming context, I'm curious how many threads have to be "as fast as possible". Right now GRT is written off for performance-sensitive threads because of the gap with GLC of, what, around 2/3 of GLC's performance? How does the tradeoff change as that gap shrinks? It would make a good study for anyone with ADL and a lot of time on their hands.
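One way to frame that study is a toy latency model: frame time is gated by the slowest latency-critical thread, so performance only degrades once critical threads spill off the P-cores, and the penalty is roughly the E/P perf ratio. All numbers here are made up for illustration:

```python
# Toy model: a game with some latency-critical threads on a hybrid CPU.
# E-core performance is expressed relative to a P-core (the thread puts
# GRT at roughly 2/3 of GLC). Purely illustrative, not measured data.

def frame_time(critical_threads: int, e_core_perf: float, p_cores: int = 4) -> float:
    """Relative frame time, gated by the slowest critical thread."""
    spilled = max(0, critical_threads - p_cores)  # threads pushed to E-cores
    return 1.0 if spilled == 0 else 1.0 / e_core_perf

for ratio in (0.5, 0.67, 0.8, 0.9):
    print(f"E/P perf {ratio}: frame time x{frame_time(6, ratio):.2f}")
```

As the E/P ratio approaches 1.0, the spill penalty vanishes, which is the crux of the question above.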
 

Hulk

Diamond Member
Oct 9, 1999
In a gaming context, I'm curious how many threads have to be "as fast as possible". Right now, GRT is written off for performance sensitive threads because of the gap with GLC of what? Around 2/3 GLC's performance? How does the tradeoff change as that gap shrinks? Would make a good study for anyone with ADL and a lot of time on their hands.

This is a really interesting question. Since we can shut off P-cores in the BIOS, it would be easy to test games at 8+0, 6+0, 4+0, 4+2, 4+4.
I'd do it, except that I don't have a discrete GPU, so I don't think there would be much use. Also, I don't game, so I don't have any games on my system.
 

Hulk

Diamond Member
Oct 9, 1999
This is kind of interesting. As I've written about, I use DxO PureRaw to process RAW image files. It provides really great results but requires a lot of compute. As it turns out, I'm always rendering video or something while I'm running images through PureRaw on their way to PS for editing. The E's alone aren't strong enough to move the images through PureRaw for me, but there is an option to use the GPU. My iGPU, overclocked to an easy 1800MHz, provides a HUGE throughput increase over the stock (auto) GPU setting and nearly equals the performance of 4 P's. It's funny: now when I'm rendering, processing with PureRaw, and editing in PS, the CPUs AND the iGPU are slammed!


DxO PureRaw time to convert 4 RAW images from Sony a6300 using the "DeepPrime" setting
| CPU/GPU | Configuration | Time (min.sec) | Seconds | Time per photo (s) | Score | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| 12700K | 8+4 | 2.01 | 121 | 30.25 | 8.26 | 100% |
| 12700K | 8+0 | 2.08 | 128 | 32 | 7.81 | 95% |
| 12700K | 7+0 | 2.21 | 141 | 35.25 | 7.09 | 86% |
| 12700K | 6+0 | 2.42 | 162 | 40.5 | 6.17 | 75% |
| 12700K | 5+0 | 3.02 | 182 | 45.5 | 5.49 | 66% |
| 12700K | 4+0 | 3.36 | 216 | 54 | 4.63 | 56% |
| 12700K | 770 iGPU o/c 1800 | 3.44 | 224 | 56 | 4.46 | 54% |
| 12700K | 3+0 | 4.38 | 278 | 69.5 | 3.60 | 44% |
| 12700K | 770 iGPU (stock auto) | 5.10 | 310 | 77.5 | 3.23 | 39% |
| 12700K | 2+0 | 6.41 | 401 | 100.25 | 2.49 | 30% |
| Surface Laptop 2 | 620 iGPU | 7.46 | 466 | 116.5 | 2.15 | 26% |
| Surface Laptop 2 | 8250U | 10.35 | 635 | 158.75 | 1.57 | 19% |
| 12700K | 0+4 | 11.00 | 660 | 165 | 1.52 | 18% |
| 12700K | 0+1 | 13.10 | 790 | 197.5 | 1.27 | 15% |
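The Score column reads as 1000 divided by total seconds, with Rank normalized to the fastest (8+4) run; that interpretation is my reading of the numbers, not stated by Hulk. A quick check on a few rows:

```python
# Checking the apparent formula behind the DxO PureRaw table above:
# Score = 1000 / total seconds, Rank = Score / best Score.
# Config -> total seconds for 4 photos, taken from the table.
results = {"8+4": 121, "8+0": 128, "4+0": 216, "0+4": 660, "0+1": 790}

best = 1000 / min(results.values())          # the 8+4 run is the fastest
for config, seconds in results.items():
    score = 1000 / seconds
    print(f"{config}: score {score:.2f}, rank {score / best:.0%}, "
          f"{seconds / 4:.2f} s/photo")
```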
 

Exist50

Platinum Member
Aug 18, 2016
This is a really interesting question. Since we can shut off P cores in the BIOS it would be easy to test games 8+0, 6+0, 4+0, 4+2, 4+4.
I'd do it except for the fact that I don't have a discrete GPU so I don't think there would be much use. Also I don't game so I don't have any games on my system.
You'd also need to lock frequencies for each core type and vary them to adjust the performance gap. Sadly, I don't have Alder Lake, so I can't do it even if I had the time.
 

Mopetar

Diamond Member
Jan 31, 2011
The really interesting part to me is how little difference it makes going from 1 efficiency core to 4 of them. What's the bottleneck that's holding them back so badly?

I'm also curious how it performs in a 1+0 configuration. We can probably extrapolate, but the efficiency core doesn't seem as though it would be that much worse.
 

Hulk

Diamond Member
Oct 9, 1999
You'd also need to lock frequencies for each core type and vary them to adjust the performance gap. Sadly, I don't have Alder Lake, so can't do it even if I had the time.

When I finally get a GPU I can do it. It's easy to lock them all at 3.8GHz, or some frequency around there that the E's can easily hold during benching.
 

Mopetar

Diamond Member
Jan 31, 2011
The E core clusters may be heavily L2 throughput bound in these tests.

Maybe that's the reason, but I'm not sure. Having a 12900 to test scaling beyond that would be interesting. There is only 2 MB of L2 for the efficiency cores, so it's possible that if all four cores are trying to run a heavy workload, the cache is getting thrashed really badly. Having the 0+3 and 0+2 results as well might be enlightening; if those had better performance, that would certainly be the case.

I looked up the information on WikiChip and something really stood out to me. I'm not sure if it's just a typo, but the 12700K is listed as having 1 MB of L3 cache for the efficiency cores. The 12900K lists 6 MB of L3 cache for the efficiency cores, which makes me think it's a typo, but if it weren't, that would further suggest the cache is the culprit.
 

tomatosummit

Member
Mar 21, 2019
Maybe that's the reason, but I'm not sure. Having a 12900 to test scaling beyond that would be interesting. There is only 2 MB of L2 for the efficiency cores so it's possible that if all four cores are trying to run a heavy workload the cache is getting thrashed really badly. Having the 0+3 and 0+2 results as well might be enlightening. If those had better performance that would certainly be the case.

I looked up the information on wikichip and something really stood out to me. I'm not sure if it's just a typo, but the 12700K is listed has having 1 MB of L3 cache for the efficiency cores. The 12900K lists 6 MB of L3 cache for the efficiency cores, which makes me think it's a typo, but if it weren't that would further suggest the cache being the culprit.
That comes down to how the cache is cut down across the CPUs: the i9 has 30MB, the i7 25MB, and the i5 20MB.
In theory, removing one cluster from the i7 would take out the 3MB of cache associated with it, but Intel has disabled 1/6th of the L3 cache instead of just 1/10th. I don't think anyone has actually figured out how or why it's done that way yet.

My own guess is they can't just disable the slice, because it would take out the 3MB associated with the other E-core cluster. So they've instead disabled 1/6th of the cache in all 10 (or 5) L3 modules. I think that's possible, as each cache module is made up of at least six slices of 512KB or 1024KB.
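The arithmetic behind that guess, using tomatosummit's description of the full die as 10 L3 modules of 3MB each (the module count is the post's reading, not an Intel spec):

```python
# L3 arithmetic for the Alder Lake die as described above:
# the full (i9) die carries 10 modules x 3 MB = 30 MB of L3.
modules, mb_per_module = 10, 3

i9_l3 = modules * mb_per_module            # 30 MB on the full die
naive_i7 = (modules - 1) * mb_per_module   # 27 MB if one whole module went
actual_i7 = i9_l3 * 5 // 6                 # 25 MB: 1/6th trimmed everywhere

print(i9_l3, naive_i7, actual_i7)          # 30 27 25
```

The shipping i7 figure matches the "1/6th of every module" reading, not the "drop one module" reading.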
 

Hulk

Diamond Member
Oct 9, 1999
The really interesting part to me is how little difference it makes going from 1 efficiency core to 4 of them. What's the bottleneck that's holding them back so badly.

I'm also curious of how it performs in a 1+0 configuration. We can probably extrapolate, but the efficiency core doesn't seem as though it would be that much worse.

I have found that it's hard to keep the E's "out of the action" even with Process Lasso, unless you kill them in the BIOS. When I get some time I'll start up with 1 P active so I can do some additional testing. Thing is, with these low amounts of tested compute it takes so long to run my bench ;)
 

lobz

Platinum Member
Feb 10, 2017
The really interesting part to me is how little difference it makes going from 1 efficiency core to 4 of them. What's the bottleneck that's holding them back so badly.

I'm also curious of how it performs in a 1+0 configuration. We can probably extrapolate, but the efficiency core doesn't seem as though it would be that much worse.
I'd say any L2-sensitive task.
 

dullard

Elite Member
May 21, 2001
"H2 20" is a huge red flag, since product qualifications are planned down to the week. Even at Intel, where qualification takes 4 or 5 quarters (which is longer than everywhere else), not even saying which quarter they plan on finishing qual means they expect delays.

Or you can just call me an AMD fanboy LOL.
Show examples from AMD/NV? My two cents: a half-year window is acceptable when a product is in early-to-mid pre-silicon development, not so much when first silicon has already arrived back.
However, like I said, a half-year window is just FUD and likely CYA.
The whole point of the post-silicon schedule is a high-confidence time plan, based on prior experience, to achieve qualification, which enables production. That is precisely why it is planned down to the work-week.
@dmens, I was thinking about what you said: that stating in January an H2 launch for the same year is a huge red flag. I'm wondering what your thoughts are on this AnandTech post from January about an H2 launch for the same year: https://www.anandtech.com/show/17152/amd-cpus-in-2022-ces
"Next-Gen Ryzen, featuring Zen 4 cores, 5nm manufacturing, and the new AM5 socket, is coming to market in the second half (2H) of 2022"
 

Thala

Golden Member
Nov 12, 2014
Although I don't know if they need Gracemont for that, even a low powered one. They used a downclocked Silvermont core for their last manufactured LTE modem. Perhaps the performance is needed to reduce context switching latencies when it needs to wake up the main cores again?

A few corrections regarding the LTE modems: the XMM7480 used a Cortex-A5, the XMM7560 used Airmont, and the XMM7660 used a Cortex-A5 again; it was also the last LTE modem from Intel.
I'm not sure what you mean by "main cores", but the cores mentioned above are the main cores of the modem. The modem itself has lots of other cores in addition.
 