Question Raptor Lake - Official Thread


Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
Since we already have the first Raptor Lake leak, I'm thinking it should have its own thread.
What do we know so far?
From Anandtech's Intel Process Roadmap articles from July:

Built on Intel 7 with upgraded FinFET
10-15% PPW (performance-per-watt) improvement
Last non-tiled consumer CPU as Meteor Lake will be tiled

I'm guessing this will be a minor update to ADL with just a few microarchitecture changes to the cores. The larger change will be the new process refinement allowing 8+16 at the top of the stack.

Will it work with current Z690 motherboards? If so, that could be a major selling point for people to move to ADL now rather than wait.
 
  • Like
Reactions: vstar

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
While that certainly paints an "optimum case scenario" for Gracemont, it does so largely by completely eliminating the design trade-offs of using a "quad" of E-cores with a shared, theoretically bandwidth-limited L2 cache segment and a single, shared ring stop. As we have seen in other benches, attempting to fully load the E cores exposes those trade-offs and reduces the throughput of those cores, reducing their effective performance per area.

In other words, using ST numbers for the E cores distorts the real picture in heavy MT scenarios.
I agree, but since fully disabling the P cores and their L3$ is currently impossible, and I don't think we will ever see an E-core-only CPU, this is the best-case scenario for Gracemont.
 

jpiniero

Lifer
Oct 1, 2010
14,605
5,223
136
I agree, but since fully disabling the P cores and their L3$ is currently impossible, and I don't think we will ever see an E-core-only CPU, this is the best-case scenario for Gracemont.

There is an Alder Lake N die coming. I am still assuming that it is 0+4 but it could be 1+4.
 

Zucker2k

Golden Member
Feb 15, 2006
1,810
1,159
136
While that certainly paints an "optimum case scenario" for Gracemont, it does so largely by completely eliminating the design trade-offs of using a "quad" of E-cores with a shared, theoretically bandwidth-limited L2 cache segment and a single, shared ring stop. As we have seen in other benches, attempting to fully load the E cores exposes those trade-offs and reduces the throughput of those cores, reducing their effective performance per area.

In other words, using ST numbers for the E cores distorts the real picture in heavy MT scenarios.
Couldn't the same be said of the other cores as well? Each of those cores had access to full resources that won't be available when other cores are engaged.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
@igor_kavinski You are right, but you're missing my point. 25-30 hours of idle should translate into 15+ hours of web browsing, not 8-9. Currently there's a 50%+ gap in battery life per Wh compared to ARM-based chips.

Since Icelake, idle battery life has been in the same ballpark as ARM platforms, but there's a huge gap in bursty workloads like web browsing. The cause of most of the gap is pretty obvious - the Intel platform cannot idle in between workloads as well as the ARM processors do.

Also, the pre-Icelake Intel platforms scored a lot lower in idle battery tests but do just as well if not better in web browsing, suggesting the idle power numbers for Icelake and Tigerlake are almost irrelevant. Comparison to AMD shows the same: idle battery life for AMD tests lower, but web browsing battery life is equal or higher. Who cares about idle battery life? No one buys a laptop to keep it idle. Idle only matters if it contributes to bursty workload battery life, which in this case it doesn't.

You quoted software optimizations but they mean little if the hardware is deficient.

How do you see integrating the PCH into the CPU die helping performance or battery life in any significant way? Intel is disaggregating the CPU die into CPU/GPU/SoC tiles for Meteor Lake, not integrating the PCH. Intel is doing the exact opposite of what you're suggesting.

You are right, it's not on-die, but technologies like Foveros significantly reduce the problems related to multi-chip packaging, things like power use and latency. They won't be at the level of on-die, but close enough that even GPUs will eventually use it to make MCM work for them.

And mind you, the on-package PCH on Haswell allowed 50% gains to happen. I am not saying it's the single source of those gains, certainly not, but without it they likely wouldn't have happened.
 

dullard

Elite Member
May 21, 2001
25,069
3,418
126
You don't think perf/watt has anything to do with battery life? How in the world could you reach that conclusion??
Performance per Watt is tangentially related. The direct measurement is Joules per task (or tasks per joule).

Suppose the user woke up the computer to do 1000 tasks. Suppose computer (A) used 25 W of power for 10 seconds then went back to sleep, and computer (B) used 15 W of power for 20 seconds then went back to sleep. Both computers completed the assigned tasks. Computer (A) had a performance per Watt of 1000 tasks / 25 W = 40 tasks/W. Computer (B) had a performance per Watt of 1000 tasks / 15 W = 66.7 tasks/W. So you could conclude that computer (B) has the higher performance per Watt.

But computer (A) did all the tasks using 25 W * 10 s = 250 Joules, while computer (B) required 15 W * 20 s = 300 Joules. Computer (B) had the higher performance per Watt, yet it drained the battery more.
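To make the distinction concrete, here is a minimal sketch of the same arithmetic (the same hypothetical numbers as above):

```python
# Same hypothetical numbers as above: perf/W ranks computer B higher,
# but energy per task (what actually drains the battery) favors computer A.
TASKS = 1000

def perf_per_watt(tasks, watts):
    return tasks / watts      # tasks per Watt

def energy_joules(watts, seconds):
    return watts * seconds    # Joules drawn from the battery

# Computer A: 25 W for 10 s; computer B: 15 W for 20 s
print(perf_per_watt(TASKS, 25), energy_joules(25, 10))  # 40.0 tasks/W, 250 J
print(perf_per_watt(TASKS, 15), energy_joules(15, 20))  # ~66.7 tasks/W, 300 J
```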
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
You don't think perf/watt has anything to do with battery life? How in the world could you reach that conclusion??
Not at peak. A core with a better power gating algorithm will have better perf/watt in an overall use case than at full-blast performance. Also, battery capacity keeps improving, and that is unrelated to perf/watt.
 

Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
Okay, so perhaps the TPU MT core numbers were off (most likely they did not know they had an active P core).

So let's use ST to test performance per area at stock speeds. I would also like to compare it with the Apple M1 Firestorm performance core and AMD Zen 3 for reference.



Intel Golden Cove core with L2$ as measured by Locuza is 7.04 mm2 and it gets 1,937 points in GB5
Apple Firestorm core with L2$ measured by Semianalysis is 3.83 mm2 and gets 1,745 points in GB5
AMD Zen 3 core with L2$ as measured by Locuza is 4.27 mm2 and it gets 1,506 points in GB5
Intel Gracemont core with L2$ as measured by Locuza is 2.19 mm2 and gets 1,168 in GB5


Performance/mm2

1st place is Intel Gracemont core with 532 Geekbench5 points per mm2
2nd place is Apple Firestorm core with 455.6 Geekbench5 points per mm2
3rd place is AMD Zen3 core with 352.7 Geekbench5 points per mm2
4th place is Intel Golden Cove with 275.1 Geekbench5 points per mm2

I left out the L3$ because Apple Firestorm lacks an L3$, and the L3$ along with the ring bus are huge in recent Intel CPU uArchs, which would skew the numbers.

Using an ST test completely ignores the SMT of any cores that have it.
Not a true representation of GC.
 

LightningZ71

Golden Member
Mar 10, 2017
1,628
1,898
136
Same could be said of other cores as well, or? Each of those cores had access to full resources that won't be available if other cores are engaged.
With respect to accessing main memory, yes, all cores are resource limited. With respect to the architecture of the whole processor, the E cores, when under "full" load, all have to fight for access to their shared L2 pool through a limited pipe, and all have to go through the same ring stop (in a 4-core complex) to reach the memory controller. The P cores each have a private ring stop and their own private, though smaller, L2 pool.

The P cores have consistent and lower resource contention. The E cores have variable and higher resource contention that is hidden in ST scenarios.
 
  • Like
Reactions: coercitiv

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
Using an ST test completely ignores the SMT of any cores that have it.
Not a true representation of GC.
Just add the additional 30% performance increase from HT to them, like WCCFTECH did; it makes Golden Cove 30% more efficient per mm2, but it still falls short of the group.

I did.

Intel Gracemont core 1295 points / 2.2 mm2 : 588 points per mm2

AMD Zen3 core with SMT 1997 points / 4.2 mm2 : 475 points per mm2

Apple Firestorm core 1521 points / 3.76 mm2 : 404 points per mm2

Intel Golden Cove core with HT(1C/2T) 2600 points / 7.04 mm2 : 369 per mm2
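If anyone wants to redo the division with their own area estimates, here's a quick sketch of the arithmetic using the numbers above (scores with the ~30% HT/SMT uplift already applied where relevant):

```python
# Score divided by core+L2 area (mm^2); numbers taken from the post above,
# area estimates per the Locuza / SemiAnalysis die measurements cited earlier.
cores = {
    "Gracemont":           (1295, 2.2),
    "Zen 3 (SMT)":         (1997, 4.2),
    "Firestorm":           (1521, 3.76),
    "Golden Cove (1C/2T)": (2600, 7.04),
}

ranked = sorted(cores.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (score, area) in ranked:
    print(f"{name:20s} {score / area:6.1f} points/mm^2")
```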



 

coercitiv

Diamond Member
Jan 24, 2014
6,205
11,916
136
Intel Gracemont core 1295 points / 2.2 mm2 : 588 points per mm2

AMD Zen3 core with SMT 1997 points / 4.2 mm2 : 475 points per mm2

Apple Firestorm core 1521 points / 3.76 mm2 : 404 points per mm2

Intel Golden Cove core with HT(1C/2T) 2600 points / 7.04 mm2 : 369 per mm2
Great, now do Icestorm:
The two E cores in an M1 Pro, when at 100% active residency and maximum frequency can outperform a single P core at 100% active residency and maximum frequency, while using one fifth of the power.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
You don't think perf/watt has anything to do with battery life? How in the world could you reach that conclusion??

It has an effect, but it's not as big as you think, as demonstrated by the fact that Intel's own Pentium Silver platform does just as well, and ARM chips, even back when they were much slower, significantly outperformed it in battery life.

The HUGI ("hurry up and get idle") concept is a very simplistic layman's way of understanding battery life; it's one factor, and something that was most relevant in the mid-2000s.

Scenario 1: 1W idle, 20W load, 5% active = total power 2W(1+1)
Scenario 2: 1W idle, 20W load, 4% active = total power 1.8W(1+0.8)
Scenario 3: 6W idle, 20W load, 5% active = total power 7W(6+1)
Scenario 4: 6W idle, 20W load, 4% active = total power 6.8W(6+0.8)
Scenario 5: 6W idle, 6W load, 10% active = total power 6.6W(6+0.6)

(By idle I mean idle in actual workload not when the system is literally doing nothing)

When the idle is really low, improving perf/watt by 20% results in battery life improvement of 10%. At high idle, the same results in only 3% gains. Also in the case where the CPU is low power and has 33% higher perf/watt, the battery life gain is only 6%.

It makes whatever efficiency gain there is in the E cores nearly irrelevant, since a 33% perf/watt advantage translates into only 6% more battery life.
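As a minimal sketch of the simple model behind those scenarios (average power taken as idle power plus active-fraction times load power, with battery life scaling inversely):

```python
# Reproduces the scenario math above: total power = idle + active_fraction * load.
def avg_power(idle_w, load_w, active_frac):
    return idle_w + active_frac * load_w

s1 = avg_power(1, 20, 0.05)   # 2.0 W
s2 = avg_power(1, 20, 0.04)   # 1.8 W
s3 = avg_power(6, 20, 0.05)   # 7.0 W
s4 = avg_power(6, 20, 0.04)   # 6.8 W
s5 = avg_power(6,  6, 0.10)   # 6.6 W

print(f"low idle, better perf/W:  {s1 / s2 - 1:.0%} more battery life")   # ~11%
print(f"high idle, better perf/W: {s3 / s4 - 1:.0%} more battery life")   # ~3%
print(f"high idle, low-power CPU: {s3 / s5 - 1:.0%} more battery life")   # ~6%
```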

Also, there are lots of tasks where the HUGI concept doesn't apply, like when you are playing games, or in the most real-world case where people simply try to do more. Under such heavy load, sure, the higher perf/watt system is faster (since they have the same TDP), but you end up with the same miserable 2 hours of battery life.

So Intel's transistors are leakier than AMD's? Or is this a power gating issue that started with Icelake and so far hasn't been fixed?

Did you miss my previous post? Intel has the PCH on package.

AMD has had an on-die PCH since Carrizo or so. Until recently they were very behind in power management, but they no longer are, and the on-die PCH advantages are going to start to show. Like I said, it's an enabler. You still need to work at it to get it working properly.

As much as Skylake derivatives have become a meme for being with us for so long, Skylake and Kabylake were the last generations where we had any battery life gains. I assure you ICL/TGL is possibly a regression over Cometlake, and at best equal.

I see this as similar to how Intel stuck with the GTL+ bus introduced with the Pentium Pro until 2008. Sure, they got it to 800 MHz and all that. But while their tiny, underfunded direct competitor was building HyperTransport and an integrated memory controller, it took Intel that long to get off the FSB train.

Perhaps it's a blessing in disguise that the previous crappy management put the fab side at risk. See, I assume using the PCH to fill fabs was a big, big thing for Intel. It's a short-sighted decision because you end up with a subpar product --> lower revenues --> risk falling behind in fabrication development.

The external PCH also makes interfacing with the server and desktop markets easier - you just pair the CPU with a bigger one. That's also why I assume they waited until EMIB and Foveros to emulate the on-die connection.

I know they have the technical tour-de-force to outcompete everyone. Nehalem pummelled others in server. It just takes them forever.
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136

That link you posted is very interesting, but we have to base our performance/area numbers on the data that is currently available and has been measured (or extrapolated).

For this I am using AnandTech's SPEC2006 ST suite of floating point and integer benchmarks.

SPEC2006 - 453.povray ST

Still, the Gracemont core reigns supreme in performance/area.

First Place:
Intel Gracemont core with L2$ area is 2.2 mm2
tested under SPEC2006 - 453.povray gets 59.50 points so the performance point per area is 27.04

Second Place:
Apple Firestorm core with L2$ area is 3.83 mm2 tested under SPEC2006 - 453.povray gets 88.80 points so the performance points per area is 23.18

Third Place:
Intel Golden Cove core with L2$ area is 7.04 mm2. Tested under SPEC2006 - 453.povray gets 117.7 points so the performance points per area is 16.71

Fourth Place:
Apple Icestorm core with L2$ area is 1.445 mm2. Tested under SPEC2006 - 453.povray gets 23.72 points so the performance points per area is 16.41


I will continue this in the Performance/Area thread; I don't want to derail this one.
 

Henry swagger

Senior member
Feb 9, 2022
371
239
86
That link you posted is very interesting, but we have to base our performance/area numbers on the data that is currently available and has been measured (or extrapolated).

For this I am using AnandTech's SPEC2006 ST suite of floating point and integer benchmarks.

SPEC2006 - 453.povray ST

Still, the Gracemont core reigns supreme in performance/area.

First Place:
Intel Gracemont core with L2$ area is 2.2 mm2
tested under SPEC2006 - 453.povray gets 59.50 points so the performance point per area is 27.04

Second Place:
Apple Firestorm core with L2$ area is 3.83 mm2 tested under SPEC2006 - 453.povray gets 88.80 points so the performance points per area is 23.18

Third Place:
Intel Golden Cove core with L2$ area is 7.04 mm2. Tested under SPEC2006 - 453.povray gets 117.7 points so the performance points per area is 16.71

Fourth Place:
Apple Icestorm core with L2$ area is 1.445 mm2. Tested under SPEC2006 - 453.povray gets 23.72 points so the performance points per area is 16.41


I will continue this in the Performance/Area thread; I don't want to derail this one.
Will Gracemont get more powerful since it's getting more L2 cache for Raptor Lake? I think so.
 

Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
I would say so. Having double the L2$ will increase performance in apps that can take advantage of a larger L2$.

I'm going to write it one more time just for clarity.
I computed the performance of Gracemont in CB R23 by first getting the P+E score. Next I turned off the E cores in the BIOS and ran the same test with only the P cores. Subtracting P from P+E gave the E score.

Results
At equal clocks, the E cores are 53% more area performant than the P cores.
At 5 GHz P and 3.8 GHz E, the E cores are 16% more area performant than the P cores.
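
For illustration, a minimal sketch of that subtraction method with placeholder scores (not the actual measured numbers), assuming an 8P+8E chip and the core+L2 areas quoted earlier in the thread:

```python
# Placeholder CB R23 scores purely for illustration; plug in real measurements.
p_plus_e_score = 27000   # hypothetical: 8P + 8E enabled
p_only_score   = 17000   # hypothetical: E cores disabled in BIOS, same P clocks
e_score = p_plus_e_score - p_only_score   # contribution of the 8 E cores

# Area-normalized comparison (core + L2 areas in mm^2, per Locuza's measurements)
p_per_mm2 = p_only_score / (8 * 7.04)
e_per_mm2 = e_score / (8 * 2.2)
print(f"E cores are {e_per_mm2 / p_per_mm2 - 1:.0%} more area performant")
```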
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
I'm going to write it one more time just for clarity.
I computed the performance of Gracemont in CB R23 by first getting the P+E score. Next I turned off the E cores in the BIOS and ran the same test with only the P cores. Subtracting P from P+E gave the E score.

Results
At equal clocks, the E cores are 53% more area performant than the P cores.
At 5 GHz P and 3.8 GHz E, the E cores are 16% more area performant than the P cores.
The issue I have with that is that the P cores have an unfair advantage in that scenario. First, they get to use the L3$ left by the disabled E cores, and the P cores get to use the entire allocated package power, so they get more room (as in power and thermal headroom) to stretch their legs.

Is there any good multi-threaded app we can use that does not lean on extra cache and finishes rather quickly (perhaps Geekbench 5)?
 

Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
The issue I have with that is that the P cores have an unfair advantage in that scenario. First, they get to use the L3$ left by the disabled E cores, and the P cores get to use the entire allocated package power, so they get more room (as in power and thermal headroom) to stretch their legs.

Is there any good multi-threaded app we can use that does not lean on extra cache and finishes rather quickly (perhaps Geekbench 5)?

P cores were locked at the same frequency for both tests so power envelope is not a factor. There was no throttling.
CB is not very dependent on L3.
 
  • Like
Reactions: nicalandia

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Here's an interesting question. In Alder Lake, the Atom cores are the furthest from memory along the ring, right? So does that matter at all?