Question Raptor Lake - Official Thread


Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
Since we already have the first Raptor Lake leak, I'm thinking it should have its own thread.
What do we know so far?
From Anandtech's Intel Process Roadmap articles from July:

Built on Intel 7 with upgraded FinFET
10-15% PPW (performance-per-watt) improvement
Last non-tiled consumer CPU as Meteor Lake will be tiled

I'm guessing this will be a minor update to ADL with just a few microarchitecture changes to the cores. The larger change will be the new process refinement allowing 8+16 at the top of the stack.

Will it work with current Z690 motherboards? If so, that could be a major selling point for people to move to ADL now rather than wait.
 
  • Like
Reactions: vstar

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
While that certainly paints an "optimum case scenario" for Gracemont, it does so largely by completely eliminating the design trade-offs of using a "quad" of E-cores with a shared, theoretically bandwidth-limited L2 cache segment and a single, shared ring stop. As we have seen in other benches, attempting to fully load the E cores exposes those trade-offs and reduces the throughput of those cores, reducing their effective performance per area.

In other words, using ST numbers for the E cores distorts the real picture in heavy MT scenarios.
I agree, but since fully disabling the P cores and their L3$ is currently impossible, and I don't think we will ever see an E-core-only CPU, this is the best-case scenario for Gracemont.
 

jpiniero

Lifer
Oct 1, 2010
14,605
5,223
136
I agree, but since fully disabling the P cores and their L3$ is currently impossible, and I don't think we will ever see an E-core-only CPU, this is the best-case scenario for Gracemont.

There is an Alder Lake N die coming. I am still assuming that it is 0+4 but it could be 1+4.
 

Zucker2k

Golden Member
Feb 15, 2006
1,810
1,159
136
While that certainly paints an "optimum case scenario" for Gracemont, it does so largely by completely eliminating the design trade-offs of using a "quad" of E-cores with a shared, theoretically bandwidth-limited L2 cache segment and a single, shared ring stop. As we have seen in other benches, attempting to fully load the E cores exposes those trade-offs and reduces the throughput of those cores, reducing their effective performance per area.

In other words, using ST numbers for the E cores distorts the real picture in heavy MT scenarios.
Couldn't the same be said of the other cores as well? Each of those cores had access to full resources that won't be available when other cores are engaged.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
@igor_kavinski You are right, but you're missing my point. 25-30 hours of idle should translate into 15+ hours of web browsing, not 8-9. Currently there's a 50%+ gap in battery life per Wh compared to ARM-based chips.

Since Icelake, idle battery life has been in the same ballpark as ARM platforms, but there's a huge gap in bursty workloads like web browsing. The cause of most of the gap is pretty obvious - the Intel platform cannot idle in between workloads as well as the ARM processors do.

Also, the pre-Icelake Intel platforms scored a lot lower in idle battery tests but do just as well if not better in web browsing, suggesting the idle power numbers for Icelake and Tigerlake are almost irrelevant. Comparison to AMD shows the same: idle battery life for AMD tests lower, but web browsing battery life is equal or higher. Who cares about idle battery life? No one buys a laptop to keep it idle. Idle only matters if it contributes to bursty workload battery life, which in this case it doesn't.

You quoted software optimizations but they mean little if the hardware is deficient.

How do you see integrating the PCH into the CPU die helping performance or battery life in any significant way? Intel is disaggregating the CPU die into CPU/GPU/SoC tiles for Meteor Lake, not integrating the PCH. Intel is doing the exact opposite of what you're suggesting.

You are right, it's not on-die, but technologies like Foveros significantly reduce the problems related to multi-chip packaging, things like power use and latency. They won't be at the level of on-die, but close enough that even GPUs will eventually use it to make MCM work for them.

And mind you, the on-package PCH on Haswell allowed 50% gains to happen. I am not saying it's the single source of those gains, certainly not, but without it they likely wouldn't have happened.
 

dullard

Elite Member
May 21, 2001
25,069
3,418
126
You don't think perf/watt has anything to do with battery life? How in the world could you reach that conclusion??
Performance per Watt is tangentially related. The direct measurement is Joules per task (or tasks per joule).

Suppose the user woke up the computer to do 1000 tasks. Suppose computer (A) used 25 W of power for 10 seconds then went back to sleep, and computer (B) used 15 W of power for 20 seconds then went back to sleep. Both computers completed the assigned tasks. Computer (A) had a performance per Watt of 1000 tasks / 25 W = 40 tasks/W. Computer (B) had a performance per Watt of 1000 tasks / 15 W = 66.7 tasks/W. So you could conclude that computer (B) has the higher performance per Watt.

But computer (A) did all the tasks using 25 W * 10 s = 250 Joules, while computer (B) required 15 W * 20 s = 300 Joules. Computer (B) had the higher performance per Watt, yet it drained the battery more.
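To make the distinction concrete, here is a minimal sketch of the same arithmetic (the same hypothetical numbers as above):

```python
# Same hypothetical numbers as above: perf/W ranks computer B higher,
# but energy per task (what actually drains the battery) favors computer A.
TASKS = 1000

def perf_per_watt(tasks, watts):
    return tasks / watts      # tasks per Watt

def energy_joules(watts, seconds):
    return watts * seconds    # Joules drawn from the battery

# Computer A: 25 W for 10 s; computer B: 15 W for 20 s
print(perf_per_watt(TASKS, 25), energy_joules(25, 10))  # 40.0 tasks/W, 250 J
print(perf_per_watt(TASKS, 15), energy_joules(15, 20))  # ~66.7 tasks/W, 300 J
```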
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
You don't think perf/watt has anything to do with battery life? How in the world could you reach that conclusion??
Not at peak. A core with a better power gating algorithm will have better perf/watt in an overall use case than at full-blast performance. Also, battery capacity keeps improving, and that is unrelated to perf/watt.
 

Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
Okay, so perhaps the TPU MT core numbers were off (most likely they did not know they had an active P core).

So let's use ST to test performance per area at stock speeds. I would also like to compare it with the Apple M1 Firestorm performance core and AMD Zen 3 for reference.



Intel Golden Cove core with L2$ as measured by Locuza is 7.04 mm2 and it gets 1,937 points in GB5
Apple Firestorm core with L2$ measured by Semianalysis is 3.83 mm2 and gets 1,745 points in GB5
AMD Zen 3 core with L2$ as measured by Locuza is 4.27 mm2 and it gets 1,506 points in GB5
Intel Gracemont core with L2$ as measured by Locuza is 2.19 mm2 and gets 1,168 in GB5


Performance/mm2

1st place is Intel Gracemont core with 532 Geekbench5 points per mm2
2nd place is Apple Firestorm core with 455.6 Geekbench5 points per mm2
3rd place is AMD Zen3 core with 352.7 Geekbench5 points per mm2
4th place is Intel Golden Cove with 275.1 Geekbench5 points per mm2

I left out the L3$ because Apple Firestorm lacks an L3$, and the L3$ along with the ring bus are huge in recent Intel CPU uArchs, which would skew the numbers.

Using an ST test completely ignores the SMT of any cores that have it.
Not a true representation of GC.
 

LightningZ71

Golden Member
Mar 10, 2017
1,628
1,898
136
Same could be said of other cores as well, or? Each of those cores had access to full resources that won't be available if other cores are engaged.
With respect to accessing main memory, yes, all cores are resource limited. With respect to the architecture of the whole processor, the E cores, when under "full" load, all have to fight for access to their shared L2 pool through a limited pipe, and all have to go through the same ring stop (in a 4-core complex) to reach the memory controller. The P cores each have a private ring stop and their own private, though smaller, L2 pool.

The P cores have consistent and lower resource contention. The E cores have variable and higher resource contention that is hidden in ST scenarios.
 
  • Like
Reactions: coercitiv

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
Using an ST test completely ignores the SMT of any cores that have it.
Not a true representation of GC.
Just add the additional 30% performance increase from HT to them, like WCCFTECH did; it makes Golden Cove 30% more efficient per mm2, but it still falls short of the group.

I did.

Intel Gracemont core 1295 points / 2.2 mm2 : 588 points per mm2

AMD Zen3 core with SMT 1997 points / 4.2 mm2 : 475 points per mm2

Apple Firestorm core 1521 points / 3.76 mm2 : 404 points per mm2

Intel Golden Cove core with HT(1C/2T) 2600 points / 7.04 mm2 : 369 per mm2
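If anyone wants to redo the division with their own area estimates, here's a quick sketch of the arithmetic using the numbers above (scores with the ~30% HT/SMT uplift already applied where relevant):

```python
# Score divided by core+L2 area (mm^2); numbers taken from the post above,
# area estimates per the Locuza / SemiAnalysis die measurements cited earlier.
cores = {
    "Gracemont":           (1295, 2.2),
    "Zen 3 (SMT)":         (1997, 4.2),
    "Firestorm":           (1521, 3.76),
    "Golden Cove (1C/2T)": (2600, 7.04),
}

ranked = sorted(cores.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (score, area) in ranked:
    print(f"{name:20s} {score / area:6.1f} points/mm^2")
```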



 

coercitiv

Diamond Member
Jan 24, 2014
6,205
11,916
136
Intel Gracemont core 1295 points / 2.2 mm2 : 588 points per mm2

AMD Zen3 core with SMT 1997 points / 4.2 mm2 : 475 points per mm2

Apple Firestorm core 1521 points / 3.76 mm2 : 404 points per mm2

Intel Golden Cove core with HT(1C/2T) 2600 points / 7.04 mm2 : 369 per mm2
Great, now do Icestorm:
The two E cores in an M1 Pro, when at 100% active residency and maximum frequency can outperform a single P core at 100% active residency and maximum frequency, while using one fifth of the power.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
You don't think perf/watt has anything to do with battery life? How in the world could you reach that conclusion??

It has an effect, but it's not as big as you think, as demonstrated by the fact that Intel's own Pentium Silver platform does just as well, and ARM chips, even back when they were much slower, significantly outperformed it in battery life.

The HUGI ("hurry up and get idle") concept is a very simplistic layman's way of understanding battery life; it's one factor, and something that was most relevant in the mid-2000s.

Scenario 1: 1W idle, 20W load, 5% active = total power 2W(1+1)
Scenario 2: 1W idle, 20W load, 4% active = total power 1.8W(1+0.8)
Scenario 3: 6W idle, 20W load, 5% active = total power 7W(6+1)
Scenario 4: 6W idle, 20W load, 4% active = total power 6.8W(6+0.8)
Scenario 5: 6W idle, 6W load, 10% active = total power 6.6W(6+0.6)

(By idle I mean idle in actual workload not when the system is literally doing nothing)

When the idle is really low, improving perf/watt by 20% results in battery life improvement of 10%. At high idle, the same results in only 3% gains. Also in the case where the CPU is low power and has 33% higher perf/watt, the battery life gain is only 6%.

It makes whatever efficiency gain there is in the E cores nearly irrelevant, since a 33% perf/watt advantage translates into only 6% more battery life.
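As a minimal sketch of the simple model behind those scenarios (average power taken as idle power plus active-fraction times load power, with battery life scaling inversely):

```python
# Reproduces the scenario math above: total power = idle + active_fraction * load.
def avg_power(idle_w, load_w, active_frac):
    return idle_w + active_frac * load_w

s1 = avg_power(1, 20, 0.05)   # 2.0 W
s2 = avg_power(1, 20, 0.04)   # 1.8 W
s3 = avg_power(6, 20, 0.05)   # 7.0 W
s4 = avg_power(6, 20, 0.04)   # 6.8 W
s5 = avg_power(6,  6, 0.10)   # 6.6 W

print(f"low idle, better perf/W:  {s1 / s2 - 1:.0%} more battery life")   # ~11%
print(f"high idle, better perf/W: {s3 / s4 - 1:.0%} more battery life")   # ~3%
print(f"high idle, low-power CPU: {s3 / s5 - 1:.0%} more battery life")   # ~6%
```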

Also, there are lots of tasks where the HUGI concept doesn't apply, like when you are playing games, or in the most real-world case where people simply try to do more. Under such heavy load, sure, the higher perf/watt system is faster (since they have the same TDP), but you end up with the same miserable 2 hours of battery life.

So Intel's transistors are leakier than AMD's? Or is this a power gating issue that started with Icelake and so far hasn't been fixed?

Did you miss my previous post? Intel has the PCH on package.

AMD has had an on-die PCH since Carrizo or so. Until recently they were very behind in power management, but they no longer are, and the on-die PCH advantages are going to start to show. Like I said, it's an enabler. You still need to work at it to get it working properly.

As much as Skylake derivatives have become a meme for being with us for so long, Skylake and Kabylake were the last generations where we had any battery life gains. I assure you ICL/TGL is possibly a regression over Cometlake, and at best equal.

I see this as similar to how Intel stuck with the GTL+ bus introduced with the Pentium Pro until 2008. Sure, they got it to 800 MHz and all that. But while their tiny, underfunded direct competitor was building HyperTransport and an integrated memory controller, it took Intel that long to get off the FSB train.

Perhaps it's a blessing in disguise that the previous crappy management put the fab side at risk. See, I assume using the PCH to fill fabs was a big, big thing for Intel. It's a short-sighted decision because you end up with a subpar product --> lower revenues --> risk falling behind in fabrication development.

The external PCH also makes interfacing with the server and desktop markets easier - you just pair the CPU with a bigger one. That's also why I assume they waited until EMIB and Foveros to emulate the on-die connection.

I know they have the technical tour-de-force to outcompete everyone. Nehalem pummelled others in server. It just takes them forever.
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136

That link you posted is very interesting, but we have to base our performance/area numbers on the data that is currently available and has been measured (or extrapolated).

For this I am using AnandTech's SPEC2006 ST suite of floating point and integer benchmarks.

SPEC2006 - 453.povray ST

Still, the Gracemont core reigns supreme in performance/area.

First Place:
Intel Gracemont core with L2$ area is 2.2 mm2
tested under SPEC2006 - 453.povray gets 59.50 points so the performance point per area is 27.04

Second Place:
Apple Firestorm core with L2$ area is 3.83 mm2 tested under SPEC2006 - 453.povray gets 88.80 points so the performance points per area is 23.18

Third Place:
Intel Golden Cove core with L2$ area is 7.04 mm2. Tested under SPEC2006 - 453.povray gets 117.7 points so the performance points per area is 16.71

Fourth Place:
Apple Icestorm core with L2$ area is 1.445 mm2. Tested under SPEC2006 - 453.povray gets 23.72 points so the performance points per area is 16.41


I will continue this in the Performance/Area thread; I don't want to derail this one.
 

Henry swagger

Senior member
Feb 9, 2022
371
239
86
That link you posted is very interesting, but we have to base our performance/area numbers on the data that is currently available and has been measured (or extrapolated).

For this I am using AnandTech's SPEC2006 ST suite of floating point and integer benchmarks.

SPEC2006 - 453.povray ST

Still, the Gracemont core reigns supreme in performance/area.

First Place:
Intel Gracemont core with L2$ area is 2.2 mm2
tested under SPEC2006 - 453.povray gets 59.50 points so the performance point per area is 27.04

Second Place:
Apple Firestorm core with L2$ area is 3.83 mm2 tested under SPEC2006 - 453.povray gets 88.80 points so the performance points per area is 23.18

Third Place:
Intel Golden Cove core with L2$ area is 7.04 mm2. Tested under SPEC2006 - 453.povray gets 117.7 points so the performance points per area is 16.71

Fourth Place:
Apple Icestorm core with L2$ area is 1.445 mm2. Tested under SPEC2006 - 453.povray gets 23.72 points so the performance points per area is 16.41


I will continue this in the Performance/Area thread; I don't want to derail this one.
Will Gracemont get more powerful since it's getting more L2 cache for Raptor Lake? I think so.
 

Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
I would say so. Having double the L2$ will increase performance in apps that can take advantage of a larger L2$.

I'm going to write it one more time just for clarity.
I computed the performance of Gracemont in CB R23 by first getting the P+E score. Next I turned off the E cores in the BIOS and ran the same test with only the P cores. Subtracting P from P+E gave the E score.

Results
At equal clocks, the E cores are 53% more area performant than the P cores.
At 5 GHz P and 3.8 GHz E, the E cores are 16% more area performant than the P cores.
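
For illustration, a minimal sketch of that subtraction method with placeholder scores (not the actual measured numbers), assuming an 8P+8E chip and the core+L2 areas quoted earlier in the thread:

```python
# Placeholder CB R23 scores purely for illustration; plug in real measurements.
p_plus_e_score = 27000   # hypothetical: 8P + 8E enabled
p_only_score   = 17000   # hypothetical: E cores disabled in BIOS, same P clocks
e_score = p_plus_e_score - p_only_score   # contribution of the 8 E cores

# Area-normalized comparison (core + L2 areas in mm^2, per Locuza's measurements)
p_per_mm2 = p_only_score / (8 * 7.04)
e_per_mm2 = e_score / (8 * 2.2)
print(f"E cores are {e_per_mm2 / p_per_mm2 - 1:.0%} more area performant")
```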
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
I'm going to write it one more time just for clarity.
I computed the performance of Gracemont in CB R23 by first getting the P+E score. Next I turned off the E cores in the BIOS and ran the same test with only the P cores. Subtracting P from P+E gave the E score.

Results
At equal clocks, the E cores are 53% more area performant than the P cores.
At 5 GHz P and 3.8 GHz E, the E cores are 16% more area performant than the P cores.
The issue I have with that is that the P cores have an unfair advantage in that scenario. First, they get to use the L3$ left by the disabled E cores, and the P cores get to use the entire allocated package power, so they get more room (as in power and thermal headroom) to stretch their legs.

Is there any good multi-threaded app we can use that does not lean on extra cache and finishes rather quickly (perhaps Geekbench 5)?
 

Hulk

Diamond Member
Oct 9, 1999
4,227
2,015
136
The issue I have with that is that the P cores have an unfair advantage in that scenario. First, they get to use the L3$ left by the disabled E cores, and the P cores get to use the entire allocated package power, so they get more room (as in power and thermal headroom) to stretch their legs.

Is there any good multi-threaded app we can use that does not lean on extra cache and finishes rather quickly (perhaps Geekbench 5)?

P cores were locked at the same frequency for both tests so power envelope is not a factor. There was no throttling.
CB is not very dependent on L3.
 
  • Like
Reactions: nicalandia

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Here's an interesting question. In Alder Lake, the Atom cores are the furthest from memory along the ring, right? So does that matter at all?