Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022



As Hot Chips 34 starts this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024, according to Intel's roadmap. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, called RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

| Model | Code Name | Date | TDP | Node | Tiles | Main Tile | CPU | LP E-Cores | LLC | GPU | Xe-cores |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Core Ultra 100U | Meteor Lake | Q4 2023 | 15 - 57 W | Intel 4 + N5 + N6 | 4 | tCPU | 2P + 8E | 2 | 12 MB | Intel Graphics | 4 |
| ? | Lunar Lake | Q4 2024 | 17 - 30 W | N3B + N6 | 2 | CPU + GPU & IMC | 4P + 4E | 0 | 8 MB | Arc | 8 |
| ? | Panther Lake | Q1 2026 ? | ? | Intel 18A + N3E | 3 | CPU + MC | 4P + 8E | 4 | ? | Arc | 12 |



Comparison of die sizes of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake (tile sizes in mm², known only for Meteor Lake)

| | Meteor Lake | Arrow Lake (20A) | Arrow Lake (N3B) | Lunar Lake | Panther Lake |
|---|---|---|---|---|---|
| Platform | Mobile H/U Only | Desktop Only | Desktop & Mobile H & HX | Mobile U Only | Mobile H |
| Process Node | Intel 4 | Intel 20A | TSMC N3B | TSMC N3B | Intel 18A |
| Date | Q4 2023 | Q1 2025 ? | Desktop: Q4 2024, H & HX: Q1 2025 | Q4 2024 | Q1 2026 ? |
| Full Die | 6P + 8E | 6P + 8E ? | 8P + 16E | 4P + 4E | 4P + 8E |
| LLC | 24 MB | 24 MB ? | 36 MB ? | 12 MB | ? |
| tCPU (mm²) | 66.48 | | | | |
| tGPU (mm²) | 44.45 | | | | |
| SoC (mm²) | 96.77 | | | | |
| IOE (mm²) | 44.45 | | | | |
| Total (mm²) | 252.15 | | | | |


Intel Core Ultra 100 - Meteor Lake


As mentioned by Tomshardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)




Hulk

Diamond Member
Oct 9, 1999
Exactly though. It is not the direct cost but indirect costs. Look at how transistor count ballooned with both Willamette and Prescott.

Here's a simple fact: additional stages are a guaranteed loss in performance, while speculation and prediction are a maybe. Of course there's a tradeoff, but Apple (and Cortex too) demonstrates that there's still an obsessive focus on clocks that needs to be cut back.

Pentium was a 5 stage pipeline CPU. Pentium Pro to Pentium III was a 12-stage pipeline: https://arstechnica.com/features/2004/07/pentium-1/

It was 14 on Conroe but extended to 16 on Nehalem.

The uop cache adds 2 cycles on a miss but is a few cycles shorter on a hit. So let's say 13/14 to 18.

Golden Cove added an extra stage. So 14/15 to 19.

Intel's E core architectures are 14 stages, same as large cores without the uop cache. Intel claimed 80% hit on the uop cache with Sandy Bridge but C&C testing shows Zen 2's 4K uop cache has an average hit of 60%.
6 stages for Pentium MMX, right?
 
  • Like
Reactions: TwistedAndy
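The stage counts in this post are effectively an expected value over uop-cache hits and misses. A minimal sketch of that weighted average, using the ballpark stage counts and hit rates quoted above (illustrative figures, not measurements):

```python
# Effective front-end pipeline depth as a weighted average of the
# uop-cache hit path and the legacy-decode miss path. Stage counts
# and hit rates are the rough numbers quoted in this thread.

def effective_pipeline(hit_len: float, miss_len: float, hit_rate: float) -> float:
    """Expected pipeline depth given a uop-cache hit rate."""
    return hit_rate * hit_len + (1.0 - hit_rate) * miss_len

# Intel's claimed ~80% uop-cache hit rate (Sandy Bridge era):
print(round(effective_pipeline(14, 18, 0.80), 1))  # 14.8 stages on average

# Chips and Cheese's ~60% average hit rate measured on Zen 2's 4K uop cache:
print(round(effective_pipeline(14, 18, 0.60), 1))  # 15.6 stages on average
```

The point the weighted average makes concrete: a lower hit rate quietly lengthens the pipeline you pay for on every mispredict.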

Doug S

Platinum Member
Feb 8, 2020
Intel and AMD use different approaches. They are running on high clocks already but process ~6 uops per cycle on average in the current generation (Zen 4 and Raptor Cove). The most obvious step is to widen the execution. Skymont and Lion Cove are targeted to execute ~8 uops per cycle on average (nearly the same amount as Apple Silicon) while running on higher clock speeds.

I think we will see something similar with Zen 5. Probably, they increased the average throughput from 6 to 7 uops to get a 16% IPC boost.


What in the heck are you talking about here? You obviously have zero understanding if you think anyone is processing "6 uops per cycle on average". No one is remotely close to that, and probably won't be in our lifetimes.
 

TwistedAndy

Member
May 23, 2024
What in the heck are you talking about here? You obviously have zero understanding if you think anyone is processing "6 uops per cycle on average". No one is remotely close to that, and probably won't be in our lifetimes.
Please note, that I'm talking about the execution throughput, not the total latency and other questionable things like "pipeline length".

There are reasons why Zen 4, for example, has 6-wide decode queue, 6-wide reordering buffer, and separate 6-wide INT- and FP execution blocks (rename/allocate, schedulers, ports). Obviously, AMD wanted to achieve the parallel execution of 6 operations and made the whole pipeline wide enough to handle it.

As for Skymont and Lion Cove, it's clear that Intel tried to make it wide enough to handle 8 uops in parallel. Probably, AMD had the same targets with Zen 5.
 
  • Like
Reactions: Henry swagger

Doug S

Platinum Member
Feb 8, 2020
Please note, that I'm talking about the execution throughput


So am I. No one, I repeat NO ONE has an average throughput of 6 uops per cycle, or is remotely close. Stop now before you dig yourself even deeper.

Designing a core to be CAPABLE of executing 6 or 8 uops per cycle (when you're going down a steep hill with a strong wind behind you) does not imply the design will come anywhere near AVERAGING 6 or 8 uops per cycle.
 

Nothingness

Diamond Member
Jul 3, 2013
Please note, that I'm talking about the execution throughput, not the total latency and other questionable things like "pipeline length".
Execution throughput is impacted by other parts of the design, in particular latency. You don't get faster by only increasing width; you need to feed your core with instructions and data which is why branch prediction (including implied misprediction latency) and data prefetch need to be improved to benefit from the increased width of other parts of the core. It's an art of balance and fine tuning, and no single feature can explain performance improvements.

And about the point @Doug S made: outside of micro benchmarks (tight loops with no misses), no CPU runs at 6 uops per cycle on average. Pick any reasonably sized program, run it, count the number of instructions and cycles (there are tools for that), and you'll get the IPC (the real IPC, not the perf/clock figure we abusively call IPC).
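The measurement described here can be sketched as follows. The counter values are invented placeholders chosen to echo the figures discussed in this thread; on Linux, real numbers would come from something like `perf stat -e instructions,cycles ./your_app`:

```python
# Real IPC is retired instructions divided by elapsed core cycles over a
# whole program run. Counter values below are hypothetical placeholders.

def ipc(instructions_retired: int, core_cycles: int) -> float:
    return instructions_retired / core_cycles

# Hypothetical whole-program counters for a branchy, cache-missy workload
# (the kind of figure Chips and Cheese reports for games):
print(round(ipc(45_000_000_000, 50_000_000_000), 2))  # 0.9

# Hypothetical tight-loop micro benchmark saturating a 6-wide core:
print(round(ipc(57_000_000_000, 10_000_000_000), 2))  # 5.7
```

Same formula, wildly different results: which is exactly why a micro benchmark number and a whole-application number must not be conflated.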
 

TwistedAndy

Member
May 23, 2024
So am I. No one, I repeat NO ONE has an average throughput of 6 uops per cycle, or is remotely close. Stop now before you dig yourself even deeper.

So we have a weird situation when dumb engineers at AMD, Intel, Apple, and other companies are wasting a lot of the core area on 6-8-wide reordering buffers, 6-8 wide schedulers, wide decoders, dozens of execution ports instead of focusing on the really important things like lowering the L1-cache latency, decreasing the pipeline length, and decreasing clocks ;)

Execution throughput is impacted by other parts of the design, in particular latency.

Execution throughput is impacted mostly by the prediction accuracy. Latency and pipeline length become an issue when a CPU core often misses the execution branch.

The modern CPUs have pretty good branch predictors. For example, Zen 4 is capable of sustaining 6 micro-ops per cycle. The mispredictions cost nearly 10-15% of the throughput, according to Chips and Cheese:


brp_ipc.png

If we take those numbers into account, the actual throughput for Zen 4 in Elder Scrolls is 5.5 uops per cycle. Obviously, it depends on the application, but games are the most branchy ones.

Also, there is another article on the same site with the IPC table:

1718691978222.png
As you may see, AMD Zen 4 and Golden Cove are close to the 6 instructions per cycle (5.7, to be precise) for the most common instructions. Please note that in this table, we have the actual instructions, not uops.
 

Nothingness

Diamond Member
Jul 3, 2013
So we have a weird situation when dumb engineers at AMD, Intel, Apple, and other companies are wasting a lot of the core area on 6-8-wide reordering buffers, 6-8 wide schedulers, wide decoders, dozens of execution ports instead of focusing on the really important things like lowering the L1-cache latency, decreasing the pipeline length, and decreasing clocks ;)
Where did Doug say it's not worth the price? It definitely is worth the cost when the rest of your uarch is good enough. But you won't get 16% IPC increase by only increasing the width from 6 to 7.

Execution throughput is impacted mostly by the prediction accuracy. Latency and pipeline length become an issue when a CPU core often misses the execution branch.

The modern CPUs have pretty good branch predictors. For example, Zen 4 is capable of sustaining 6 micro-ops per cycle. The mispredictions cost nearly 10-15% of the throughput, according to Chips and Cheese:


View attachment 101361

If we take those numbers into account, the actual throughput for Zen 4 in Elder Scrolls is 5.5 uops per cycle. Obviously, it depends on the application, but games are the most branchy ones.
How do you go from an IPC of 0.9 to 5.5 uops per cycle? Don't tell me you multiplied the theoretical max of 6 instructions by 0.9 to get to 5.5?

This clearly shows that a real application has an IPC < 1.

Also, there is another article on the same site with the IPC table:

View attachment 101364
As you may see, AMD Zen 4 and Golden Cove are close to the 6 instructions per cycle (5.7, to be precise) for the most common instructions. Please note that in this table, we have the actual instructions, not uops.
This table is for a micro benchmark that measures rename optimizations. How does it show a non micro benchmark will achieve >5 uops per cycle? I never disputed the claim a CPU can reach its full width at moments. What is wrong is to derive from this that the sustained IPC will increase by 16% if your max uop per cycle goes from 6 to 7 on a real application.
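One crude way to illustrate why a 6-wide core doesn't sustain 6 IPC on real applications is to fold a couple of stall sources into the cycle count. Every rate and penalty below is an assumption picked for illustration, not a measurement of any real core:

```python
# Toy model: sustained IPC = instructions / (issue cycles + stall cycles).
# All rates and penalties are illustrative assumptions.

def sustained_ipc(width, n_instr, branch_frac, mispredict_rate,
                  mispredict_penalty, miss_per_instr, miss_penalty):
    issue_cycles = n_instr / width  # perfect-world cost at full width
    branch_stalls = n_instr * branch_frac * mispredict_rate * mispredict_penalty
    memory_stalls = n_instr * miss_per_instr * miss_penalty
    return n_instr / (issue_cycles + branch_stalls + memory_stalls)

# Assumed 6-wide core: 20% branches, 2% mispredicts costing 15 cycles,
# and one 100-cycle memory stall per 200 instructions:
print(round(sustained_ipc(6, 1_000_000, 0.20, 0.02, 15, 1 / 200, 100), 2))  # ≈ 1.38
```

Even with generous assumptions, a couple of percent of mispredicts plus occasional memory stalls drag a 6-wide machine well below half its nominal width; heavier miss rates push it under 1 IPC, matching the "Achieved IPC" figures quoted above.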
 

TwistedAndy

Member
May 23, 2024
But you won't get 16% IPC increase by only increasing the width from 6 to 7.
You have to increase the pipeline width to achieve higher IPC. In the case of Zen 5, AMD will implement the 8-wide execution pipeline with nearly 7 uops per cycle average execution rate. So, instead of 5.7 IPC for Zen 4 from the table above, we will get 6.6-6.8 IPC in Zen 5.

I think the Intel Lion Cove will show similar numbers.

How do you go from an IPC of 0.9 to 5.5 uops per cycle? Don't tell me you multiplied the theoretical max of 6 instructions by 0.9 to get to 5.5?

The theoretical max for Zen 4 is nearly 12 uops per cycle when both FP and INT schedulers are fully loaded. In the real-case scenario, there will be nearly 6 uops per cycle minus the branch mispredict penalty. That's how we get 5.5 IPC for games and 5.7 for synthetic workloads.

There is no sense in making the reorder buffers and rename/allocate blocks wider than the expected throughput. Their complexity grows exponentially to the number of ports. AMD has made them 6-wide because the expected throughput for Zen 4 was 6 uops per cycle.

This table is for a micro benchmark that measures rename optimizations. How does it show a non micro benchmark will achieve >5 uops per cycle? I never disputed the claim a CPU can reach its full width at moments. What is wrong is to derive from this that the sustained IPC will increase by 16% if your max uop per cycle goes from 6 to 7 on a real application.

Here's another benchmark for Golden Cove:
1718696855705.png

We see the same 5.7 IPC for 6-wide Golden Cove. Note that the results for XOR and SUB are nearly the same as those for MOV.

It's not surprising that we see similar IPC numbers for Zen 4 and Golden Cove, which both are 6-wide architectures. Actually, Zen 3 is also a 6-wide architecture but less optimized for some workloads.

It may sound weird, but engineers at Intel, AMD, Apple, and many other companies are not dumb. There are reasons why Intel, for example, decided to have a 9-wide decoder, an 8-wide multiplexer for uops, and 8 ALUs in Skymont.
 
  • Like
Reactions: Henry swagger

Nothingness

Diamond Member
Jul 3, 2013
You have to increase the pipeline width to achieve higher IPC. In the case of Zen 5, AMD will implement the 8-wide execution pipeline with nearly 7 uops per cycle average execution rate. So, instead of 5.7 IPC for Zen 4 from the table above, we will get 6.6-6.8 IPC in Zen 5.

I think the Intel Lion Cove will show similar numbers.
Yes you have to increase the width, but you also have to improve other areas unrelated to the width.

Third time I write it.

The theoretical max for Zen 4 is nearly 12 uops per cycle when both FP and INT schedulers are fully loaded. In the real-case scenario, there will be nearly 6 uops per cycle minus the branch mispredict penalty. That's how we get 5.5 IPC for games and 5.7 for synthetic workloads.
No you don't get an IPC of 5.5 in games. You seem unable to understand what ChipsAndCheese wrote.

ChipsAndCheese clearly shows the "Achieved IPC" in games is 0.9 and adds "Both games experience low average IPC."

Second time I write it.

There is no sense in making the reorder buffers and rename/allocate blocks wider than the expected throughput. Their complexity grows exponentially to the number of ports. AMD has made them 6-wide because the expected throughput for Zen 4 was 6 uops per cycle.



Here's another benchmark for Golden Cove:
View attachment 101365

We see the same 5.7 IPC for 6-wide Golden Cove. Note that the results for XOR and SUB are nearly the same as those for MOV.

It's not surprising that we see similar IPC numbers for Zen 4 and Golden Cove, which both are 6-wide architectures. Actually, Zen 3 is also a 6-wide architecture but less optimized for some workloads.
This is a micro benchmark to measure rename BW. It only measures that and doesn't represent application IPC.

Second time I write it.

It may sound weird, but engineers at Intel, AMD, Apple, and many other companies are not dumb. There are reasons why Intel, for example, decided to have a 9-wide decoder, an 8-wide multiplexer for uops, and 8 ALUs in Skymont.
Captain obvious.

You seem to not read what people write. For the nth time: micro benchmarks don't represent a full application performance. You even pretend that GB doesn't represent anything and now you exhibit micro benchmarks that just tell nothing about app perf.

Now as you don't seem completely stupid, I wonder where our misunderstanding comes from.
 

DavidC1

Senior member
Dec 29, 2023
Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.
 
  • Love
Reactions: Hulk
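A back-of-envelope version of this estimate can be written down explicitly. The scaling exponents and the order in which factors are stripped out are my assumptions, not necessarily DavidC1's exact method, so the result only lands in the same ballpark as his ~10% figure:

```python
# Back-of-envelope Skymont-cluster clock estimate. Assumptions (mine):
# perf scales as clock^0.9 and as cores^0.9; uarch gain is the claimed 1.38x.

CLOCK_EXP = 0.9         # SpecInt scaling vs clocks, per the post
CORE_EXP = 0.9          # assumed MT scaling exponent vs core count
UARCH_GAIN = 1.38       # Intel's claimed per-clock gain for Skymont
CLAIMED_MT_GAIN = 2.9   # Intel's claimed MT uplift with 2x the cores

# Strip out core-count scaling, then the uarch gain; what remains is clocks.
left_after_cores = CLAIMED_MT_GAIN / (2 ** CORE_EXP)
clock_perf_ratio = left_after_cores / UARCH_GAIN
clock_ratio = clock_perf_ratio ** (1 / CLOCK_EXP)
print(round(clock_ratio, 2))  # ≈ 1.14, i.e. a modest clock bump
```

With these assumed exponents the residual works out to a clock bump on the order of 10-15%, consistent with the post's conclusion that the cluster runs only slightly faster at the same power despite doubling the core count.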

TwistedAndy

Member
May 23, 2024
This is a micro benchmark to measure rename BW. It only measures that and doesn't represent application IPC.

When did ADD and SUB become renamed? Also, MOV-elimination (rename) is not always possible. That's why we have a few cases with MOV in that table.

Estimating clocks for the Skymont cluster in Lunar Lake.

Yep, I think in Lunar Lake the Skymont clocks will be somewhere between 3.0-3.5 GHz with ~1.5-2 W per core. Lion Cove will use 3-4 W per core.
 

SiliconFly

Golden Member
Mar 10, 2023
Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.
4.6GHz is also an unverified leak. I don't think we should get ahead of ourselves. For all we know, it might top out at just 4GHz!
 

DavidC1

Senior member
Dec 29, 2023
Lunar Lake not only achieves a 40% reduction in SoC power, but the memory PHY has also been improved to cut power by 40%.
4.6GHz is also an unverified leak. I don't think we should get ahead of ourselves. For all we know, it might top out at just 4GHz!
Then Lion Cove will clock at 5GHz and be 10-15% slower in ST compared to Raptor Cove. No, I believe it has a high chance of being correct. There's no indication clocks are lower on Skymont.
 

Nothingness

Diamond Member
Jul 3, 2013
When did ADD and SUB become renamed? Also, MOV-elimination (rename) is not always possible. That's why we have a few cases with MOV in that table.
Anything that writes a register needs to be renamed (and on x86 that often requires the allocation of two registers: destination + flags). That's the basis of OoOE.

XOR/SUB reg,reg is subject to zero elimination. This saves allocation bandwidth. This is explained in section 4.1.5 of Intel Optimization Ref Manual (I didn't find anything in the Zen4 manual). Depending on how rename works, this can be implemented by using a physical register which is always zero.

You didn't answer the rest of my post. Nothing to add about games reaching 5.5 IPC?
 
  • Like
Reactions: Hitman928

TwistedAndy

Member
May 23, 2024
Anything that writes a register needs to be renamed (and on x86 that often requires the allocation of two registers: destination + flags). That's the basis of OoOE.

XOR/SUB reg,reg is subject to zero elimination. This saves allocation bandwidth. This is explained in section 4.1.5 of Intel Optimization Ref Manual (I didn't find anything in the Zen4 manual). Depending on how rename works, this can be implemented by using a physical register which is always zero.

You didn't answer to the rest of my post. Nothing to add about games reaching 5.5 IPC?

Zero elimination has its own preconditions. It's not something that takes place all the time. And it's described in the Intel manual you are referring to. That's why there are a few cases with MOV operations in the table.

Also, there are add and subtract operations, which directly involve the whole execution pipeline. As we may see, both Intel Golden Cove and AMD Zen 4 are pretty close to the target execution width in multiple tests (5.7 vs 6 uops per cycle).
 
  • Haha
Reactions: Nothingness

dullard

Elite Member
May 21, 2001
I heard sometime in July for Zen 5, not necessarily as late as the 31st. As for Intel, ARL may be formally announced in October, but the latest rumors I have heard are saying it may be early 2025 for significant availability. Considering the poor availability after the Mountain Lake release, the hype given to Lunar Lake, and the general lack of emphasis on ARL, I don't find that hard to believe at all. Hell, a major laptop manufacturer whose name slips my mind right now is expressing concern that Lunar Lake will not have good availability for the holiday season. I am confused too. I thought ARL was supposed to be the next big thing, while Lunar Lake was a niche product coming after ARL. Now it seems all the emphasis has shifted to LL, not a good sign for ARL, IMO.
Looks like July 15 for the Ryzen 300 Series laptop chips and either preorder or actual availability (unclear) of July 31 for the Ryzen 9000s. Since this is a discussion about Arrow Lake, which is best compared to the Ryzen 9000 line, I used July 31st. https://videocardz.com/newz/amd-ryz...00-sales-start-july-31-according-to-retailers

I don't follow how rumor of one complaint of Lunar Lake availability has much to do with Arrow Lake's release date and/or availability.

What is Mountain Lake?

Lunar Lake and Arrow Lake are totally different market segments. And Lunar Lake is launching sooner than Arrow Lake. So, why does focus on the sooner launching ultralight notebook Lunar Lake have anything to say about higher powered Arrow Lake desktop and higher power mobile chips?

If I follow your logic, that means if Ford promotes its upcoming redesigned Mustang released during a summer, then an F150 released later in the fall is both bad and delayed until winter?
 

TwistedAndy

Member
May 23, 2024
Lunar Lake and Arrow Lake are totally different market segments. And Lunar Lake is launching sooner than Arrow Lake. So, why does focus on the sooner launching ultralight notebook Lunar Lake have anything to say about higher powered Arrow Lake desktop and higher power mobile chips?

If I follow your logic, that means if Ford promotes its upcoming redesigned Mustang released during a summer, then an F150 released later in the fall is both bad and delayed until winter?

There's a lot of attention paid to Lunar Lake because it clearly shows what to expect from Arrow Lake, which has mostly the same P- and E-cores.

We will get more details about Arrow Lake only in August.

Intel usually announces new desktop CPUs in October. As for mobile CPUs, they are announced at CES in January. There's no reason to think that this year will be different.
 

Hulk

Diamond Member
Oct 9, 1999
Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.
I really like how you backed into Lunar Lake clocks for Skymont by using Intel peak-to-peak performance versus Crestmont, with Crestmont clocks being known. Very clever!
 

dullard

Elite Member
May 21, 2001
There's a lot of attention paid to Lunar Lake because it clearly shows what to expect from Arrow Lake, which has mostly the same P- and E-cores.

We will get more details about Arrow Lake only in August.

Intel usually announces new desktop CPUs in October. As for mobile CPUs, they are announced at CES in January. There's no reason to think that this year will be different.
There is attention for performance reasons. But performance wasn't the discussion point. See this quote below and notice how it was NOT about performance but about release dates. I still have yet to see why the release date of one product and availability of that product are reliant on an unrelated product.
As for Intel, ARL may be formally announced in October, but the latest rumors I have heard are saying it may be early 2025 for significant availability....Hell, a major laptop manufacturer whose name slips my mind right now is expressing concern that Lunar Lake will not have good availability for the holiday season
As for desktop Arrow Lake CPUs in October, that is exactly what I said earlier. https://forums.anandtech.com/thread...akes-discussion-threads.2606448/post-41232851
 

dullard

Elite Member
May 21, 2001
I'm always sure and never wrong. Never!!111 :p

There are, obviously, some differences, like the amount of cache, the way the E-cores are connected, and the SLC, but the cores in Arrow Lake will be mostly the same as in Lunar Lake.

Probably there will be HT support for S and HX, but that's just rumours.
Add to your list:

1) Different memory types, speeds, and different memory controllers.

2) Any chips that get 20A would have an entirely new transistor (RibbonFET with faster transistor switching) and PowerVia (enabling higher frequencies, lower resistance, and lower capacitance).

3) Operating at a wholly different place on the performance/power curve.

4) Different core interconnects with drastically different numbers of cores.

Anything else different between these mostly same cores?
 

DavidC1

Senior member
Dec 29, 2023
I really like how you backed into Lunar Lake clocks for Skymont by using Intel peak-to-peak performance versus Crestmont, with Crestmont clocks being known. Very clever!
Thanks.

Lunar Lake really shows promise. At 3GHz clocks and per-clock performance exceeding Raptor Cove, it'll be able to run many applications on the Skymont cluster alone. Whatever power advantage Skymont has will be fully demonstrated.

Plus...
-SLC cache to reduce going to main memory
-Mem PHY power reduction*
-Better Thread Director
-Better partitioning of blocks along with further improved power management.

*In previous chips, their LPDDR support was really for compatibility. It seems they have rebuilt the memory controller to lower power.
 