Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
With Hot Chips 34 starting this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile built on the Intel 4 process, Intel's first to use EUV lithography. Intel expects to ship MTL mobile SoCs in 2023.

ARL comes after MTL, so Intel should be shipping it in 2024, according to Intel's roadmap. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake, and Panther Lake

Model           | Code-Name    | Date      | TDP       | Node              | Tiles | Main Tile       | CPU     | LP E-Core | LLC   | GPU            | Xe-cores
Core Ultra 100U | Meteor Lake  | Q4 2023   | 15 - 57 W | Intel 4 + N5 + N6 | 4     | tCPU            | 2P + 8E | 2         | 12 MB | Intel Graphics | 4
?               | Lunar Lake   | Q4 2024   | 17 - 30 W | N3B + N6          | 2     | CPU + GPU & IMC | 4P + 4E | 0         | 8 MB  | Arc            | 8
?               | Panther Lake | Q1 2026 ? | ?         | Intel 18A + N3E   | 3     | CPU + MC        | 4P + 8E | 4         | ?     | Arc            | 12



Comparison of die sizes for each tile of Meteor Lake, Arrow Lake, Lunar Lake, and Panther Lake

             | Meteor Lake     | Arrow Lake (20A) | Arrow Lake (N3B)                | Arrow Lake Refresh (N3B) | Lunar Lake    | Panther Lake
Platform     | Mobile H/U Only | Desktop Only     | Desktop & Mobile H&HX           | Desktop Only             | Mobile U Only | Mobile H
Process Node | Intel 4         | Intel 20A        | TSMC N3B                        | TSMC N3B                 | TSMC N3B      | Intel 18A
Date         | Q4 2023         | Q1 2025 ?        | Desktop: Q4 2024, H&HX: Q1 2025 | Q4 2025 ?                | Q4 2024       | Q1 2026 ?
Full Die     | 6P + 8E         | 6P + 8E ?        | 8P + 16E                        | 8P + 32E                 | 4P + 4E       | 4P + 8E
LLC          | 24 MB           | 24 MB ?          | 36 MB ?                         | ?                        | 8 MB          | ?
tCPU (mm²)   | 66.48           |                  |                                 |                          |               |
tGPU (mm²)   | 44.45           |                  |                                 |                          |               |
SoC (mm²)    | 96.77           |                  |                                 |                          |               |
IOE (mm²)    | 44.45           |                  |                                 |                          |               |
Total (mm²)  | 252.15          |                  |                                 |                          |               |



Intel Core Ultra 100 - Meteor Lake

[image: Intel Core Ultra 100 (Meteor Lake) official slide]

As reported by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU tile and the Foveros base tile. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)

 


Thunder 57

Platinum Member
Aug 19, 2007
This is nonsense. The focus on clocks is what kills efficiency on the P-cores (aside from horrible execution). The Pentium 4 amply demonstrated that pipeline stages need more transistors than originally expected.

By aiming for lower clocks you get fewer branch mispredicts and faster-cycle caches. And you need fewer transistors, meaning more efficiency.

The memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192KB + 128KB of L1 for the A12 and successors, with 3-cycle latency. It completely blows the competition away. Its massive caches with low latency are another reason why it's so power efficient.*

You should actually read proper articles describing fundamental CPU architecture rather than just guessing. And the claim that there is no simple answer is laughable: one company has clearly lapped the others for many years now, a lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by ballot what to improve?

*Having data accesses closest to the chip is what saves power. It is that simple. Apple is just executing on common-sense logic. Engineers have long said SRAM is the lowest power per bit.

These are the high-level basics of the best architecture:
-An 8-10 stage pipeline, no more; it cuts area and transistors, and improves performance by lowering branch mispredicts.
-Lower clocks, which will increase over time with a better process.
-Lower clocks allow making large caches with relatively low latency.
-All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit and thus requires less power per compute.
-Pair with excellent management and brilliant engineers.

It is Apple that has had this the longest; that's why they are successful. It has nothing to do with being a fanboy or whatever. That's just stupid bias. It's merely recognizing good work where it is.

8-10 stages? :rolleyes: Teens seem more ideal. When was the last 8-stage pipeline CPU released?
 

Hulk

Diamond Member
Oct 9, 1999
8-10 stages? :rolleyes: Teens seem more ideal. When was the last 8-stage pipeline CPU released?
I think Pentium Pro/PII were 6 instruction stages, Pentium M (Banias) was 10.

Last we heard from Intel was 14-19 for Skylake depending on uop hit rate, which was about 80%.
 

Thunder 57

Platinum Member
Aug 19, 2007
I think Pentium Pro/PII were 6 instruction stages, Pentium M (Banias) was 10.

Last we heard from Intel was 14-19 for Skylake depending on uop hit rate, which was about 80%.

Pretty sure the P6 was 5 stages. And like you said 14-19 seems about right, depending on uop hits. That's why I said less than 10 is probably not a good idea.
 

Doug S

Platinum Member
Feb 8, 2020
This is nonsense. The focus on clocks is what kills efficiency on the P-cores (aside from horrible execution). The Pentium 4 amply demonstrated that pipeline stages need more transistors than originally expected.


While we're in violent agreement about what doomed P4's quixotic quest for 10 GHz, I wonder about the why.

Having a really long pipeline means a lot of stages that individually don't do a whole lot. But I don't see any reason why the pipe stages themselves need more transistors - if you are doing a 64 bit multiply in a single cycle versus having it spread across two cycles (just as an example) it takes basically the same number of transistors to accomplish that. What causes the transistor count to multiply are all the latches. What kills your power is the spread of the clock tree across all those additional latches.
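As a back-of-envelope illustration of that point (all numbers below are hypothetical, chosen only to show the scaling): the combinational logic is merely split across stages, but every extra stage boundary adds a full-width bank of clocked flip-flops that the clock tree has to drive.

```python
# Toy model of pipeline register overhead: the stage logic is fixed, but
# each stage boundary adds a full-width bank of clocked flip-flops.
# All constants here are illustrative assumptions, not real design data.

FF_TRANSISTORS = 20  # rough transistors per flip-flop (assumption)

def boundary_flops(stages: int, datapath_bits: int) -> int:
    """Flip-flops contributed by pipeline stage boundaries alone."""
    return stages * datapath_bits

def clocked_fraction(stages: int, datapath_bits: int, logic_transistors: int) -> float:
    """Fraction of transistors that are clocked flops (a proxy for
    clock-tree load), assuming the combinational logic stays constant."""
    ff = boundary_flops(stages, datapath_bits) * FF_TRANSISTORS
    return ff / (ff + logic_transistors)

# Doubling the stage count doubles the latch count, while the multiply
# (or whatever the logic is) needs the same transistors either way:
print(boundary_flops(10, 64), boundary_flops(20, 64))  # → 640 1280
```

Same 64-bit datapath, same logic: only the latch banks and the clock network feeding them grow with stage count, which matches the argument that the latches, not the pipe stages themselves, multiply the transistor and power cost.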

Too bad no one has ever cracked asynchronous logic for full scale CPU cores so they could eliminate all those messy details. Is anyone even still researching that, or have they given it up as impossible? Maybe they need to give ChatGPT a crack at it lol
 

DavidC1

Senior member
Dec 29, 2023
Having a really long pipeline means a lot of stages that individually don't do a whole lot. But I don't see any reason why the pipe stages themselves need more transistors - if you are doing a 64 bit multiply in a single cycle versus having it spread across two cycles (just as an example) it takes basically the same number of transistors to accomplish that. What causes the transistor count to multiply are all the latches. What kills your power is the spread of the clock tree across all those additional latches.
Exactly. It's not the direct cost but the indirect costs. Look at how the transistor count ballooned with both Willamette and Prescott.

Here's a simple fact: additional stages are a guaranteed loss in performance, while speculation and prediction are a maybe. Of course there's a tradeoff, but Apple (and Cortex too) demonstrates that the obsessive focus on clocks still needs to be cut back.
I think Pentium Pro/PII were 6 instruction stages, Pentium M (Banias) was 10.

Last we heard from Intel was 14-19 for Skylake depending on uop hit rate, which was about 80%.
The Pentium was a 5-stage pipeline CPU. Pentium Pro through Pentium III had a 12-stage pipeline: https://arstechnica.com/features/2004/07/pentium-1/

It was 14 on Conroe but extended to 16 on Nehalem.

A uop cache miss extends the pipeline by about 2 cycles, while a hit makes it a few cycles shorter. So let's say 13/14 up to 18.

Golden Cove added an extra stage, so 14/15 up to 19.

Intel's E-core architectures are 14 stages, the same as the large cores without the uop cache. Intel claimed an 80% uop cache hit rate for Sandy Bridge, but Chips and Cheese testing shows Zen 2's 4K-entry uop cache averages about a 60% hit rate.
 

TwistedAndy

Member
May 23, 2024
This is nonsense. The focus on clocks is what kills efficiency on the P-cores (aside from horrible execution). The Pentium 4 amply demonstrated that pipeline stages need more transistors than originally expected.

By aiming for lower clocks you get fewer branch mispredicts and faster-cycle caches. And you need fewer transistors, meaning more efficiency.

The memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192KB + 128KB of L1 for the A12 and successors, with 3-cycle latency. It completely blows the competition away. Its massive caches with low latency are another reason why it's so power efficient.*

You should actually read proper articles describing fundamental CPU architecture rather than just guessing. And the claim that there is no simple answer is laughable: one company has clearly lapped the others for many years now, a lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by ballot what to improve?

*Having data accesses closest to the chip is what saves power. It is that simple. Apple is just executing on common-sense logic. Engineers have long said SRAM is the lowest power per bit.

These are the high-level basics of the best architecture:
-An 8-10 stage pipeline, no more; it cuts area and transistors, and improves performance by lowering branch mispredicts.
-Lower clocks, which will increase over time with a better process.
-Lower clocks allow making large caches with relatively low latency.
-All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit and thus requires less power per compute.
-Pair with excellent management and brilliant engineers.

It is Apple that has had this the longest; that's why they are successful. It has nothing to do with being a fanboy or whatever. That's just stupid bias. It's merely recognizing good work where it is.

Many people try to find a simple explanation for why one chip is faster than another.

Some try to explain it by the ISA and make a lot of stupid claims about RISC vs. CISC. Others try to explain it by decoder width: just make the decoder wider, and that's all. From time to time I have seen speculation about L1 cache size and latency. A long time ago there was also speculation about pipeline length, especially around the NetBurst launch...

The hard truth is that there's no silver bullet. When you design an architecture, you need to find the right balance to meet the requirements and restrictions.

Let's take Apple's P-cores as an example. They were initially designed for mobile devices with lower clock speeds, so Apple had to make them wide. The P-core in Apple's M-series chips, for example, was designed to execute 8 uops per cycle on average. This approach requires much more complex structures, takes more area, and is more expensive. It offers better efficiency at lower frequencies but scales pretty badly at higher ones.

As a result, a core in the Apple M4 running at 4.5 GHz consumes more than twice the power of the M1 (7.21 W vs. 3.43 W). We are actually looking at power numbers similar to those of AMD's Zen 4 running at 5.2-5.4 GHz on a much older node.

Intel and AMD use different approaches. They already run at high clocks but process ~6 uops per cycle on average in the current generation (Zen 4 and Raptor Cove). The most obvious step is to widen the execution. Skymont and Lion Cove are targeted to execute ~8 uops per cycle on average (nearly the same as Apple Silicon) while running at higher clock speeds.

I think we will see something similar with Zen 5. They probably increased the average throughput from 6 to 7 uops to get the 16% IPC boost.

The situation looks pretty confusing for Apple here, because you can't increase the frequency indefinitely. We have seen that case many times before with Intel (NetBurst, Skylake, Alder Lake). At a certain point you have to introduce some major changes to the architecture. The M4 is technically another refresh of the M1 (M1+++). Apple's case is even more complicated because they use the same P-cores for their phones, tablets, laptops, and even the Mac Pro, which all have different priorities.
 

Hulk

Diamond Member
Oct 9, 1999
Exactly though. It is not the direct cost but indirect costs. Look at how transistor count ballooned with both Willamette and Prescott.

Here's a simple fact. Additional stages is a guaranteed loss in performance, while speculation and prediction is a maybe. Of course there's a tradeoff, but Apple(Cortex too) demonstrates that there's still an obsessive focus for clocks that needs to be cut again.

Pentium was a 5 stage pipeline CPU. Pentium Pro to Pentium III was a 12-stage pipeline: https://arstechnica.com/features/2004/07/pentium-1/

It was 14 on Conroe but extended to 16 on Nehalem.

Uop cache extends to 2 cycles on a miss but few cycles shorter on a hit. So let's say 13/14 to 18.

Golden Cove added an extra stage. So 14/15 to 19.

Intel's E core architectures are 14 stages, same as large cores without the uop cache. Intel claimed 80% hit on the uop cache with Sandy Bridge but C&C testing shows Zen 2's 4K uop cache has an average hit of 60%.
6 stages for Pentium MMX, right?
 

Doug S

Platinum Member
Feb 8, 2020
Intel and AMD use different approaches. They are running on high clocks already but process ~6 uops per cycle on average in the current generation (Zen 4 and Raptor Cove). The most obvious step is to widen the execution. Skymont and Lion Cove are targeted to execute ~8 uops per cycle on average (nearly the same amount as Apple Silicon) while running on higher clock speeds.

I think we will see something similar with Zen 5. Probably, they increased the average throughput from 6 to 7 uops to get a 16% IPC boost.


What in the heck are you talking about here? You obviously have zero understanding if you think anyone is processing "6 uops per cycle on average". No one is remotely close to that, and probably won't be in our lifetimes.
 

TwistedAndy

Member
May 23, 2024
What in the heck are you talking about here? You obviously have zero understanding if you think anyone is processing "6 uops per cycle on average". No one is remotely close to that, and probably won't be in our lifetimes.
Please note that I'm talking about execution throughput, not total latency or other questionable things like "pipeline length".

There are reasons why Zen 4, for example, has a 6-wide decode queue, a 6-wide reorder buffer, and separate 6-wide INT and FP execution blocks (rename/allocate, schedulers, ports). Obviously AMD wanted to achieve parallel execution of 6 operations and made the whole pipeline wide enough to handle it.

As for Skymont and Lion Cove, it's clear that Intel tried to make them wide enough to handle 8 uops in parallel. AMD probably had the same targets with Zen 5.
 

Doug S

Platinum Member
Feb 8, 2020
Please note, that I'm talking about the execution throughput


So am I. No one, I repeat NO ONE has an average throughput of 6 uops per cycle, or is remotely close. Stop now before you dig yourself even deeper.

Designing a core to be CAPABLE of executing 6 or 8 uops per cycle (when you're going down a steep hill with a strong wind behind you) does not imply the design will come anywhere near AVERAGING 6 or 8 uops per cycle.
 

Nothingness

Platinum Member
Jul 3, 2013
Please note, that I'm talking about the execution throughput, not the total latency and other questionable things like "pipeline length".
Execution throughput is impacted by other parts of the design, in particular latency. You don't get faster only by increasing width; you need to feed your core with instructions and data, which is why branch prediction (including the implied misprediction latency) and data prefetch need to be improved to benefit from the increased width of the other parts of the core. It's an art of balance and fine-tuning, and no single feature can explain performance improvements.

And about the point @Doug S made: outside of microbenchmarks (tight loops with no misses), no CPU runs at 6 uops per cycle on average. Pick any reasonably sized program, run it, count the retired instructions and the cycles it took (there are tools for that), and you'll get the IPC (the real IPC, not the perf/clock we abusively call IPC).
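As a concrete sketch of that measurement (the counter totals below are made up; on Linux they would come from `perf stat`'s instructions and cycles counts):

```python
# Whole-program IPC from hardware counter totals (instructions / cycles).
# The totals below are hypothetical, sized like a branchy game workload.

def achieved_ipc(instructions: int, cycles: int) -> float:
    """Average retired instructions per cycle over the full run."""
    return instructions / cycles

# A 6-wide core can retire 6 per cycle in a tight loop, but over a real,
# branchy, cache-missing run the average lands near 1:
print(achieved_ipc(1_170_000_000, 1_300_000_000))  # → 0.9
```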
 

TwistedAndy

Member
May 23, 2024
So am I. No one, I repeat NO ONE has an average throughput of 6 uops per cycle, or is remotely close. Stop now before you dig yourself even deeper.

So we have a weird situation where the dumb engineers at AMD, Intel, Apple, and other companies are wasting a lot of core area on 6-8-wide reorder buffers, 6-8-wide schedulers, wide decoders, and dozens of execution ports instead of focusing on the really important things like lowering L1 cache latency, shortening the pipeline, and decreasing clocks ;)

Execution throughput is impacted by other parts of the design, in particular latency.

Execution throughput is impacted mostly by prediction accuracy. Latency and pipeline length become an issue when a CPU core often mispredicts a branch.

Modern CPUs have pretty good branch predictors. For example, Zen 4 is capable of sustaining 6 micro-ops per cycle, and mispredictions cost roughly 10-15% of the throughput, according to Chips and Cheese:

[attachment: brp_ipc.png]

If we take those numbers into account, the actual throughput for Zen 4 in The Elder Scrolls would be 5.5 uops per cycle. Obviously it depends on the application, but games are among the most branchy ones.

Also, there is another article on the same site with an IPC table:

[attachment: 1718691978222.png]

As you can see, AMD's Zen 4 and Golden Cove are close to 6 instructions per cycle (5.7, to be precise) for the most common instructions. Please note that this table contains actual instructions, not uops.
 

Nothingness

Platinum Member
Jul 3, 2013
So we have a weird situation when dumb engineers at AMD, Intel, Apple, and other companies are wasting a lot of the core area on 6-8-wide reordering buffers, 6-8 wide schedulers, wide decoders, dozens of execution ports instead of focusing on the really important things like lowering the L1-cache latency, decreasing the pipeline length, and decreasing clocks ;)
Where did Doug say it's not worth the price? It definitely is worth the cost when the rest of your uarch is good enough. But you won't get a 16% IPC increase only by increasing the width from 6 to 7.

Execution throughput is impacted mostly by the prediction accuracy. Latency and pipeline length become an issue when a CPU core often misses the execution branch.

The modern CPUs have pretty good branch predictors. For example, Zen 4 is capable of sustaining 6 micro-ops per cycle. The mispredictions cost nearly 10-15% of the throughput, according to Chips and Cheese:


View attachment 101361

If we take those numbers into account, the actual throughput for Zen 4 in Elder Scrolls is 5.5 uops per cycle. Obviously, it depends on the application, but games are the most branchy ones.
How do you go from an IPC of 0.9 to 5.5 uops per cycle? Don't tell me you multiplied the theoretical max of 6 instructions by 0.9 to get to 5.5?

This clearly shows that a real application has an IPC < 1.

Also, there is another article on the same site with the IPC table:

View attachment 101364
As you may see, AMD Zen 4 and Golden Cove are close to the 6 instructions per cycle (5.7, to be precise) for the most common instructions. Please note that in this table, we have the actual instructions, not uops.
This table is from a microbenchmark that measures rename optimizations. How does it show that a non-microbenchmark will achieve >5 uops per cycle? I never disputed the claim that a CPU can reach its full width at moments. What is wrong is to derive from this that the sustained IPC will increase by 16% if your max uops per cycle goes from 6 to 7 in a real application.
 

TwistedAndy

Member
May 23, 2024
But you won't get 16% IPC increase by only increasing the width from 6 to 7.
You have to increase the pipeline width to achieve higher IPC. In the case of Zen 5, AMD will implement an 8-wide execution pipeline with an average execution rate of nearly 7 uops per cycle. So instead of the 5.7 IPC for Zen 4 from the table above, we will get 6.6-6.8 IPC in Zen 5.

I think Intel's Lion Cove will show similar numbers.

How do you go from an IPC of 0.9 to 5.5 uops per cycle? Don't tell me you multiplied the theoretical max of 6 instructions by 0.9 to get to 5.5?

The theoretical max for Zen 4 is nearly 12 uops per cycle, when both the FP and INT schedulers are fully loaded. In a real-case scenario there will be nearly 6 uops per cycle, minus the branch mispredict penalty. That's how we get 5.5 IPC for games and 5.7 for synthetic workloads.

There is no sense in making the reorder buffers and rename/allocate blocks wider than the expected throughput; their complexity grows exponentially with the number of ports. AMD made them 6-wide because the expected throughput for Zen 4 was 6 uops per cycle.

This table is for a micro benchmark that measures rename optimizations. How does it show a non micro benchmark will achieve >5 uops per cycle? I never disputed the claim a CPU can reach its full width at moments. What is wrong is to derive from this that the sustained IPC will increase by 16% if your max uop per cycle goes from 6 to 7 on a real application.

Here's another benchmark for Golden Cove:

[attachment: 1718696855705.png]

We see the same 5.7 IPC for the 6-wide Golden Cove. Note that the results for XOR and SUB are nearly the same as those for MOV.

It's not surprising that we see similar IPC numbers for Zen 4 and Golden Cove, which are both 6-wide architectures. Zen 3 is actually also a 6-wide architecture, just less optimized for some workloads.

It may sound weird, but the engineers at Intel, AMD, Apple, and many other companies are not dumb. There are reasons why Intel, for example, decided on a 9-wide decoder, an 8-wide uop multiplexer, and 8 ALUs in Skymont.
 

Nothingness

Platinum Member
Jul 3, 2013
You have to increase the pipeline width to achieve higher IPC. In the case of Zen 5, AMD will implement the 8-wide execution pipeline with nearly 7 uops per cycle average execution rate. So, instead of 5.7 IPC for Zen 4 from the table above, we will get 6.6-6.8 IPC in Zen 5.

I think the Intel Lion Cove will show similar numbers.
Yes, you have to increase the width, but you also have to improve other areas unrelated to the width.

Third time I've written it.

The theoretical max for Zen 4 is nearly 12 uops per cycle when both FP and INT schedulers are fully loaded. In the real-case scenario, there will be nearly 6 uops per cycle minus the branch mispredict penalty. That's how we get 5.5 IPC for games and 5.7 for synthetic workloads.
No, you don't get an IPC of 5.5 in games. You seem unable to understand what ChipsAndCheese wrote.

ChipsAndCheese clearly shows the "Achieved IPC" in games is 0.9 and adds "Both games experience low average IPC."

Second time I've written it.

There is no sense in making the reorder buffers and rename/allocate blocks wider than the expected throughput. Their complexity grows exponentially to the number of ports. AMD has made them 6-wide because the expected throughput for Zen 4 was 6 uops per cycle.



Here's another benchmark for Golden Cove:
View attachment 101365

We see the same 5.7 IPC for 6-wide Golden Cove. Note that the results for XOR and SUB are nearly the same as those for MOV.

It's not surprising that we see similar IPC numbers for Zen 4 and Golden Cove, which both are 6-wide architectures. Actually, Zen 3 is also a 6-wide architecture but less optimized for some workloads.
This is a microbenchmark to measure rename BW. It only measures that and doesn't represent application IPC.

Second time I've written it.

It may sound weird, but engineers at Intel, AMD, Apple, and many other companies are not dumb. There are reasons why Intel, for example, decided to have a 9-wide decoder, an 8-wide multiplexer for uops, and 8 ALUs in Skymont.
Captain Obvious.

You seem not to read what people write. For the nth time: microbenchmarks don't represent full-application performance. You even pretend that GB doesn't represent anything, and now you exhibit microbenchmarks that say nothing about app perf.

Now, as you don't seem completely stupid, I wonder where our misunderstanding comes from.
 

DavidC1

Senior member
Dec 29, 2023
Estimating clocks for the Skymont cluster in Lunar Lake.

SPECint scales at about 90% relative to clocks. With the 38% uarch gain, at the same power we're at about 3.1 GHz, and at max performance it's about 3.7 GHz for single thread, compared to a max of 2.5 GHz for the Crestmont LP E-cores.

For the MT claim, Intel cites 2.9x with 2x the cores. Core-count scaling is also about 90% for the most part, so that leaves 53% for uarch + clocks. That puts the Skymont cluster at ~10% higher clocks with 2x the core count at the same power as the MTL LP E-core island, while having a 38%/68% per-clock gain. Quite impressive!

At max power it works out to slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6 GHz in Arrow Lake, so there's no way we'll see 5 GHz like the extrapolated graph on X.
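The estimate above can be sketched as a quick solve. The 1.7x and 2.0x totals below are assumed stand-ins for the claimed iso-power and peak ST uplifts (my assumption, not numbers stated in the post), combined with the post's 1.38x uarch gain and ~90% clock scaling:

```python
# Rough solve for the clock implied by a claimed perf uplift, in the
# spirit of the estimate above. All inputs are assumptions:
#   total_gain    - claimed perf vs. the old core (e.g. 1.7x)
#   uarch_gain    - per-clock (IPC) gain of the new core (e.g. 1.38x)
#   clock_scaling - fraction of a clock increase that shows up as perf (~0.9)

def implied_clock(total_gain, uarch_gain, base_clock_ghz, clock_scaling=0.9):
    """Solve total = uarch * (1 + scaling * (f/f0 - 1)) for f."""
    ratio = 1 + (total_gain / uarch_gain - 1) / clock_scaling
    return base_clock_ghz * ratio

# Crestmont LPE tops out at 2.5 GHz (from the post):
print(round(implied_clock(1.7, 1.38, 2.5), 2))  # → 3.14 (the ~3.1 GHz above)
print(round(implied_clock(2.0, 1.38, 2.5), 2))  # → 3.75 (the ~3.7 GHz above)
```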
 

TwistedAndy

Member
May 23, 2024
This is a microbenchmark to measure rename BW.

When did ADD and SUB become renamed? Also, MOV-elimination (rename) is not always possible. That's why we have a few cases with MOV in that table.

Estimating clocks for the Skymont cluster in Lunar Lake.

Yep, I think in Lunar Lake the Skymont clocks will be somewhere between 3.0 and 3.5 GHz at ~1.5-2 W per core. Lion Cove will use 3-4 W per core.
 

SiliconFly

Golden Member
Mar 10, 2023
Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.
4.6GHz is also an unverified leak. I don't think we should get ahead of ourselves. For all we know, it might top out at just 4GHz!
 

DavidC1

Senior member
Dec 29, 2023
Lunar Lake not only achieves a 40% reduction in SoC power; the memory PHY was also improved to cut its power by 40%.
4.6GHz is also an unverified leak. I don't think we should get ahead of ourselves. For all we know, it might top out at just 4GHz!
Then Lion Cove would clock at 5GHz and be 10-15% slower in ST compared to Raptor Cove. No, I believe the leak has a high chance of being correct. There's no indication that clocks are lower on Skymont.
 

Nothingness

Platinum Member
Jul 3, 2013
When did ADD and SUB become renamed? Also, MOV-elimination (rename) is not always possible. That's why we have a few cases with MOV in that table.
Anything that writes a register needs to be renamed (and on x86 that often requires allocating two registers: destination + flags). That's the basis of OoOE.

XOR/SUB reg,reg is subject to zero-idiom elimination, which saves allocation bandwidth. This is explained in section 4.1.5 of the Intel Optimization Reference Manual (I didn't find anything in the Zen 4 manual). Depending on how rename works, it can be implemented by pointing the destination at a physical register that is always zero.
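A toy sketch of what that rename-stage shortcut looks like (a hypothetical model, not any real core's logic): zero idioms and eliminated MOVs are resolved by the renamer and never consume an ALU port.

```python
# Toy rename stage with zero-idiom and move elimination, as discussed
# above. This is an illustrative model, not a real core's implementation.

def rename(uops):
    """uops: list of (op, dst, src1, src2) tuples.
    Returns (uops_needing_an_alu, eliminated_count)."""
    alu_uops, eliminated = [], 0
    for op, dst, src1, src2 in uops:
        if op in ("xor", "sub") and src1 == src2:
            # Zero idiom: dst is simply mapped to an always-zero register.
            eliminated += 1
        elif op == "mov" and src2 is None:
            # Move elimination: dst aliases src1's physical register.
            eliminated += 1
        else:
            alu_uops.append((op, dst, src1, src2))
    return alu_uops, eliminated

stream = [
    ("xor", "rax", "rax", "rax"),  # zeroing idiom - eliminated
    ("mov", "rbx", "rax", None),   # reg-reg move - eliminated
    ("add", "rcx", "rax", "rbx"),  # real work - needs an ALU
]
alu_uops, eliminated = rename(stream)
print(len(alu_uops), eliminated)  # → 1 2
```

This is also why a rename-bandwidth microbenchmark full of such idioms can sustain numbers near the core's nominal width without saying much about sustained IPC on real code.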

You didn't answer the rest of my post. Nothing to add about games reaching 5.5 IPC?
 

TwistedAndy

Member
May 23, 2024
Anything that writes a register needs to be renamed (and on x86 that often requires the allocation of two registers: destination + flags). That's the basis of OoOE.

XOR/SUB reg,reg is subject to zero elimination. This saves allocation bandwidth. This is explained in section 4.1.5 of Intel Optimization Ref Manual (I didn't find anything in the Zen4 manual). Depending on how rename works, this can be implemented by using a physical register which is always zero.

You didn't answer to the rest of my post. Nothing to add about games reaching 5.5 IPC?

Zero elimination has its own preconditions; it's not something that takes place all the time, and it's described in the Intel manual you are referring to. That's why there are a few cases with MOV operations in the table.

Also, there are add and subtract operations, which directly involve the whole execution pipeline. As we can see, both Intel's Golden Cove and AMD's Zen 4 come pretty close to their target execution width in multiple tests (5.7 vs. 6 uops per cycle).
 

dullard

Elite Member
May 21, 2001
I heard sometime in July for Zen 5, not necessarily as late as the 31st. As for Intel, ARL may be formally announced in October, but the latest rumors I have heard are saying it may be early 2025 for significant availability. Considering the poor availability after the Mountain Lake release, the hype given to Lunar Lake, and the general lack of emphasis on ARL, I dont find that hard to believe at all. Hell, a major laptop manufacturer whose name slips my mind right now is expressing concern that Lunar Lake will not have good availability for the holiday season. I am confused too. I thought ARL was supposed to be the next big thing, while Lunar Lake was a niche product coming after ARL. Now it seems all the emphasis has shifted to LL, not a good sign for ARL, IMO.
Looks like July 15 for the Ryzen 300 Series laptop chips and either preorder or actual availability (unclear) of July 31 for the Ryzen 9000s. Since this is a discussion about Arrow Lake, which is best compared to the Ryzen 9000 line, I used July 31st. https://videocardz.com/newz/amd-ryz...00-sales-start-july-31-according-to-retailers

I don't follow how a rumor of one complaint about Lunar Lake availability has much to do with Arrow Lake's release date and/or availability.

What is Mountain Lake?

Lunar Lake and Arrow Lake are totally different market segments, and Lunar Lake is launching sooner than Arrow Lake. So why does the focus on the sooner-launching, ultralight-notebook Lunar Lake say anything about the higher-powered Arrow Lake desktop and mobile chips?

If I follow your logic, Ford promoting its upcoming redesigned Mustang released in the summer means an F-150 released the following fall is both bad and delayed until winter?
 

TwistedAndy

Member
May 23, 2024
Lunar Lake and Arrow Lake are totally different market segments. And Lunar Lake is launching sooner than Arrow Lake. So, why does focus on the sooner launching ultralight notebook Lunar Lake have anything to say about higher powered Arrow Lake desktop and higher power mobile chips?

If I follow your logic, that means if Ford promotes its upcoming redesigned Mustang released during a summer then that means that a F150 released a later fall is both bad and delayed until Winter?

There's a lot of attention paid to Lunar Lake because it clearly shows what to expect from Arrow Lake, which has mostly the same P- and E-cores.

We will get more details about Arrow Lake only in August.

Intel usually announces new desktop CPUs in October, while mobile CPUs are announced at CES in January. There's no reason to think this year will be different.
 

Hulk

Diamond Member
Oct 9, 1999
Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.
I really like how you backed into the Lunar Lake Skymont clocks by using Intel's peak-to-peak performance claims versus Crestmont, with the Crestmont clocks being known. Very clever!