Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads

Tigerick · Aug 22, 2022

Wildcat Lake (WCL) Specs

Intel Wildcat Lake (WCL) is upcoming mobile SoC replacing Raptor Lake-U. WCL consists of 2 tiles: compute tile and PCD tile. It is true single die consists of CPU, GPU and NPU that is fabbed by 18-A process. Last time I checked, PCD tile is fabbed by TSMC N6 process. They are connected through UCIe, not D2D; a first from Intel. Expecting launching in Q1 2026.

	Intel Raptor Lake U	Intel Wildcat Lake 15W	Intel Lunar Lake	Intel Panther Lake 4+0+4
Launch Date	Q1-2024	Q2-2026	Q3-2024	Q1-2026
Model	Intel 150U	Intel Core 7 360	Core Ultra 7 268V	Core Ultra 7 365
Dies	2	2	2	3
Node	Intel 7 + ?	Intel 18-A + TSMC N6	TSMC N3B + N6	Intel 18-A + Intel 3 + TSMC N6

CPU	2 P-core + 8 E-cores	2 P-core + 4 LP E-cores	4 P-core + 4 LP E-cores	4 P-core + 4 LP E-cores
Threads	12	6	8	8
Max Clock	5.4 GHz	4.8 GHz	5 GHz	4.8 GHz
L3 Cache	12 MB	6 MB	12 MB	12 MB
TDP	15 - 55 W	15 - 35 W	17 - 37 W	25 - 55 W

Memory	128-bit LPDDR5-5200	64-bit LPDDR5x-7467	128-bit LPDDR5x-8533	128-bit LPDDR5x-7467
Size	96 GB	48 GB	32 GB	128 GB
Bandwidth	83 GB/s	60 GB/s	136 GB/s	120 GB/s

GPU	Intel Graphics	Intel Graphics	Arc 140V	Intel Graphics
RT	No	No	YES	YES
EU / Xe	96 EU	2 Xe	8 Xe	4 Xe
Max Clock	1.3 GHz	2.6 GHz	2 GHz	2.5 GHz

NPU	GNA 3.0	17 TOPS	48 TOPS	49 TOPS

As Hot Chips 34 starting this week, Intel will unveil technical information of upcoming Meteor Lake (MTL) and Arrow Lake (ARL), new generation platform after Raptor Lake. Both MTL and ARL represent new direction which Intel will move to multiple chiplets and combine as one SoC platform.

MTL also represents new compute tile that based on Intel 4 process which is based on EUV lithography, a first from Intel. Intel expects to ship MTL mobile SoC in 2023.

ARL will come after MTL so Intel should be shipping it in 2024, that is what Intel roadmap is telling us. ARL compute tile will be manufactured by Intel 20A process, a first from Intel to use GAA transistors called RibbonFET.

Thunder 57 · Jun 16, 2024

DavidC1 said:
This is nonsense. The focus on clocks is what kills efficiency on the P cores(aside from horrible execution). Pentium 4 has amply demonstrated that pipeline stages need more transistors than originally expected.

By aiming for lower clocks you can have less branch mispredicts and faster cycle caches. And you need less transistors, meaning more efficient.

Memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192KB + 128KB L1 for the A12 and successors with 3 cycle latency. It completely blows the competition away. It's massive caches with low latency is another reason why it's so power efficient.*

You should actually read into proper articles describing fundamental CPU architecture rather than just guessing. And the claims that there is no simple answer is laughable - one company clearly laps around the others for many years now, the lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by a ballot on what to improve?

*Having data accesses closest to the chip is what saves power. It is that simple. Apple is just executing on common sense logic. Engineers have long said SRAM is the lowest power per bit.

This is the high level basics of the best architecture
-8-10 stage pipeline, no more, cuts on area, transistors, and improves performance by lowering branch mispredicts.
-Lower clocks, which will increase over time with better process.
-Lower clocks allows making large caches with relatively low latency.
-All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit thus requires less power per compute.
-Pair with excellent management and brilliant engineers.

It is Apple that had it for the longest time, that's why they are successful. Has nothing to do with being a fanboy or whatever. That's just stupid bias. It's merely recognizing good work where it is.

8-10 stages? 🙄 Teens seem to be more ideal. When was the last 8 pipline CPU released?

Hulk · Jun 16, 2024

Thunder 57 said:
8-10 stages? 🙄 Teens seem to be more ideal. When was the last 8 pipline CPU released?

I think Pentium Pro/PII were 6 instruction stages, Pentium M (Banias) was 10.

Last we heard from Intel was 14-19 for Skylake depending on uop hit rate, which was about 80%.

Thunder 57 · Jun 16, 2024

Hulk said:
I think Pentium Pro/PII were 6 instruction stages, Pentium M (Banias) was 10.

Last we heard from Intel was 14-19 for Skylake depending on uop hit rate, which was about 80%.

Pretty sure the P6 was 5 stages. And like you said 14-19 seems about right, depending on uop hits. That's why I said less than 10 is probably not a good idea.

Doug S · Jun 16, 2024

DavidC1 said:
This is nonsense. The focus on clocks is what kills efficiency on the P cores(aside from horrible execution). Pentium 4 has amply demonstrated that pipeline stages need more transistors than originally expected.

While we're in violent agreement about what doomed P4's quixotic quest for 10 GHz, I wonder about the why.

Having a really long pipeline means a lot of stages that individually don't do a whole lot. But I don't see any reason why the pipe stages themselves need more transistors - if you are doing a 64 bit multiply in a single cycle versus having it spread across two cycles (just as an example) it takes basically the same number of transistors to accomplish that. What causes the transistor count to multiply are all the latches. What kills your power is the spread of the clock tree across all those additional latches.

Too bad no one has ever cracked asynchronous logic for full scale CPU cores so they could eliminate all those messy details. Is anyone even still researching that, or have they given it up as impossible? Maybe they need to give ChatGPT a crack at it lol

DavidC1 · Jun 17, 2024

Doug S said:
Having a really long pipeline means a lot of stages that individually don't do a whole lot. But I don't see any reason why the pipe stages themselves need more transistors - if you are doing a 64 bit multiply in a single cycle versus having it spread across two cycles (just as an example) it takes basically the same number of transistors to accomplish that. What causes the transistor count to multiply are all the latches. What kills your power is the spread of the clock tree across all those additional latches.

Exactly though. It is not the direct cost but indirect costs. Look at how transistor count ballooned with both Willamette and Prescott.

Here's a simple fact. Additional stages is a guaranteed loss in performance, while speculation and prediction is a maybe. Of course there's a tradeoff, but Apple(Cortex too) demonstrates that there's still an obsessive focus for clocks that needs to be cut again.

Hulk said:
I think Pentium Pro/PII were 6 instruction stages, Pentium M (Banias) was 10.

Last we heard from Intel was 14-19 for Skylake depending on uop hit rate, which was about 80%.

Pentium was a 5 stage pipeline CPU. Pentium Pro to Pentium III was a 12-stage pipeline: https://arstechnica.com/features/2004/07/pentium-1/

It was 14 on Conroe but extended to 16 on Nehalem.

Inside Nehalem: Intel's Future Processor and System - Real World Tech

Learn the details on Intel's 45nm Nehalem processor, which features an new system interface: DDR3 integrated memory controllers and CSI or QuickPath Interconnects. The Penryn/Core microarchitecture has been substantially upgraded with improvements spanning the whole pipeline, especially notable...

www.realworldtech.com

Uop cache extends to 2 cycles on a miss but few cycles shorter on a hit. So let's say 13/14 to 18.

Intel's Sandy Bridge Microarchitecture - Page 4 of 10 - Real World Tech

At IDF, Intel revealed the future Sandy Bridge microprocessor. It is an entirely new design - a synthesis of Nehalem, ideas from the Pentium 4 and a new Gen 6 graphics architecture. The result is a novel microprocessor, GPU and system infrastructure tightly integrated into a 32nm chip. This...

www.realworldtech.com

Golden Cove added an extra stage. So 14/15 to 19.

Intel's E core architectures are 14 stages, same as large cores without the uop cache. Intel claimed 80% hit on the uop cache with Sandy Bridge but C&C testing shows Zen 2's 4K uop cache has an average hit of 60%.

TwistedAndy · Jun 17, 2024

DavidC1 said:
This is nonsense. The focus on clocks is what kills efficiency on the P cores(aside from horrible execution). Pentium 4 has amply demonstrated that pipeline stages need more transistors than originally expected.

By aiming for lower clocks you can have less branch mispredicts and faster cycle caches. And you need less transistors, meaning more efficient.

Memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192KB + 128KB L1 for the A12 and successors with 3 cycle latency. It completely blows the competition away. It's massive caches with low latency is another reason why it's so power efficient.*

You should actually read into proper articles describing fundamental CPU architecture rather than just guessing. And the claims that there is no simple answer is laughable - one company clearly laps around the others for many years now, the lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by a ballot on what to improve?

*Having data accesses closest to the chip is what saves power. It is that simple. Apple is just executing on common sense logic. Engineers have long said SRAM is the lowest power per bit.

This is the high level basics of the best architecture
-8-10 stage pipeline, no more, cuts on area, transistors, and improves performance by lowering branch mispredicts.
-Lower clocks, which will increase over time with better process.
-Lower clocks allows making large caches with relatively low latency.
-All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit thus requires less power per compute.
-Pair with excellent management and brilliant engineers.

It is Apple that had it for the longest time, that's why they are successful. Has nothing to do with being a fanboy or whatever. That's just stupid bias. It's merely recognizing good work where it is.

Many people try to find a simple explanation of why one chip is faster than the other one.

Some try to explain it by the ISA and make a lot of stupid claims regarding RISC vs. CISC. Other ones try to explain it using the decoder width: just make the decoder wider, and that's all. From time to time, I have seen speculations about the L1 cache size and latency. A long time ago, there were also speculations regarding the pipeline length, especially with the Netburst launch...

The hard truth is that there's no silver bullet. When you design an architecture, you need to find the right balance to meet the requirements and restrictions.

Let's take Apple P-cores as an example. Initially, they were designed for mobile devices with lower clock speeds. Apple has to make them wide. The P-core in Apple M-series chips, for example, was designed to execute 8 uops per cycle on average. This approach requires much more complex structures, takes more area, and is more expensive. It offers better efficiency on lower frequencies but scales pretty badly on higher ones.

As a result, the core of the Apple M4 running on 4.5 GHz consumes more than twice the power of the M1 (7.21W vs. 3.43W). Actually, we are looking at power numbers similar to those of AMD Zen 4 running on 5.2-5.4 GHz and on the much older node.

Intel and AMD use different approaches. They are running on high clocks already but process ~6 uops per cycle on average in the current generation (Zen 4 and Raptor Cove). The most obvious step is to widen the execution. Skymont and Lion Cove are targeted to execute ~8 uops per cycle on average (nearly the same amount as Apple Silicon) while running on higher clock speeds.

I think we will see something similar with Zen 5. Probably, they increased the average throughput from 6 to 7 uops to get a 16% IPC boost.

Apple looks pretty confusing here because you can't increase the frequency indefinitely. We had that case many times before with Intel (Netburst, Skylake, Alder Lake). At a certain point, you have to introduce some major changes to the architecture. M4 is technically another refresh of M1 (M1+++). The case with Apple is even more complicated because they use the same P-cores for their phones, tablets, laptops, and even Mac Pro. They all have different priorities.

Nothingness · Jun 17, 2024

TwistedAndy said:
Probably, they increased the average throughput from 6 to 7 uops to get a 16% IPC boost.

That made my day.

Hulk · Jun 17, 2024

DavidC1 said:
Exactly though. It is not the direct cost but indirect costs. Look at how transistor count ballooned with both Willamette and Prescott.

Here's a simple fact. Additional stages is a guaranteed loss in performance, while speculation and prediction is a maybe. Of course there's a tradeoff, but Apple(Cortex too) demonstrates that there's still an obsessive focus for clocks that needs to be cut again.

Pentium was a 5 stage pipeline CPU. Pentium Pro to Pentium III was a 12-stage pipeline: https://arstechnica.com/features/2004/07/pentium-1/

It was 14 on Conroe but extended to 16 on Nehalem.

Inside Nehalem: Intel's Future Processor and System - Real World Tech

Learn the details on Intel's 45nm Nehalem processor, which features an new system interface: DDR3 integrated memory controllers and CSI or QuickPath Interconnects. The Penryn/Core microarchitecture has been substantially upgraded with improvements spanning the whole pipeline, especially notable...

www.realworldtech.com

Uop cache extends to 2 cycles on a miss but few cycles shorter on a hit. So let's say 13/14 to 18.

Intel's Sandy Bridge Microarchitecture - Page 4 of 10 - Real World Tech

At IDF, Intel revealed the future Sandy Bridge microprocessor. It is an entirely new design - a synthesis of Nehalem, ideas from the Pentium 4 and a new Gen 6 graphics architecture. The result is a novel microprocessor, GPU and system infrastructure tightly integrated into a 32nm chip. This...

www.realworldtech.com

Golden Cove added an extra stage. So 14/15 to 19.

Intel's E core architectures are 14 stages, same as large cores without the uop cache. Intel claimed 80% hit on the uop cache with Sandy Bridge but C&C testing shows Zen 2's 4K uop cache has an average hit of 60%.

6 stages for Pentium MMX, right?

Doug S · Jun 17, 2024

TwistedAndy said:
Intel and AMD use different approaches. They are running on high clocks already but process ~6 uops per cycle on average in the current generation (Zen 4 and Raptor Cove). The most obvious step is to widen the execution. Skymont and Lion Cove are targeted to execute ~8 uops per cycle on average (nearly the same amount as Apple Silicon) while running on higher clock speeds.

I think we will see something similar with Zen 5. Probably, they increased the average throughput from 6 to 7 uops to get a 16% IPC boost.

What in the heck are you talking about here? You obviously have zero understanding if you think anyone is processing "6 uops per cycle on average". No one is remotely close to that, and probably won't be in our lifetimes.

TwistedAndy · Jun 18, 2024

Doug S said:
What in the heck are you talking about here? You obviously have zero understanding if you think anyone is processing "6 uops per cycle on average". No one is remotely close to that, and probably won't be in our lifetimes.

Please note, that I'm talking about the execution throughput, not the total latency and other questionable things like "pipeline length".

There are reasons why Zen 4, for example, has 6-wide decode queue, 6-wide reordering buffer, and separate 6-wide INT- and FP execution blocks (rename/allocate, schedulers, ports). Obviously, AMD wanted to achieve the parallel execution of 6 operations and made the whole pipeline wide enough to handle it.

As for Skymont and Lion Cove, it's clear that Intel tried to make it wide enough to handle 8 uops in parallel. Probably, AMD had the same targets with Zen 5.

Doug S · Jun 18, 2024

TwistedAndy said:
Please note, that I'm talking about the execution throughput

So am I. No one, I repeat NO ONE has an average throughput of 6 uops per cycle, or is remotely close. Stop now before you dig yourself even deeper.

Designing a core to be CAPABLE of executing 6 or 8 uops per cycle (when you're going down a steep hill with a strong wind behind you) does not imply the design will come anywhere near AVERAGING 6 or 8 uops per cycle.

Nothingness · Jun 18, 2024

TwistedAndy said:
Please note, that I'm talking about the execution throughput, not the total latency and other questionable things like "pipeline length".

Execution throughput is impacted by other parts of the design, in particular latency. You don't get faster by only increasing width; you need to feed your core with instructions and data which is why branch prediction (including implied misprediction latency) and data prefetch need to be improved to benefit from the increased width of other parts of the core. It's an art of balance and fine tuning, and no single feature can explain performance improvements.

And about the point @Doug S made: outside of micro benchmarks (tight loops with no miss), no CPU runs at 6 uops per cycle on average. Pick any reasonably sized program, run it, count number of instructions (there are tools for that) and you'll get the IPC (the real IPC, not perf/clock one we abusively call IPC).

TwistedAndy · Jun 18, 2024

Doug S said:
So am I. No one, I repeat NO ONE has an average throughput of 6 uops per cycle, or is remotely close. Stop now before you dig yourself even deeper.

So we have a weird situation when dumb engineers at AMD, Intel, Apple, and other companies are wasting a lot of the core area on 6-8-wide reordering buffers, 6-8 wide schedulers, wide decoders, dozens of execution ports instead of focusing on the really important things like lowering the L1-cache latency, decreasing the pipeline length, and decreasing clocks 😉

Nothingness said:
Execution throughput is impacted by other parts of the design, in particular latency.

Execution throughput is impacted mostly by the prediction accuracy. Latency and pipeline length become an issue when a CPU core often misses the execution branch.

The modern CPUs have pretty good branch predictors. For example, Zen 4 is capable of sustaining 6 micro-ops per cycle. The mispredictions cost nearly 10-15% of the throughput, according to Chips and Cheese:

If we take those numbers into account, the actual throughput for Zen 4 in Elder Scrolls is 5.5 uops per cycle. Obviously, it depends on the application, but games are the most branchy ones.

Also, there is another article on the same site with the IPC table:

As you may see, AMD Zen 4 and Golden Cove are close to the 6 instructions per cycle (5.7, to be precise) for the most common instructions. Please note that in this table, we have the actual instructions, not uops.

Nothingness · Jun 18, 2024

TwistedAndy said:
So we have a weird situation when dumb engineers at AMD, Intel, Apple, and other companies are wasting a lot of the core area on 6-8-wide reordering buffers, 6-8 wide schedulers, wide decoders, dozens of execution ports instead of focusing on the really important things like lowering the L1-cache latency, decreasing the pipeline length, and decreasing clocks 😉

Where did Doug say it's not worth the price? It definitely is worth the cost when the rest of your uarch is good enough. But you won't get 16% IPC increase by only increasing the width from 6 to 7.

TwistedAndy said:
Execution throughput is impacted mostly by the prediction accuracy. Latency and pipeline length become an issue when a CPU core often misses the execution branch.

The modern CPUs have pretty good branch predictors. For example, Zen 4 is capable of sustaining 6 micro-ops per cycle. The mispredictions cost nearly 10-15% of the throughput, according to Chips and Cheese:

View attachment 101361

If we take those numbers into account, the actual throughput for Zen 4 in Elder Scrolls is 5.5 uops per cycle. Obviously, it depends on the application, but games are the most branchy ones.

How do you go from an IPC of 0.9 to 5.5 uops per cycle? Don't tell me you multiplied the theoretical max of 6 instructions by 0.9 to get to 5.5?

This clearly shows that a real application has an IPC < 1.

TwistedAndy said:
Also, there is another article on the same site with the IPC table:

View attachment 101364
As you may see, AMD Zen 4 and Golden Cove are close to the 6 instructions per cycle (5.7, to be precise) for the most common instructions. Please note that in this table, we have the actual instructions, not uops.

This table is for a micro benchmark that measures rename optimizations. How does it show a non micro benchmark will achieve >5 uops per cycle? I never disputed the claim a CPU can reach its full width at moments. What is wrong is to derive from this that the sustained IPC will increase by 16% if your max uop per cycle goes from 6 to 7 on a real application.

TwistedAndy · Jun 18, 2024

Nothingness said:
But you won't get 16% IPC increase by only increasing the width from 6 to 7.

You have to increase the pipeline width to achieve higher IPC. In the case of Zen 5, AMD will implement the 8-wide execution pipeline with nearly 7 uops per cycle average execution rate. So, instead of 5.7 IPC for Zen 4 from the table above, we will get 6.6-6.8 IPC in Zen 5.

I think the Intel Lion Cove will show similar numbers.

Nothingness said:
How do you go from an IPC of 0.9 to 5.5 uops per cycle? Don't tell me you multiplied the theoretical max of 6 instructions by 0.9 to get to 5.5?

The theoretical max for Zen 4 is nearly 12 uops per cycle when both FP and INT schedulers are fully loaded. In the real-case scenario, there will be nearly 6 uops per cycle minus the branch mispredict penalty. That's how we get 5.5 IPC for games and 5.7 for synthetic workloads.

There is no sense in making the reorder buffers and rename/allocate blocks wider than the expected throughput. Their complexity grows exponentially to the number of ports. AMD has made them 6-wide because the expected throughput for Zen 4 was 6 uops per cycle.

Nothingness said:
This table is for a micro benchmark that measures rename optimizations. How does it show a non micro benchmark will achieve >5 uops per cycle? I never disputed the claim a CPU can reach its full width at moments. What is wrong is to derive from this that the sustained IPC will increase by 16% if your max uop per cycle goes from 6 to 7 on a real application.

Here's another benchmark for Golden Cove:

We see the same 5.7 IPC for 6-wide Golden Cove. Note that the results for XOR and SUB are nearly the same as those for MOV.

It's not surprising that we see similar IPC numbers for Zen 4 and Golden Cove, which both are 6-wide architectures. Actually, Zen 3 is also a 6-wide architecture but less optimized for some workloads.

It may sound weird, but engineers at Intel, AMD, Apple, and many other companies are not dumb. There are reasons why Intel, for example, decided to have a 9-wide decoder, an 8-wide multiplexer for uops, and 8 ALUs in Skymont.

Nothingness · Jun 18, 2024

TwistedAndy said:
You have to increase the pipeline width to achieve higher IPC. In the case of Zen 5, AMD will implement the 8-wide execution pipeline with nearly 7 uops per cycle average execution rate. So, instead of 5.7 IPC for Zen 4 from the table above, we will get 6.6-6.8 IPC in Zen 5.

I think the Intel Lion Cove will show similar numbers.

Yes you have to increase the width, but you also have to improve other areas unrelated to the width.

Third time I write it.

TwistedAndy said:
The theoretical max for Zen 4 is nearly 12 uops per cycle when both FP and INT schedulers are fully loaded. In the real-case scenario, there will be nearly 6 uops per cycle minus the branch mispredict penalty. That's how we get 5.5 IPC for games and 5.7 for synthetic workloads.

No you don't get an IPC of 5.5 in games. You seem unable to understand what ChipsAndCheese wrote.

ChipAndCheese clearly shows the "Achieved IPC" in game is 0.9 and adds "Both games experience low average IPC."

Second time I write it.

TwistedAndy said:
There is no sense in making the reorder buffers and rename/allocate blocks wider than the expected throughput. Their complexity grows exponentially to the number of ports. AMD has made them 6-wide because the expected throughput for Zen 4 was 6 uops per cycle.

Here's another benchmark for Golden Cove:
View attachment 101365

We see the same 5.7 IPC for 6-wide Golden Cove. Note that the results for XOR and SUB are nearly the same as those for MOV.

It's not surprising that we see similar IPC numbers for Zen 4 and Golden Cove, which both are 6-wide architectures. Actually, Zen 3 is also a 6-wide architecture but less optimized for some workloads.

This a micro benchmark to mesure rename BW. It only measures that and doesn't represent application IPC.

Second time I write it.

TwistedAndy said:
It may sound weird, but engineers at Intel, AMD, Apple, and many other companies are not dumb. There are reasons why Intel, for example, decided to have a 9-wide decoder, an 8-wide multiplexer for uops, and 8 ALUs in Skymont.

Captain obvious.

You seem to not read what people write. For the nth time: micro benchmarks don't represent a full application performance. You even pretend that GB doesn't represent anything and now you exhibit micro benchmarks that just tell nothing about app perf.

Now as you don't seem completely stupid, I wonder where our misunderstanding comes from.

DavidC1 · Jun 18, 2024

Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.

TwistedAndy · Jun 18, 2024

Nothingness said:
This a micro benchmark to mesure rename BW. It only measures that and doesn't represent application IPC.

When did ADD and SUB become renamed? Also, MOV-elimination (rename) is not always possible. That's why we have a few cases with MOV in that table.

DavidC1 said:
Estimating clocks for the Skymont cluster in Lunar Lake.

Yep, I think in the Lunar Lake, the Skymont clocks will be somewhere between 3.0-3.5 GHz with ~1.5-2 W per core. Lion Cove will use 3-4 W per core.

SiliconFly · Jun 18, 2024

DavidC1 · Jun 18, 2024

Lunarlake not only achieves 40% reduction in SoC power but PHY improved on memory to reduce by 40% as well.

SiliconFly said:
4.6GHz is also an unverified leak. I don't think we should get ahead of ourselves.For all we know, it might top out at just 4GHz!

Then Lion Cove will clock at 5GHz and be 10-15% slower in ST compared to Raptor Cove. No I believe it has a high chance to be correct. There's no indication clocks are lower on Skymont.

Nothingness · Jun 18, 2024

TwistedAndy said:
When did ADD and SUB become renamed? Also, MOV-elimination (rename) is not always possible. That's why we have a few cases with MOV in that table.

Anything that writes a register needs to be renamed (and on x86 that often requires the allocation of two registers: destination + flags). That's the basis of OoOE.

XOR/SUB reg,reg is subject to zero elimination. This saves allocation bandwidth. This is explained in section 4.1.5 of Intel Optimization Ref Manual (I didn't find anything in the Zen4 manual). Depending on how rename works, this can be implemented by using a physical register which is always zero.

You didn't answer to the rest of my post. Nothing to add about games reaching 5.5 IPC?

TwistedAndy · Jun 18, 2024

Nothingness said:
Anything that writes a register needs to be renamed (and on x86 that often requires the allocation of two registers: destination + flags). That's the basis of OoOE.

XOR/SUB reg,reg is subject to zero elimination. This saves allocation bandwidth. This is explained in section 4.1.5 of Intel Optimization Ref Manual (I didn't find anything in the Zen4 manual). Depending on how rename works, this can be implemented by using a physical register which is always zero.

You didn't answer to the rest of my post. Nothing to add about games reaching 5.5 IPC?

Zero elimination has it's own preconditions. It's not something that takes place all the time. And it's described in the Intel manual you are referring to. That's why there are a few cases with MOV operations in the table.

Also, there are add and subtract operations, which directly involved the whole execution pipeline. As we may see, both Intel Golden Cove and AMD Zen 4 are pretty close to the target execution width in multiple tests (5.7 vs 6 uops per cycle).

dullard · Jun 18, 2024

ondma said:
I heard sometime in July for Zen 5, not necessarily as late as the 31st. As for Intel, ARL may be formally announced in October, but the latest rumors I have heard are saying it may be early 2025 for significant availability. Considering the poor availability after the Mountain Lake release, the hype given to Lunar Lake, and the general lack of emphasis on ARL, I dont find that hard to believe at all. Hell, a major laptop manufacturer whose name slips my mind right now is expressing concern that Lunar Lake will not have good availability for the holiday season. I am confused too. I thought ARL was supposed to be the next big thing, while Lunar Lake was a niche product coming after ARL. Now it seems all the emphasis has shifted to LL, not a good sign for ARL, IMO.

Looks like July 15 for the Ryzen 300 Series laptop chips and either preorder or actual availability (unclear) of July 31 for the Ryzen 9000s. Since this is a discussion about Arrow Lake, which is best compared to the Ryzen 9000 line, I used July 31st. https://videocardz.com/newz/amd-ryz...00-sales-start-july-31-according-to-retailers

I don't follow how rumor of one complaint of Lunar Lake availability has much to do with Arrow Lake's release date and/or availability.

What is Mountain Lake?

Lunar Lake and Arrow Lake are totally different market segments. And Lunar Lake is launching sooner than Arrow Lake. So, why does focus on the sooner launching ultralight notebook Lunar Lake have anything to say about higher powered Arrow Lake desktop and higher power mobile chips?

If I follow your logic, that means if Ford promotes its upcoming redesigned Mustang released during a summer then that means that a F150 released a later fall is both bad and delayed until Winter?

TwistedAndy · Jun 18, 2024

dullard said:
Lunar Lake and Arrow Lake are totally different market segments. And Lunar Lake is launching sooner than Arrow Lake. So, why does focus on the sooner launching ultralight notebook Lunar Lake have anything to say about higher powered Arrow Lake desktop and higher power mobile chips?

If I follow your logic, that means if Ford promotes its upcoming redesigned Mustang released during a summer then that means that a F150 released a later fall is both bad and delayed until Winter?

There's a lot of attention paid to Lunar Lake because it clearly shows what to expect from Arrow Lake, which has mostly the same P- and E-cores.

We will get more details about Arrow Lake only in August.

Intel usually announces new desktop CPUs in October. As for mobile CPUs, they are announced at CES in January. There's no reason to think that this year will be different.

Hulk · Jun 18, 2024

DavidC1 said:
Estimating clocks for the Skymont cluster in Lunar Lake.

SpecInt scales at about 90% relative to clocks. With 38% gain in uarch, at the same power we're at about 3.1GHz and at max performance it's at 3.7GHz for single thread, compared to max 2.5GHz for Crestmont LPE.

For the MT version, Intel claims 2.9x with 2x the cores. Core count scaling is similar for the most part at about 90%, so it leaves 53% for uarch+clocks. That leaves us at 10% higher clocks for Skymont cluster with 2x the core count at the same power as MTL LPE, while having 38%/68% per clock gain. Quite impressive!

At max power it calculates out to be slightly under 60% higher clocks.

We know from the deleted leak that Skymont clocks at 4.6GHz in Arrowlake, so in no way we're going to see 5GHz like the extrapolated graph on X.

I really like how you backed into lunar lake clocks for skymont by using Intel peak to peak performance versus crestmont with crestmont clocks being known . Very clever!

Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads

Senior member

Attachments

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Platinum Member

Member

Diamond Member

Diamond Member

Diamond Member

Member

Diamond Member

Diamond Member

Member

Diamond Member

Member

Diamond Member

Platinum Member

Member

Golden Member

Platinum Member

Diamond Member

Member

Elite Member

Member

Diamond Member