Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads


Tigerick

Senior member
Apr 1, 2022
910
828
106
Wildcat Lake (WCL) Preliminary Specs

Intel Wildcat Lake (WCL) is an upcoming mobile SoC replacing ADL-N. WCL consists of two tiles: a compute tile and a PCD tile. The compute tile is a true single die containing the CPU, GPU and NPU, fabbed on the Intel 18A process. Last time I checked, the PCD tile is fabbed on TSMC's N6 process. The two are connected through UCIe rather than Intel's D2D interconnect, a first for Intel. I expect a launch at Q2/Computex 2026. In case people don't remember Alder Lake-N, I have created a table below comparing the detailed specs of ADL-N and WCL. Just for fun, I am throwing in LNL and the upcoming MediaTek D9500 SoC.

| | Intel Alder Lake-N | Intel Wildcat Lake | Intel Lunar Lake | MediaTek D9500 |
| --- | --- | --- | --- | --- |
| Launch Date | Q1-2023 | Q2-2026 ? | Q3-2024 | Q3-2025 |
| Model | Intel N300 | ? | Core Ultra 7 268V | Dimensity 9500 5G |
| Dies | 2 | 2 | 2 | 1 |
| Node | Intel 7 + ? | Intel 18A + TSMC N6 | TSMC N3B + N6 | TSMC N3P |
| CPU | 8 E-cores | 2 P-cores + 4 LP E-cores | 4 P-cores + 4 LP E-cores | C1 1+3+4 |
| Threads | 8 | 6 | 8 | 8 |
| Max Clock (CPU) | 3.8 GHz | ? | 5 GHz | |
| L3 Cache | 6 MB | ? | 12 MB | |
| TDP | 7 W | Fanless ? | 17 W | Fanless |
| Memory | 64-bit LPDDR5-4800 | 64-bit LPDDR5-6800 ? | 128-bit LPDDR5X-8533 | 64-bit LPDDR5X-10667 |
| Max Memory | 16 GB | ? | 32 GB | 24 GB ? |
| Bandwidth | | ~55 GB/s | 136 GB/s | 85.6 GB/s |
| GPU | UHD Graphics | | Arc 140V | G1 Ultra |
| EU / Xe cores | 32 EU | 2 Xe | 8 Xe | 12 |
| Max Clock (GPU) | 1.25 GHz | | 2 GHz | |
| NPU | NA | 18 TOPS | 48 TOPS | 100 TOPS ? |
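For reference, the bandwidth row follows directly from transfer rate and bus width. A minimal sketch using the table's MT/s figures (peak theoretical numbers, not measured bandwidth):

```python
# Peak theoretical bandwidth (GB/s) = transfer rate (MT/s) * bus width (bytes) / 1000
def peak_bw_gbps(mt_per_s: int, bus_bits: int) -> float:
    return mt_per_s * (bus_bits / 8) / 1000

print(peak_bw_gbps(4800, 64))     # ADL-N:  64-bit LPDDR5-4800   -> ~38.4 GB/s
print(peak_bw_gbps(6800, 64))     # WCL:    64-bit LPDDR5-6800   -> ~54.4 GB/s ("~55 GB/s")
print(peak_bw_gbps(8533, 128))    # LNL:   128-bit LPDDR5X-8533  -> ~136.5 GB/s
print(peak_bw_gbps(10667, 64))    # D9500:  64-bit LPDDR5X-10667 -> ~85.3 GB/s
```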







With Hot Chips 34 starting this week, Intel will unveil technical information on the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new generation of platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile built on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024, at least according to Intel's roadmap. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.



 

Attachments: PantherLake.png, LNL.png, INTEL-CORE-100-ULTRA-METEOR-LAKE-OFFCIAL-SLIDE-2.jpg, Clockspeed.png

poke01

Diamond Member
Mar 8, 2022
4,555
5,848
106
AMD/Intel put a lot of transistors into SIMD; Apple uses that budget for integer performance.
Don't Apple also have good FP, especially from M4 onwards?
Not as great as AMD's, but good enough.
 

poke01

Diamond Member
Mar 8, 2022
4,555
5,848
106
SIMD != FP
Yeah, look at the die shot of the Meteor Lake CPU from SemiAnalysis; AVX-512 occupies a lot of space in Redwood Cove.

Maybe there should be some SKUs where ST performance is given preference over SIMD, especially in products like Panther Lake and Lunar Lake.
 

coercitiv

Diamond Member
Jan 24, 2014
7,443
17,731
136
Why would they design Panther Lake to hit 5 GHz?
To lower performance. /joke

Backside power delivery changed the design rules and maybe even Intel's priorities for a mobile product: they may have favored density over peak frequency. Listen to this 3-minute segment on the 18A transition from the perspective of a chip architect. I'll try to summarize via bullet points:
  • power delivery is no longer embedded, that gives us higher density and better vdroop (lower resistance)
  • in traditional designs, power routes were used to shield signal paths, which enabled higher frequencies
  • those power routes are gone, so now they rely on spacing and on making sure "we're not routing things next to each other"
  • "it's not terribly complicated, it's just a change"

My interpretation of the above is that they get a choice between reduced density gains to keep SNR, or reduced clocks to capitalize more on density gains. The price to pay is probably double: there's a penalty on clocks due to noise, and another due to thermals as density increases.

In the end this gets compounded with 18A yields/quality. Based solely on their lineup and the drop in fmax across SKUs, I think yields aren't helping.
 

511

Diamond Member
Jul 12, 2024
5,009
4,522
106
Yeah, look at the die shot of the Meteor Lake CPU from SemiAnalysis; AVX-512 occupies a lot of space in Redwood Cove.

Maybe there should be some SKUs where ST performance is given preference over SIMD, especially in products like Panther Lake and Lunar Lake.
Even in Zen 5, look at the ports occupying a significant amount of area.
 

Meteor Late

Senior member
Dec 15, 2023
347
382
96
Oh they are binned.

OK, let's clarify: they are binned way less aggressively to hit those frequencies. Meaning, you can get 4.6 GHz from an enormous number of samples, unlike Intel's 5.1 GHz chip, which will come in very limited quantities; the most common chips we will see will be the sub-5 GHz ones.
 

OneEng2

Senior member
Sep 19, 2022
946
1,158
106
Yeah, look at the die shot of the Meteor Lake CPU from SemiAnalysis; AVX-512 occupies a lot of space in Redwood Cove.

Maybe there should be some SKUs where ST performance is given preference over SIMD, especially in products like Panther Lake and Lunar Lake.
It's my understanding that Intel intends to eventually add AVX10 to the "mont" processors; however, I believe they will be relegated to a 128-bit wide implementation (versus 512) to save space/power.

This is why I scoff when people claim that "mont" will simply become a P core (complete with SMT) and retain its current size advantage and thermal properties.
 

511

Diamond Member
Jul 12, 2024
5,009
4,522
106
It's my understanding that Intel intends to eventually add AVX10 to the "mont" processors; however, I believe they will be relegated to a 128-bit wide implementation (versus 512) to save space/power.

This is why I scoff when people claim that "mont" will simply become a P core (complete with SMT) and retain its current size advantage and thermal properties.
It can become the P core while being smaller than the current P core; that's entirely possible. It would be a bit larger than the current Atom implementation, but still smaller than the Cove.
 

OneEng2

Senior member
Sep 19, 2022
946
1,158
106
It can become the P core while being smaller than the current P core; that's entirely possible. It would be a bit larger than the current Atom implementation, but still smaller than the Cove.
It is demonstrably true that Intel's P cores have worse PPA than AMD's P cores. This is a decent comparison as well, since both cores accomplish mostly the same thing (although Zen still supports AVX-512).

So one could say that the Intel P core architecture must have some downsides compared to Zen 5.

There really isn't an AMD version of "mont" as "mont" does way fewer things than a Zen 5 core does. This makes it much more difficult to know, with any certainty, what "mont" would look like if it DID support everything.
 

511

Diamond Member
Jul 12, 2024
5,009
4,522
106
It is demonstrably true that Intel's P cores have worse PPA than AMD's P cores. This is a decent comparison as well, since both cores accomplish mostly the same thing (although Zen still supports AVX-512).
So does Intel's P core; it's just fused off, but physically present.
There really isn't an AMD version of "mont" as "mont" does way fewer things than a Zen 5 core does. This makes it much more difficult to know, with any certainty, what "mont" would look like if it DID support everything.
AMD didn't have the resources to create two cores, so they created one and scaled it according to need. Intel did, but now they are running out of resources to keep two separate cores.
 

OneEng2

Senior member
Sep 19, 2022
946
1,158
106
So does Intel's P core; it's just fused off, but physically present.
That would explain some of the bloat. Thanks.
AMD didn't have the resources to create two cores, so they created one and scaled it according to need. Intel did, but now they are running out of resources to keep two separate cores.
I am still questioning the strategy of making two cores with totally different architectures and gluing them together in a single processor to do essentially the same kinds of tasks.

It's different when you make a core designed for specific tasks (like graphics or AI). But two cores that are different doing the same job?

It will be VERY interesting to see how the 288c Clearwater Forest holds up against the Venice D 192c. On paper, a single Zen 5 core is worth about 1.4 "mont" cores. With this metric, the 288c part wins.

Can't wait to see a detailed writeup on these two.
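A back-of-the-envelope version of that comparison, as a sketch; the 1.4x Zen 5 : "mont" per-core ratio is the estimate from the post above, not a measured number:

```python
# Convert both parts into "Zen 5 core equivalents" under the assumed 1.4x per-core ratio.
ZEN5_PER_MONT = 1 / 1.4                 # one "mont" core ~= 0.71 of a Zen 5 core (assumption)

cwf_equiv = 288 * ZEN5_PER_MONT         # 288c Clearwater Forest -> ~205.7 Zen 5 equivalents
venice_d_equiv = 192 * 1.0              # 192c Venice D          -> 192 Zen 5 equivalents

print(f"CWF 288c ~ {cwf_equiv:.0f} Zen 5 equivalents vs Venice D {venice_d_equiv:.0f}")
# On this paper metric, the 288c part comes out roughly 7% ahead.
```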
 

poke01

Diamond Member
Mar 8, 2022
4,555
5,848
106
There is a fundamental issue with Intel's CPU strategy in laptop and server.

Unlike desktop, you cannot power your way through server and mobile, where perf/W is paramount.
 

Josh128

Golden Member
Oct 14, 2022
1,500
2,249
106
It will be VERY interesting to see how the 288c Clearwater Forest holds up against the Venice D 192c. On paper, a single Zen 5 core is worth about 1.4 "mont" cores. With this metric, the 288c part wins.

Can't wait to see a detailed writeup on these two.
I'm more interested to see how the 256c Venice D compares to the 288c CWF, in both performance and power efficiency. 288 is exactly 12.5% more than 256, so core-for-core comparisons will be pretty straightforward between the two. I suspect it will be a bludgeoning in favor of Zen 6D.
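A quick sanity check of that ratio, as a minimal sketch:

```python
# 288 CWF cores vs 256 Venice D cores: the core-count gap is exactly 12.5%.
cwf_cores, venice_d_cores = 288, 256
print(f"{cwf_cores / venice_d_cores - 1:.1%}")  # -> 12.5%
```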
 

alcoholbob

Diamond Member
May 24, 2005
6,390
469
126
It's an interesting concept I suppose.

I guess I believe that in this day and age, there aren't any mysterious CPU architectures that magically work better than everything that came before it.

I believe that Apple, Intel, and AMD all have equivalent engineering teams and tools. The difference is what you target your architecture to do and what things you decide to prioritize and what things you decide to give up.

I believe you can't say I want it all and actually get it all. If you say I want a core that is very power efficient, you can't also say I want a core that clocks higher than the competition.

You can't say I want the core to be very small AND I want 4 way SMT, AVX512, etc, etc.

I do agree that Lion Cove appears to have lower PPA than Zen 5, although I think ARL in general gets a pretty bad rap on the basis of its poor showing in latency sensitive applications (which is mostly a ring bus issue IMO vs a core problem).

I think it is a pretty tall order to take ANY derivative of Skymont and make it compete with Zen 5 across the board. I think you can make it do some things better, but at the expense of doing other things worse.

I just don't see getting something for nothing in engineering.

I assume "ringbus issue" is not synonymous with ring bus frequency? Because you can OC the ring bus to match Raptor Lake's ring frequency and it's still miles slower. In fact, even when frequency normalized and ring-frequency normalized, hell, even if you overclock both of these faster than a comparable RPL processor, it's still slower than a similarly tuned Raptor Lake CPU by a good 20% in gaming if you look at Skatterbencher's or DannyZReviews' numbers. To me it's got to be some kind of cache latency or die-to-die communication problem.
 

511

Diamond Member
Jul 12, 2024
5,009
4,522
106
I'm more interested to see how the 256c Venice D compares to the 288c CWF, in both performance and power efficiency. 288 is exactly 12.5% more than 256, so core-for-core comparisons will be pretty straightforward between the two. I suspect it will be a bludgeoning in favor of Zen 6D.
Pretty sure Venice Dense will win; it's a much newer arch + HT + AVX-512. VD's real competitor is more like DMR.
 

DavidC1

Platinum Member
Dec 29, 2023
2,003
3,148
96
Pretty sure Venice Dense will win; it's a much newer arch + HT + AVX-512. VD's real competitor is more like DMR.
Clearwater Forest was a formidable part in its original timeline. As it now stands, it's at least 6 months delayed. I speculate the delays are also going to result in slightly lower specs than originally planned, meaning it'll be both late and worse. Just like Knights Landing Xeon Phi, which got a 9-month delay plus 10% higher power and 10% lower performance. Now CWF has, at best, a 6-month TTM advantage over Venice, which is bad. Then conveniently sales will tank and they will say "no one wanted E-core servers" when it's really because of execution failures.
I guess I believe that in this day and age, there aren't any mysterious CPU architectures that magically work better than everything that came before it.

I just don't see getting something for nothing in engineering.
Isn't that ignoring the reality that there are always people who are better than you at something? There is *always* someone, some group, that's better, faster, stronger than the rest. They just are. And that gets compounded when you are talking about a group of people, a team. Have you seen MasterChef? How some contestants just fumble? Aren't they given the same ingredients and tools as everyone else? How do some consistently do well while others fumble? Aren't they all under the same "engineering limits"?
I assume "ringbus issue" is not synonymous with ring bus frequency? Because you can OC the ring bus to match Raptor Lake's ring frequency and it's still miles slower. In fact, even when frequency normalized and ring-frequency normalized, hell, even if you overclock both of these faster than a comparable RPL processor, it's still slower than a similarly tuned Raptor Lake CPU by a good 20% in gaming if you look at Skatterbencher's or DannyZReviews' numbers. To me it's got to be some kind of cache latency or die-to-die communication problem.
Things like the ring bus are high-level blocks. There are many details that aren't being told, the result of countless decisions that will never be seen by the public. The chaos of multiple CEO changes, process issues, employee turnover, and brain drain all takes its toll. So even if you run the ring at the same frequency, it's just not the same as before. Maybe the ring is OK, but the caches are not. Maybe the routers are underperforming. The "chaos" we have seen is internal problems being so bad that they ooze out externally, so we get glimpses of how bad it is.

There are also much simpler factors at play. If you are an engineer at Intel, a company the public sees as the weaker player in a fading market (whether that is true or not, it's the perception), and you hear of endless layoffs, how motivated will you be, and how hard will you work? Roughly two-thirds of employees say they would leak NDA-covered material if they were laid off or fired and saw that action as unjust. Now carry that thought, and those actions, up to a company of 100K people.

I think it can simultaneously be true that 18A is OK and yet it has underperformed in a product. It's not enough for 18A to be OK; it's just the baseline for building something. And because even in the most optimistic scenario 18A is still not attractive enough for a third-party company to spend money porting to it, we only have Intel products to judge it by. So regardless of whether it's entirely the fault of 18A, a combination of 18A and the design, or entirely the design, the end result is the same: if it disappoints, it disappoints, and if it doesn't, it won't.
 

DavidC1

Platinum Member
Dec 29, 2023
2,003
3,148
96
I'll ask again, what is the 18A Fmax when power isn't limited? How does that compare to Intel's other nodes?
Dullard, @Meteor Late already responded to you in post #22,721. Let me jog your memory.

He said:
That's completely irrelevant, because those power levels are only reached in multithreaded apps; none of these are going to consume 45W or more in 1T scenarios, which is where those frequencies happen. That is to say, Panther Lake doesn't have a lower fmax just because it has a lower PL2; that's not how it works.
Lunar Lake has a 37W PL2 at most, but it uses ~10W to hit its 5.1 GHz turbo. That's not even close to the low-end TDP (PL1) limit of 15W. Not even desktop processors need 40W to reach peak turbo, whether you are talking about 5.7 GHz Arrow Lake or 6 GHz Raptor Lake.

So we can rule out power as the fmax limiter. The culprit is elsewhere.
SIMD != FP
Well, it's being used for the same purpose.
Yeah, look at the die shot of the Meteor Lake CPU from SemiAnalysis; AVX-512 occupies a lot of space in Redwood Cove.

Maybe there should be some SKUs where ST performance is given preference over SIMD, especially in products like Panther Lake and Lunar Lake.
This is what I mean: AVX-512 should have been "AVX3-256". Even in servers, the 512-bit width gets you only ~15% on average; it's the ISA that brings most of the gains over AVX2. You hurt everything else just to satisfy the HPC crowd. In CPUs you are bound by everything else: you could have more L/S units, more memory bandwidth, more cache bandwidth, which speeds up every other application. I've seen tests for a Xeon where it got ~10% of its theoretical FLOPS. Heck, if you use Intel LINPACK and test for FLOPS, it doesn't even reach 100% of theoretical FLOPS on a test that is designed to measure FLOPS!

Now, I can speculate that the way they justify the extra space for 512-bit over 256-bit is to look at the core size and say: if it delivers 15% perf for 15% area, it makes sense. But AVX-512 only offers that in a select set of applications. What if that 15% area were spent on beefing up general-purpose performance instead?

What is better overall?
A) 256-bit AVX with 1.15x in everything
B) 512-bit AVX, but 1x in everything

Remember, option A means the 1.15x is a boost in AVX code as well. So if an application gained 30% by going from 256-bit to 512-bit, option A's 1.15x uplift means you are comparing 1.3/1.15, roughly a 13% disadvantage in that AVX-512-favoring application, while being 15% faster in everything else.
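A worked version of that trade-off, as a sketch; the 15% general uplift and 30% AVX-512 gain are the assumptions from the post above, not measured figures:

```python
# Option A: 256-bit AVX, but the saved area buys +15% performance everywhere (including AVX code).
# Option B: 512-bit AVX, baseline 1.0x everywhere else.
GENERAL_UPLIFT_A = 1.15   # assumed gain from spending the extra area on general-purpose perf
AVX512_APP_GAIN  = 1.30   # assumed gain of 512-bit over 256-bit in an AVX-512-friendly app

b_over_a_in_avx = AVX512_APP_GAIN / GENERAL_UPLIFT_A
print(f"Option B lead in AVX-512-heavy code: {b_over_a_in_avx:.2f}x")   # ~1.13x, i.e. ~13%
print(f"Option A lead everywhere else:       {GENERAL_UPLIFT_A:.2f}x")  # 1.15x, i.e. 15%
```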
 

Joe NYC

Diamond Member
Jun 26, 2021
3,888
5,410
136
This is what I mean: AVX-512 should have been "AVX3-256". Even in servers, the 512-bit width gets you only ~15% on average; it's the ISA that brings most of the gains over AVX2. You hurt everything else just to satisfy the HPC crowd. In CPUs you are bound by everything else: you could have more L/S units, more memory bandwidth, more cache bandwidth, which speeds up every other application. I've seen tests for a Xeon where it got ~10% of its theoretical FLOPS. Heck, if you use Intel LINPACK and test for FLOPS, it doesn't even reach 100% of theoretical FLOPS on a test that is designed to measure FLOPS!

Now, I can speculate that the way they justify the extra space for 512-bit over 256-bit is to look at the core size and say: if it delivers 15% perf for 15% area, it makes sense. But AVX-512 only offers that in a select set of applications. What if that 15% area were spent on beefing up general-purpose performance instead?

What is better overall?
A) 256-bit AVX with 1.15x in everything
B) 512-bit AVX, but 1x in everything

Remember, option A means the 1.15x is a boost in AVX code as well. So if an application gained 30% by going from 256-bit to 512-bit, option A's 1.15x uplift means you are comparing 1.3/1.15, roughly a 13% disadvantage in that AVX-512-favoring application, while being 15% faster in everything else.

That's water under the bridge, at this point.

What is interesting is what comes next. For example, on the AMD side, after spending the die area on full AVX-512 in Zen 5 and having to squeeze it into the N4 node, what comes next with a two-node jump and a jump in transistor budget?

The vast majority of the transistor budget was spent on extra cores. And it's the same story on the Intel side with NVL.

Maybe the path of throwing more transistors at extracting more ST performance (mentioned above in the thread) is very constrained...
 

DavidC1

Platinum Member
Dec 29, 2023
2,003
3,148
96
That's water under the bridge, at this point.

What is interesting is what comes next. For example, on the AMD side, after spending the die area on full AVX-512 in Zen 5 and having to squeeze it into the N4 node, what comes next with a two-node jump and a jump in transistor budget?
N2 isn't really a two-node jump, more like one at best. N2 gives you almost no density improvement. And this is by 2010+ standards, not the golden age. Compared to the golden age of scaling, we're talking 0.5x.

The video that @coercitiv linked about 18A continues the trend of smaller gains requiring more effort. We went from the golden age of scaling to dung level.
The vast majority of the transistor budget was spent on extra cores. And it's the same story on the Intel side with NVL.

Maybe the path of throwing more transistors at extracting more ST performance (mentioned above in the thread) is very constrained...
It's not that constrained. If you expand everything (the decoders, the ports, the units), you'll get that performance. Even Keller said you can easily have 1K-entry-ROB CPUs. Intel is much closer there, but their monts aren't, and AMD isn't there either. Google said the idea of a decode limit at 2-3 wide is wrong, because there is plenty of code that can sustain 6, 8, or even more. Many have also said decode isn't sized for the average but for the peaks. Also, average ILP has increased compared to the old days, when everything, like branch prediction and caches, was slower and smaller. So you can't take the knowledge from the Pentium Pro days and apply it to today, because even the original Raspberry Pi is better than that. No one expands decode while leaving everything else alone. It's a general-purpose CPU, so you have to beef up everything else anyway. So extra decoders come with more and faster BTBs, better branch algorithms, faster and wider caches, more compute units, more buffers, and more L/S units.

Actually, some expressed doubt there would be big gains after the Athlon 64! The trick is not doing it the Intel P-core way, which causes bloat. You need innovation otherwise, which doesn't fall from the sky but comes as new ideas from engineers. This is what @OneEng2 overlooks. The E core needed new innovations to make that "catch-up": predecode bits in L1, a multi-level predecode cache, clustered decode, nanocode; they are all new ideas. Sandy Bridge was good too, because it brought physical register files and the uop cache. Further efficiency gains will need additional new ideas.

Let's imagine a hypothetical future where CPUs are like everything else, meaning no more Moore's Law: the last process node ever came two years ago. What do you do? No gains from then on? Absolutely not. Designs will continue to be refined. Even internal combustion engines keep getting better, and there's no such thing as Moore's Law there. And some believe they can improve much further with things like rotary-valve engines.
 