Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads

Tigerick · Aug 22, 2022

Wildcat Lake (WCL) Preliminary Specs

Intel Wildcat Lake (WCL) is an upcoming mobile SoC replacing ADL-N. WCL consists of 2 tiles: a compute tile and a PCD tile. The compute tile is a true single die containing the CPU, GPU and NPU, fabbed on the 18-A process. Last time I checked, the PCD tile is fabbed on the TSMC N6 process. They are connected through UCIe, not D2D; a first from Intel. Expect a launch in Q2/Computex 2026. In case people don't remember Alder Lake-N, I have created a table below to compare the detailed specs of ADL-N and WCL. Just for fun, I am throwing in LNL and the upcoming MediaTek D9500 SoC.

	Intel Alder Lake - N	Intel Wildcat Lake	Intel Lunar Lake	Mediatek D9500
Launch Date	Q1-2023	Q2-2026 ?	Q3-2024	Q3-2025
Model	Intel N300	?	Core Ultra 7 268V	Dimensity 9500 5G
Dies	2	2	2	1
Node	Intel 7 + ?	Intel 18-A + TSMC N6	TSMC N3B + N6	TSMC N3P

CPU	8 E-cores	2 P-core + 4 LP E-cores	4 P-core + 4 LP E-cores	C1 1+3+4
Threads	8	6	8	8
Max Clock	3.8 GHz	?	5 GHz
L3 Cache	6 MB	?	12 MB
TDP	7 W	Fanless ?	17 W	Fanless

Memory	64-bit LPDDR5-4800	64-bit LPDDR5-6800 ?	128-bit LPDDR5X-8533	64-bit LPDDR5X-10667
Size	16 GB	?	32 GB	24 GB ?
Bandwidth		~ 55 GB/s	136 GB/s	85.6 GB/s

GPU	UHD Graphics		Arc 140V	G1 Ultra
EU / Xe	32 EU	2 Xe	8 Xe	12
Max Clock	1.25 GHz		2 GHz

NPU	NA	18 TOPS	48 TOPS	100 TOPS ?
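
For anyone wanting to check the Bandwidth row: a minimal sketch of the arithmetic, assuming peak theoretical bandwidth is simply bus width in bytes times transfer rate in MT/s. The function name `peak_bandwidth_gbs` is made up for this example, and the WCL memory speed is still the unconfirmed figure from the table.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Peak theoretical bandwidth in GB/s (decimal) for a bus width and MT/s rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts / 1000


# Memory configurations copied from the table above (WCL is unconfirmed).
configs = {
    "ADL-N (64-bit LPDDR5-4800)": (64, 4800),
    "WCL (64-bit LPDDR5-6800 ?)": (64, 6800),
    "LNL (128-bit LPDDR5X-8533)": (128, 8533),
    "D9500 (64-bit LPDDR5X-10667)": (64, 10667),
}

for name, (width, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(width, rate):.1f} GB/s")
```

By this formula the empty ADL-N cell would be about 38.4 GB/s, WCL about 54.4 GB/s and LNL about 136.5 GB/s. The D9500 comes out near 85.3 GB/s, slightly under the 85.6 GB/s listed.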

With Hot Chips 34 starting this week, Intel will unveil technical information on the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also brings a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, called RibbonFET.

DrMrLordX · Oct 12, 2024

reb0rn said:
Is this AMD topic or I got lost?

Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.

MS_AT · Oct 12, 2024

DavidC1 said:
I think the fact that the combined performance is a meagre 9% is a testament to whatever they did not working, because if they boosted FP in a general way like on Skymont, the FP portion would have done a lot better.

Not getting good gains on perf/clock is what I count as important. The P core team has forever talked a lot about what's a bottleneck or whatever, but aside from Pentium M, Core 2, and Sandy Bridge, the results were usually disappointing. 10% for Haswell, 10% for Skylake in an era where it was far easier to get big gains on process. What the hell were they doing?

Skymont is capable of legacy and FMA execution on all 4 ports. Even Lion Cove is only FMA on 2 pipes and other 2 are FP Add, so the gains won't be universal.

While it's commendable that Skymont can do FMA on 4 ports, in reality this is done only to match P-core throughput. I am not sure what exactly you mean by legacy instructions, but x64 did not have SIMD FMA instructions before the FMA extension, which was introduced at the same time as AVX. Because of that, compilers will often assume FMA is present when you request AVX instruction sets, as IIRC at the time there was no CPU that supported FMA but not AVX. Why is this important? Because this was done over 10 years ago. So we either have software on the market that practically supports FMA and AVX together, which means it will almost exclusively use the 256b version of FMA to match AVX, or we have SSE-only software that does not make use of FMA at all.

In the first case Skymont will use all its execution units to keep up with Raptor Cove, but Raptor Cove can also do an addition at the same time, and Lion Cove follows up with one more addition, while Skymont is fully occupied. In the second case FMA doesn't matter.

If we consider scalar FP operations, then yes, Skymont will have an advantage over all P cores until Lion Cove, which will be able to match it for mixed instruction streams [in my experience it's hard to encounter FMA-only code], but due to different design goals it will lose to Lion Cove in absolute performance because of clock differences.
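
To make the 256b case concrete, here is a rough sketch of the per-cycle bookkeeping. The pipe counts and widths are assumptions taken from this discussion (Skymont as 4 x 128-bit FMA pipes, Raptor Cove and Lion Cove as 2 x 256-bit FMA pipes with 1 and 2 extra FP-add pipes respectively), not from a spec sheet, and the `FpCore` class is invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpCore:
    name: str
    fma_pipes: int        # pipes able to execute FMA
    fma_width_bits: int   # vector width one FMA pipe handles per cycle
    extra_add_pipes: int  # FP-add pipes still free while the FMA pipes are busy

    def fma256_per_cycle(self) -> float:
        """256-bit FMA operations sustained per cycle (a 256-bit op needs two 128-bit pipes)."""
        return self.fma_pipes * self.fma_width_bits / 256


# Assumed configurations, as described in the posts above.
CORES = [
    FpCore("Skymont", fma_pipes=4, fma_width_bits=128, extra_add_pipes=0),
    FpCore("Raptor Cove", fma_pipes=2, fma_width_bits=256, extra_add_pipes=1),
    FpCore("Lion Cove", fma_pipes=2, fma_width_bits=256, extra_add_pipes=2),
]

for core in CORES:
    print(f"{core.name}: {core.fma256_per_cycle():.0f} x 256b FMA/cycle, "
          f"{core.extra_add_pipes} FP add pipe(s) left over")
```

Under these assumptions all three cores sustain the same number of 256b FMAs per cycle, and the difference is only in how much FP-add capacity is left over alongside them.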

Hulk · Oct 12, 2024

OneEng2 said:
With Raptor Lake, Intel held a 24 vs 16 "Full Core" advantage which, in theory, should have put that processor ahead of Zen 5 in terms of multi-threaded performance. Instead, it took a beating:

AMD Ryzen 9 5950X and 5900X Review: Zen 3 Breaks the 5 GHz Barrier

AMD's Magnum Opus

www.tomshardware.com

Based on your well thought out calculations, where do you see Arrow Lake performing in these multi-thread heavy workloads/benchmarks relative to Zen 5 and Raptor Lake?

I don't think we're supposed to get into AMD vs. Intel here but I don't see Raptor Lake in that review you posted.

DavidC1 · Oct 12, 2024

Hulk said:
Okay so Lion Cove architecture is closer to Skymont than Raptor Cove? I'm not arguing the point, just trying to understand.

No way.

-Instruction length data stored in L1i $
-Clustered decode
-OD-ILD, which decodes instruction length on the fly

-Very wide Retire, which is not just an attempt to widen it but is done because it also allows other structures to be shrunk. An ideological difference of carefully adding and taking out as needed for area/power efficiency, unlike the P cores.
-More Store ALUs than Load ALUs. My guess is this benefits the clustered decode.
-Fast path hardware, which can be power hungry, versus Nanocode, which adds a specific microcoded ROM to each cluster. So it won't suddenly become 10x faster, but now it won't block other decode, plus it can be parallelized. If you were OK with having an instruction that's dog slow, then you don't suddenly need it to be 10x faster than anything else. This is also a careful balance, unlike the brute force approach of P cores.
-The E cores go for many simple units made for specific tasks over a few powerful units that can do more, like on the P cores.

Can anyone tell me which part of Lion Cove, or any P core since Sandy Bridge for that matter was anything more than just add, add, add? Bigger units, more units, that has been the case since Haswell. Even Sandy Bridge for that matter had a core 50% larger than the predecessor. The saving grace for that chip was there were innovations and wasn't just pure expansions.

The E core team also doesn't shy away from changing things drastically. From Goldmont to Tremont, it had the L2 predecode cache. In Gracemont they took it out entirely and replaced it with the OD-ILD.

Since the L2 predecode was a rather large 128KB SRAM, taking it out was an efficient choice, and OD-ILD isn't limited by low hitrate on large datasizes, meaning it performs better. It's quite amazing to me that they took out an entire feature, added a new one while delivering substantial performance gain overall.

Hulk · Oct 12, 2024

DavidC1 said:
Can anyone tell me which part of Lion Cove, or any P core since Sandy Bridge for that matter was anything more than just add, add, add? Bigger units, more units, that has been the case since Haswell. Even Sandy Bridge for that matter had a core 50% larger than the predecessor. The saving grace for that chip was there were innovations and wasn't just pure expansions.

This is what I had thought. It's been wider on the front, then wider on the back, a little "smarter" here and there.
But I know most around here are very keen on this stuff so I ask a lot of questions but I don't push back

Markfw · Oct 12, 2024

Hulk said:
This is what I had thought. It's been wider on the front, then wider on the back, a little "smarter" here and there.
But I know most around here are very keen on this stuff so I ask a lot of questions but I don't push back

All I have to add is that until it's released and benchmarked, we will not really know how good it is.

DavidC1 · Oct 12, 2024

AMDK11 said:
The main difference in Skymont is clustered decoding without UOP cache. But this solution was used in Skymont not because it was better, but because it saved logic and complexity.

This is false. It saves a considerable amount of power, and area in the decode section, which can be used to boost elsewhere. Plus,

This is from the Intel x86 optimization manual:

This overall approach to x86 instruction decoding provides a clear path forward to very wide designs without needing to cache post-decoded instructions.

Gracemont and Skymont have a load balancer, so you don't need to add branch instructions: if no branch occurs for a long time, the hardware inserts fake branches so the clusters can continue to work in parallel.

The description of Skymont's decoder says that it is even capable of handling loops, so it doesn't need a loop buffer. Since x86 instructions encounter branches every 6 cycles, and it is easier to fill a 3-wide decoder than wider ones, it'll often end up being better than the traditional approach.

Thunder 57 · Oct 12, 2024

DavidC1 said:
No way.

-Instruction length data stored in L1i $
-Clustered decode
-OD-ILD, which decodes instruction length on the fly

For those less knowledgeable, what is an OD-ILD?

DavidC1 · Oct 12, 2024

Thunder 57 said:
For those less knowledgeable, what is an OD-ILD?

It is a feature introduced on Gracemont. It stands for On-Demand Instruction Length Decoder. It does exactly what it sounds like, since x86 instructions are variable length.

Here's a comment on RWT about Clustered Decode, and why it's not just a "cheap" feature:

Note 1: The above is missing the main point of having _multiple_ N-wide decoders: the average basic block (=BB) size is approximately 5 x86 instructions (4 linear instructions + 1 BB-terminating branch instruction) and if we assume that 50% of those terminating branches are at run-time resolved as TAKEN branches then it implies that _all_ x86 CPUs in the near future will be _required_ to have _multiple_ N-wide decoders.
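
The numbers in that note can be turned into a quick back-of-the-envelope sketch. It assumes every basic block is the quoted 5 instructions long and that each terminating branch is taken independently with the quoted 50% probability, which is a simplification; both function names are made up for this example.

```python
def expected_run_length(bb_size: float, p_taken: float) -> float:
    """Mean number of sequential instructions before a taken branch redirects fetch."""
    # The number of blocks until the first taken branch is geometric with mean 1 / p_taken.
    return bb_size / p_taken


def prob_run_of_k_blocks(k: int, p_taken: float) -> float:
    """Probability that a sequential run covers exactly k basic blocks."""
    return (1 - p_taken) ** (k - 1) * p_taken


BB_SIZE = 5      # instructions per basic block, from the quoted note
P_TAKEN = 0.5    # share of terminating branches that are taken, from the quoted note

print(f"mean sequential run: {expected_run_length(BB_SIZE, P_TAKEN):.0f} instructions")
for k in range(1, 4):
    print(f"run of {k} block(s) = {k * BB_SIZE} instructions: "
          f"{prob_run_of_k_blocks(k, P_TAKEN):.1%} of runs")
```

With those inputs, half of all sequential runs end after a single 5-instruction block, which is the situation where one very wide decoder sits partly idle and several narrower clusters working on different blocks can stay busy.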

511 · Oct 12, 2024

OneEng2 said:
Absolutely.

I do admit that I had forgotten that Netburst had only a 1 wide decoder. Pentium III and "Core 2" were absolutely better designs.

Reading through the litany of architectural changes in Lion Cove, it certainly appears that this core should be a rocket ship ..... but as with many designs (and I have quite a few under my belt), "It sure looked good on the white board". Actually, I am of the belief that Lion Cove (and its sister designs) will grow into a very successful core design for Intel ...... in a couple of years. Unlike the disaster that was Netburst and Bulldozer, I don't see anything fundamentally mis-calculated here, only a need for both process and design optimization.

Unfortunately, these things take time. Based on the information that we have at this time, Intel will not be "back on top" again until 18A and some design tweaks come about (a couple of years I think).

For those that think me "Anti-Intel", I am absolutely not. No sane person in the world would wish for anything other than strong competition in the market. Furthermore, and on a more personal note, I happen to be a US vet and a long time CPU architecture buff. I want a strong US IP for my country, and Intel is it.

Opinion: Intel needs to fire a bunch of business majors and focus on their product strategy (vs figuring out how to better leverage their monopoly position to maximize their profit). It is my opinion that Intel stagnated under a bunch of tight neck ties and desperately needs an engineering kick in the a$$. In their engineering lethargy, Intel have allowed TSMC to flank them severely. AMD simply hit a great combination of design and lithography advances available and executed on it. In theory, having a vertically integrated design and foundry process SHOULD have allowed Intel to dominate the industry indefinitely. It is only in Intel's horrendous lack of forward vision that AMD and TSMC have unseated them. I believe Pat G can put the company back on track ........ if he gets enough time. It's really hard to work your way through an army of pencil necks. [/end rant]

Definitely. Fab leadership was lost due to some idiots not financing it enough, and it took design down with it because of how interdependent they were. Pat G fixed the nodes, something previous CEOs were incapable of.

reb0rn · Oct 12, 2024

DrMrLordX said:
Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.

Many of you here just spread FUD about AMD. No one cares about some stupid, half-baked bench done by Tom's about an AMD server CPU here in an Intel topic, nor does anyone care about people spreading hype over FUD info, like someone who looks at one core's speed on a 192-core server and hypes its single-thread speed.

DrMrLordX · Oct 12, 2024

reb0rn said:
no one cares about some stupid, half-baked bench done by Tom's

How about the Phoronix reviews involving Granite Rapids? There are at least two of them, one from when it launched (versus previous-gen server parts) and one more showcasing Granite Rapids versus the current competition. I haven't even looked at the Tom's review yet.

OneEng2 · Oct 12, 2024

DrMrLordX said:
Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.

OMG! I totally missed this most excellent jest! I was somehow thinking that the mighty @Hulk had some LN cooled Intel super rig back in the day. You are so right, that was quite a miss. Turns out transistors make a lot of heat when you switch them that fast. Who knew.

@Hulk : Please forgive my stupidity. I won't underestimate you again.

Hulk said:
I don't think we're supposed to get into AMD vs. Intel here but I don't see Raptor Lake in that review you posted.

... and you are again quite correct. I missed that as well (I just turned 59 this week, so I'm going to chalk it up to my newly found "old age").

reb0rn said:
Many of you here just spread FUD about AMD. No one cares about some stupid, half-baked bench done by Tom's about an AMD server CPU here in an Intel topic, nor does anyone care about people spreading hype over FUD info, like someone who looks at one core's speed on a 192-core server and hypes its single-thread speed.

DrMrLordX said:
How about the Phoronix reviews involving Granite Rapids? There are at least two of them, one from when it launched (versus previous-gen server parts) and one more showcasing Granite Rapids versus the current competition. I haven't even looked at the Tom's review yet.

@reb0rn
Would love to see benchmarks that depict GNR in a good light against Turin if you have them. It would certainly be good news for Intel if this were the case.

DavidC1 · Oct 12, 2024

OneEng2 said:
Turns out transistors make a lot of heat when you switch them that fast. Who knew.

Not only that, we know now that we literally can't make a chip that fast.

The world overclocking record is 9.1 GHz with Raptor Lake. Yeah, the chip that breaks down if you just look at it.

Hulk · Oct 12, 2024

OneEng2 said:
OMG! I totally missed this most excellent jest! I was somehow thinking that the mighty @Hulk had some LN cooled Intel super rig back in the day. You are so right, that was quite a miss. Turns out transistors make a lot of heat when you switch them that fast. Who knew.

@Hulk : Please forgive my stupidity. I won't underestimate you again.

... and you are again quite correct. I missed that as well (I just turned 59 this week, so I'm going to chalk it up to my newly found "old age").

@reb0rn
Would love to see benchmarks that depict GNR in a good light against Turin if you have them. It would certainly be good news for Intel if this were the case.

@OneEng2,
No worries! I didn't respond because I was pretty sure you didn't scroll down and see the punchline.

Hulk · Oct 12, 2024

We all know the old line about the broken clock being right twice a day or the blind squirrel sometimes stumbling on a nut. Not MLID.

DavidC1 · Oct 12, 2024

Hulk said:
We all know the old line about the broken clock being right twice a day or the blind squirrel sometimes stumbling on a nut. Not MLID.

He's expecting 30-40% ST and 15-20% ST for Panther Lake desktop over Arrow Lake.

I don't think Panther Lake desktop even exists? And Panther Lake is supposed to be basically an 18A shrink?

I admit, he got a few things right, but those numbers are purely made up.

sgs_x86 · Oct 13, 2024

Hulk said:
Did some further CB R24 MT testing. As you add P cores strange things happen. I ran these tests at 5GHz P, 4GHz E, just to make sure there is no throttling or other "at the limit" behavior.

Anyway, assuming Raptor Cove scores 22.6 points/GHz here are the scores as you add P cores to 16 E cores during the render.
1P+16E - 15.2 points/GHz for E's
2P+16E - 15.4
4P+16E - 14.7
6P+16E - 14.0
8P+16E - 13.1

Other than the increase from 1 to 2 P's, the IPC of the E's decreases as P's are added. Anybody have any reasoning for this behavior?

It is of course possible the IPC of the P's are also or only changing but I have found the P IPC to be relatively stable when testing various number of P's.

It's a rabbit hole not worth spending too much time on but I have a hard time leaving it alone...

I could be horribly wrong, so please correct me. When only 1 P-core is enabled, the E-cores get to use most of the shared L3 cache. As more P-cores are enabled, the E-cores' share of the L3 shrinks, and so does their IPC. We know that E-cores love cache.
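
To put a size on the effect, a small sketch that expresses each quoted E-core result as a change relative to the 1P+16E run. It only reuses the points/GHz figures from the quoted post; the helper name `relative_change_pct` is made up for this example.

```python
# E-core points/GHz from the quoted CB R24 runs, keyed by number of active P cores.
E_POINTS_PER_GHZ = {1: 15.2, 2: 15.4, 4: 14.7, 6: 14.0, 8: 13.1}


def relative_change_pct(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value / baseline - 1) * 100


baseline = E_POINTS_PER_GHZ[1]
for p_cores, ppg in E_POINTS_PER_GHZ.items():
    change = relative_change_pct(ppg, baseline)
    print(f"{p_cores}P+16E: {ppg:.1f} points/GHz ({change:+.1f}% vs 1P+16E)")
```

Going from 1 to 8 P cores, the E cores lose roughly 14% of their per-clock throughput in these runs, so whatever the cause is, it is not a rounding-error effect.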

igor_kavinski · Oct 13, 2024

It could also be a power thing where at the same power limit, the E-cores just get less and less power per core as more P-cores get added.

AcrosTinus · Oct 13, 2024

igor_kavinski said:
It could also be a power thing where at the same power limit, the E-cores just get less and less power per core as more P-cores get added.

I think it is just one of the core negatives of the ringbus. The more stops it has, the worse it performs.

Josh128 · Oct 13, 2024

Honestly, other than the perf uplifts, he got the launch timeframe right. Was he right about the iGPU? As for the 8+32 thing, it doesn't look like that will happen, but the fact that Intel is using "285K" as its initial top SKU seems to indicate that a "295K" is either planned or was planned. A completely new and larger die just for a halo SKU for more MT seems like a giant waste of engineering resources and silicon though if 285K already mostly beats 9950X in MT.

So there actually may have been something to that at some point, or still could be.

Hulk · Oct 13, 2024

igor_kavinski said:
It could also be a power thing where at the same power limit, the E-cores just get less and less power per core as more P-cores get added.

But the frequency of the E's is holding steady?

511 · Oct 13, 2024

Josh128 said:
Honestly, other than the perf uplifts, he got the launch timeframe right. Was he right about the iGPU? As for the 8+32 thing, it doesn't look like that will happen, but the fact that Intel is using "285K" as its initial top SKU seems to indicate that a "295K" is either planned or was planned. A completely new and larger die just for a halo SKU for more MT seems like a giant waste of engineering resources and silicon though if 285K already mostly beats 9950X in MT.

Agreed, but missing an estimate by a few percent and making ridiculous claims are two different things.

igor_kavinski · Oct 13, 2024

Hulk said:
But the frequency of the E's is holding steady?

Then it's either memory bandwidth starvation or ring traffic congestion. Can't be the former in CB R23. With CB R24, it's a possibility. The ring thing is kinda dangerous if you raise its frequency due to degradation possibility.

Hulk · Oct 13, 2024

AcrosTinus said:
I think it is just one of the core negatives of the ringbus. The more stops it has, the worse it performs.

Could be. The P's perform about the same as you add them, but they have a much larger L2 per core, and that may be enough for CB, while the E's might need more and have to go to the ring more often.

You can make a CB "calculator" but you need a few versions to take into account the varying performance of the E's depending on the P+E configuration.
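
A minimal sketch of what such a calculator could look like, assuming the score is just the sum of per-core points/GHz times clock for each core type, and that the points/GHz figures are per core. The P-core value and the E-core table come from my CB R24 runs at 5GHz P / 4GHz E with 16 E cores; the function name `estimate_cb_r24_mt` is made up for this example, and configurations without a measurement are rejected rather than guessed.

```python
P_POINTS_PER_GHZ = 22.6  # Raptor Cove figure assumed in the test runs
# Measured E-core points/GHz, keyed by number of active P cores (16 E cores active).
E_POINTS_PER_GHZ = {1: 15.2, 2: 15.4, 4: 14.7, 6: 14.0, 8: 13.1}


def estimate_cb_r24_mt(p_cores: int, p_ghz: float = 5.0,
                       e_ghz: float = 4.0, e_cores: int = 16) -> float:
    """Estimate the MT score as per-core points/GHz x clock, summed over P and E cores."""
    if p_cores not in E_POINTS_PER_GHZ:
        raise ValueError(f"no E-core measurement for {p_cores} P cores")
    p_part = p_cores * P_POINTS_PER_GHZ * p_ghz
    e_part = e_cores * E_POINTS_PER_GHZ[p_cores] * e_ghz
    return p_part + e_part


for p in sorted(E_POINTS_PER_GHZ):
    print(f"{p}P+16E: ~{estimate_cb_r24_mt(p):.0f} points")
```

The lookup table is the "few versions" part: each P+E configuration carries its own E-core figure instead of one fixed number.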
