Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads


Tigerick

Wildcat Lake (WCL) Preliminary Specs

Intel Wildcat Lake (WCL) is an upcoming mobile SoC replacing ADL-N. WCL consists of 2 tiles: a compute tile and a PCD tile. The compute tile is a true single die containing the CPU, GPU and NPU, fabbed on the 18A process. Last time I checked, the PCD tile is fabbed on TSMC's N6 process. They are connected through UCIe, not D2D; a first from Intel. Launch is expected in Q2/Computex 2026. In case people don't remember Alder Lake-N, I have created a table below to compare the detailed specs of ADL-N and WCL. Just for fun, I am throwing in LNL and the upcoming MediaTek D9500 SoC.

Spec | Intel Alder Lake-N | Intel Wildcat Lake | Intel Lunar Lake | MediaTek D9500
--- | --- | --- | --- | ---
Launch Date | Q1-2023 | Q2-2026 ? | Q3-2024 | Q3-2025
Model | Intel N300 | ? | Core Ultra 7 268V | Dimensity 9500 5G
Dies | 2 | 2 | 2 | 1
Node | Intel 7 + ? | Intel 18A + TSMC N6 | TSMC N3B + N6 | TSMC N3P
CPU | 8 E-cores | 2 P-cores + 4 LP E-cores | 4 P-cores + 4 LP E-cores | C1 1+3+4
Threads | 8 | 6 | 8 | 8
CPU Max Clock | 3.8 GHz | ? | 5 GHz | -
L3 Cache | 6 MB | ? | 12 MB | -
TDP | 7 W | Fanless ? | 17 W | Fanless
Memory | 64-bit LPDDR5-4800 | 64-bit LPDDR5-6800 ? | 128-bit LPDDR5X-8533 | 64-bit LPDDR5X-10667
Memory Size | 16 GB | ? | 32 GB | 24 GB ?
Memory Bandwidth | - | ~55 GB/s | 136 GB/s | 85.6 GB/s
GPU | UHD Graphics | - | Arc 140V | G1 Ultra
EU / Xe | 32 EU | 2 Xe | 8 Xe | 12
GPU Max Clock | 1.25 GHz | - | 2 GHz | -
NPU | NA | 18 TOPS | 48 TOPS | 100 TOPS ?
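
For anyone wanting to sanity-check the bandwidth figures in the table, here is a minimal Python sketch. It assumes the usual theoretical-peak formula (bus width in bytes times transfer rate in MT/s) and decimal GB/s; the function name and the config labels are made up for illustration, and the WCL memory speed is still a rumor.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mtps: int) -> float:
    """Theoretical peak memory bandwidth in decimal GB/s.

    Assumes every transfer moves bus_width_bits / 8 bytes and ignores
    refresh, command overhead and controller efficiency.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mtps / 1000


# (bus width in bits, transfer rate in MT/s), taken from the Memory row of the table
configs = {
    "ADL-N  64-bit LPDDR5-4800": (64, 4800),
    "WCL    64-bit LPDDR5-6800 (rumored)": (64, 6800),
    "LNL   128-bit LPDDR5X-8533": (128, 8533),
    "D9500  64-bit LPDDR5X-10667": (64, 10667),
}

for name, (bits, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(bits, rate):.1f} GB/s")
```

The results land close to the table's numbers; small differences come down to rounding and how the transfer rate is quoted.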









With Hot Chips 34 starting this week, Intel will unveil technical information on the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, called RibbonFET.



 


MS_AT

I think the fact that the combined performance gain is a meagre 9% is a testament to whatever they did not working, because if they had boosted FP in a general way like on Skymont, the FP portion would have done a lot better.

Not getting good gains on perf/clock is what I count as important. The P core team has forever talked a lot about what's a bottleneck or whatever, but aside from Pentium M, Core 2, and Sandy Bridge, it was usually disappointing. 10% for Haswell, 10% for Skylake in an era where it was far easier to get big gains on process. What the hell were they doing?

Skymont is capable of legacy and FMA execution on all 4 ports. Even Lion Cove is only FMA on 2 pipes and other 2 are FP Add, so the gains won't be universal.
While it's commendable that Skymont can do FMA on 4 ports, in reality this is done only to match P core throughput. I am not sure what exactly you mean by legacy instructions, but x64 did not have SIMD FMA instructions before the FMA extension, which was introduced at the same time as AVX. Because of that, compilers will often assume FMA is present when you request AVX instruction sets, as IIRC at the time there was no CPU that supported FMA but not AVX. Why is this important? Because this was done over 10 years ago. So we either have software on the market that practically supports FMA and AVX together, which means it will almost exclusively use the 256b version of FMA to match AVX, or we have SSE-only software that does not make use of FMA at all.

In the first case Skymont will use all execution units to keep up with Raptor Cove, but Raptor Cove can also do an addition at the same time, with Lion Cove adding one more addition on top of that, while Skymont is fully occupied. In the second case FMA doesn't matter.

If we consider scalar FP operations, then yes, Skymont will have an advantage over all P cores until Lion Cove, which will be able to match it for mixed instruction streams [in my experience it's hard to encounter FMA-only code], but due to different design goals it will lose to Lion Cove in absolute performance because of clock differences.
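
To put rough numbers on the argument above, here is a small Python sketch of per-cycle FP issue capacity. The port counts follow the posts (4 FMA ports on Skymont; 2 FMA ports plus 1 or 2 add-only ports on Raptor Cove and Lion Cove). The port widths (128-bit on Skymont, 256-bit on the P cores) and the idea that a 256b op occupies two 128-bit ports are assumptions made for illustration, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpCore:
    name: str
    fma_ports: int        # ports that can issue an FMA every cycle
    extra_add_ports: int  # additional ports limited to FP add
    port_bits: int        # assumed datapath width of each FP port

    def vector_fma_flops_per_cycle(self, op_bits: int, element_bits: int = 32) -> float:
        """Peak FMA FLOPs/cycle for op_bits-wide vector instructions."""
        lanes = op_bits // element_bits
        # Assumption: an op wider than a port is split across several ports.
        ports_per_op = max(1, op_bits // self.port_bits)
        ops_per_cycle = self.fma_ports / ports_per_op
        return ops_per_cycle * lanes * 2  # an FMA counts as 2 FLOPs per lane

    def scalar_ops_per_cycle(self, mixed_stream: bool) -> int:
        """Scalar FP ops issued per cycle; add-only ports help only in a mixed FMA/add stream."""
        return self.fma_ports + (self.extra_add_ports if mixed_stream else 0)


cores = [
    FpCore("Skymont", fma_ports=4, extra_add_ports=0, port_bits=128),
    FpCore("Raptor Cove", fma_ports=2, extra_add_ports=1, port_bits=256),
    FpCore("Lion Cove", fma_ports=2, extra_add_ports=2, port_bits=256),
]

for core in cores:
    print(
        f"{core.name}: 256b FMA FLOPs/cycle = {core.vector_fma_flops_per_cycle(256)}, "
        f"scalar ops/cycle (FMA only) = {core.scalar_ops_per_cycle(False)}, "
        f"scalar ops/cycle (mixed) = {core.scalar_ops_per_cycle(True)}"
    )
```

Under these assumptions the 256b FMA peak is identical across all three cores, Skymont leads on FMA-only scalar code, and Lion Cove catches up once adds are mixed in. Clock speed is deliberately left out.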
 

Hulk

With Raptor Lake, Intel held a 24 vs 16 "Full Core" advantage which, in theory, should have put that processor ahead of Zen 5 in terms of multi-threaded performance. Instead, it took a beating:


Based on your well thought out calculations, where do you see Arrow Lake performing in these multi-thread heavy workloads/benchmarks relative to Zen 5 and Raptor Lake?
I don't think we're supposed to get into AMD vs. Intel here but I don't see Raptor Lake in that review you posted.
 

DavidC1

Okay so Lion Cove architecture is closer to Skymont than Raptor Cove? I'm not arguing the point, just trying to understand.
No way.

-Instruction length data stored in L1i $
-Clustered decode
-OD-ILD, which decodes instruction length on the fly

-Very wide Retire, which is not just an attempt to widen it but is done because it also allows other structures to be shrunk. An ideological difference of carefully adding and taking out as needed for area/power efficiency, unlike the P cores.
-More Store ALUs than Load ALUs. My guess is this benefits the clustered decode.
-Fast path hardware, which can be power hungry, versus Nanocode, which adds a specific microcode ROM to each cluster. So it won't suddenly become 10x faster, but now it won't block other decode, plus it can be parallelized. If you were OK with having an instruction that's dog slow, then you don't suddenly need it to be 10x faster than anything else. This is also a careful balance, unlike the brute force approach of the P cores.
-The E cores go for many simple units made for specific tasks over a few powerful units that can do more, like on the P cores.

Can anyone tell me which part of Lion Cove, or any P core since Sandy Bridge for that matter, was anything more than just add, add, add? Bigger units, more units; that has been the case since Haswell. Even Sandy Bridge for that matter had a core 50% larger than its predecessor. The saving grace for that chip was that there were innovations and it wasn't just pure expansion.

The E core team also doesn't shy away from changing things drastically. From Goldmont to Tremont, it had the L2 predecode cache. In Gracemont they took it out entirely and replaced it with the OD-ILD.

Since the L2 predecode was a rather large 128KB SRAM, taking it out was an efficient choice, and OD-ILD isn't limited by a low hit rate on large data sizes, meaning it performs better. It's quite amazing to me that they took out an entire feature and added a new one while delivering a substantial performance gain overall.
 

Hulk

Can anyone tell me which part of Lion Cove, or any P core since Sandy Bridge for that matter, was anything more than just add, add, add? Bigger units, more units; that has been the case since Haswell. Even Sandy Bridge for that matter had a core 50% larger than its predecessor. The saving grace for that chip was that there were innovations and it wasn't just pure expansion.
This is what I had thought. It's been wider on the front, then wider on the back, a little "smarter" here and there.
But I know most around here are very keen on this stuff so I ask a lot of questions but I don't push back;)
 

Markfw

This is what I had thought. It's been wider on the front, then wider on the back, a little "smarter" here and there.
But I know most around here are very keen on this stuff so I ask a lot of questions but I don't push back;)
All I have to add, is that until its released and benchmarked, we will not really know how good it is.
 

DavidC1

The main difference in Skymont is clustered decoding without UOP cache. But this solution was used in Skymont not because it was better, but because it saved logic and complexity.
This is false. It saves a considerable amount of power, and area in the decode section, which can be used to boost elsewhere. Plus,

This is from the Intel x86 optimization manual:
This overall approach to x86 instruction decoding provides a clear path forward to very wide designs without needing to cache post-decoded instructions.
Gracemont and Skymont have a load balancer, so you don't need to add branch instructions: if no branch occurs for a long time, the hardware adds fake branches so the clusters can keep working in parallel.

The description of Skymont's decoder says that it is even capable of handling loops, so it doesn't even need a loop buffer. Since x86 instructions encounter branches every 6 cycles, and it is easier to fill a 3-wide decoder than wider ones, it'll often end up being better than the traditional approach.
 

DavidC1

For those less knowledgeable, what is an OD-ILD?
It is a feature introduced on Gracemont. It stands for On-Demand Instruction Length Decoder. It does exactly what it sounds like, since x86 has variable-length instructions.

Here's a comment on RWT about Clustered Decode, and why it's not just a "cheap" feature:
Note 1: The above is missing the main point of having _multiple_ N-wide decoders: the average basic block (=BB) size is approximately 5 x86 instructions (4 linear instructions + 1 BB-terminating branch instruction) and if we assume that 50% of those terminating branches are at run-time resolved as TAKEN branches then it implies that _all_ x86 CPUs in the near future will be _required_ to have _multiple_ N-wide decoders.
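
The arithmetic in that note can be played with in a few lines of Python. This is only a toy model with loud assumptions: a decoder cannot decode past a taken branch within the same cycle, each cluster always has its own run of instructions to work on (perfect prediction), and every run has the same length. The function names are made up for illustration.

```python
import math


def instructions_per_taken_branch(bb_size: float, taken_fraction: float) -> float:
    """Average number of instructions between taken branches."""
    return bb_size / taken_fraction


def decoder_throughput(run_length: int, width: int) -> float:
    """Average instructions/cycle for one decoder of the given width.

    Toy assumption: decoding stops at a taken branch, so a run of
    run_length instructions takes ceil(run_length / width) cycles.
    """
    cycles = math.ceil(run_length / width)
    return run_length / cycles


# Numbers from the quoted note: ~5 instructions per basic block, 50% of
# the terminating branches taken.
run = int(instructions_per_taken_branch(bb_size=5, taken_fraction=0.5))

single_wide = decoder_throughput(run, width=9)    # one monolithic 9-wide decoder
clustered = 3 * decoder_throughput(run, width=3)  # three 3-wide clusters, one run each

print(f"run length between taken branches: {run}")
print(f"single 9-wide decoder: {single_wide} instr/cycle")
print(f"three 3-wide clusters: {clustered} instr/cycle")
```

In this model the 9-wide decoder wastes most of its second cycle on every run, while the three clusters each waste a little but work on different runs at once. Real front ends are far more complicated, so treat the numbers as an illustration of the trend only.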
 

511

Absolutely.

I do admit that I had forgotten that Netburst had only a 1 wide decoder. Pentium III and "Core 2" were absolutely better designs.

Reading through the litany of architectural changes in Lion Cove, it certainly appears that this core should be a rocket ship ..... but as with many designs (and I have quite a few under my belt), "It sure looked good on the white board". Actually, I am of the belief that Lion Cove (and its sister designs) will grow into a very successful core design for Intel ...... in a couple of years. Unlike the disaster that was Netburst and Bulldozer, I don't see anything fundamentally mis-calculated here, only a need for both process and design optimization.

Unfortunately, these things take time. Based on the information that we have at this time, Intel will not be "back on top" again until 18A and some design tweaks come about (a couple of years I think).

For those that think me "Anti-Intel", I am absolutely not. No sane person in the world would wish for anything other than strong competition in the market. Furthermore, and on a more personal note, I happen to be a US vet and a long time CPU architecture buff. I want a strong US IP for my country, and Intel is it.

Opinion: Intel needs to fire a bunch of business majors and focus on their product strategy (vs figuring out how to better leverage their monopoly position to maximize their profit). It is my opinion that Intel stagnated under a bunch of tight neck ties and desperately needs an engineering kick in the a$$. In their engineering lethargy, Intel have allowed TSMC to flank them severely. AMD simply hit a great combination of design and lithography advances available and executed on it. In theory, having a vertically integrated design and foundry process SHOULD have allowed Intel to dominate the industry indefinitely. It is only in Intel's horrendous lack of forward vision that AMD and TSMC have unseated them. I believe Pat G can put the company back on track ........ if he gets enough time. It's really hard to work your way through an army of pencil necks. [/end rant]
Definitely. Fab leadership was lost because some idiots did not finance it enough, and it took design down with it due to how interdependent they were. Pat G fixed the nodes, something previous CEOs were incapable of.
 

reb0rn

The first rule of fight club/CPU forum is attack the post not the poster. No personal attacks.
Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.
Many of you here just spread FUD about AMD. No one cares about some stupid half-baked bench done by Tom's about an AMD server CPU here in an Intel topic, nor does anyone care for people spreading hype over FUD info, like someone who looks at one core's speed on a 192-core server and hypes its single-thread speed.
 

DrMrLordX

no one cares about some stupid half-baked bench done by Tom's
How about the Phoronix reviews involving Granite Rapids? There are at least two of them, one from when it launched (versus previous-gen server parts) and one more showcasing Granite Rapids versus the current competition. I haven't even looked at the Tom's review yet.
 

OneEng2

Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.
OMG! I totally missed this most excellent jest! I was somehow thinking that the mighty @Hulk had some LN-cooled Intel super rig back in the day. You are so right. That was quite a miss. Turns out transistors make a lot of heat when you switch them that fast :). Who knew :).

@Hulk : Please forgive my stupidity. I won't underestimate you again :).
I don't think we're supposed to get into AMD vs. Intel here but I don't see Raptor Lake in that review you posted.
... and you are again quite correct. I missed that as well (I just turned 59 this week, so I'm going to chalk it up to my newly found "old age").
Many of you here just spread FUD about AMD. No one cares about some stupid half-baked bench done by Tom's about an AMD server CPU here in an Intel topic, nor does anyone care for people spreading hype over FUD info, like someone who looks at one core's speed on a 192-core server and hypes its single-thread speed.

How about the Phoronix reviews involving Granite Rapids? There are at least two of them, one from when it launched (versus previous-gen server parts) and one more showcasing Granite Rapids versus the current competition. I haven't even looked at the Tom's review yet.
@reb0rn
Would love to see benchmarks that depict GNR in a good light against Turin if you have them. It would certainly be good news for Intel if this were the case.
 

Hulk

OMG! I totally missed this most excellent jest! I was somehow thinking that the mighty @Hulk had some LN-cooled Intel super rig back in the day. You are so right. That was quite a miss. Turns out transistors make a lot of heat when you switch them that fast :). Who knew :).

@Hulk : Please forgive my stupidity. I won't underestimate you again :).

... and you are again quite correct. I missed that as well (I just turned 59 this week, so I'm going to chalk it up to my newly found "old age").



@reb0rn
Would love to see benchmarks that depict GNR in a good light against Turin if you have them. It would certainly be good news for Intel if this were the case.
@OneEng2,
No worries! I didn't respond because I was pretty sure you didn't scroll down and see the punchline.
 

DavidC1

We all know the old line about the broken clock being right twice a day or the blind squirrel sometimes stumbling on a nut. Not MLID.
He's expecting 30-40% ST and 15-20% ST for Pantherlake desktop over Arrowlake.

I don't think Pantherlake desktop even exists? And Pantherlake is supposed to be basically an 18A shrink?

I admit, he got a few things right, but those numbers are purely made up.
 

sgs_x86

Did some further CB R24 MT testing. As you add P cores strange things happen. I ran these tests at 5GHz P, 4GHz E, just to make sure there is no throttling or other "at the limit" behavior.

Anyway, assuming Raptor Cove scores 22.6 points/GHz here are the scores as you add P cores to 16 E cores during the render.
1P+16E - 15.2 points/GHz for E's
2P+16E - 15.4
4P+16E - 14.7
6P+16E - 14.0
8P+16E - 13.1

Other than the increase from 1 to 2 P's, the IPC of the E's decreases as P's are added. Anybody have any reasoning for this behavior?

It is of course possible the IPC of the P's is also or only changing, but I have found the P IPC to be relatively stable when testing various numbers of P's.

It's a rabbit hole not worth spending too much time on but I have a hard time leaving it alone...
I could be horribly wrong, so please correct me. When 1 P-core is enabled, the E-cores get to use most of the shared L3 cache. As more P-cores are enabled, the E-cores' share of the L3 shrinks, and so does their IPC. We know that E-cores love cache.
 
Jul 27, 2020
28,001
19,125
146
It could also be a power thing where at the same power limit, the E-cores just get less and less power per core as more P-cores get added.
 

Josh128

Honestly, other than the perf uplifts, he got the launch timeframe right. Was he right about the iGPU? As far as the 8+32 thing, it doesn't look like that will happen-- but the fact that Intel is using "285K" as its initial top SKU seems to indicate that a "295K" is either planned or was planned. A completely new and larger die just for a halo SKU for more MT seems like a giant waste of engineering resources and silicon though if 285K already mostly beats 9950X in MT.

So there actually may have been something to that at some point, or still could be.
 

511

Honestly, other than the perf uplifts, he got the launch timeframe right. Was he right about the iGPU? As far as the 8+32 thing, it doesn't look like that will happen-- but the fact that Intel is using "285K" as its initial top SKU seems to indicate that a "295K" is either planned or was planned. A completely new and larger die just for a halo SKU for more MT seems like a giant waste of engineering resources and silicon though if 285K already mostly beats 9950X in MT.
Agreed, but missing an estimate by a few % and making ridiculous claims are two different things.
 
Jul 27, 2020
28,001
19,125
146
But the frequency of the E's is holding steady?
Then it's either memory bandwidth starvation or ring traffic congestion. Can't be the former in CB R23. With CB R24, it's a possibility. The ring thing is kinda dangerous if you raise its frequency due to degradation possibility.
 

Hulk

I think it is just one of the core negatives of the ringbus. The more stops it has, the worse it performs.
Could be. But the P's perform about the same as you add them, but they have much larger L2 per core and that may be enough for CB while the E might need more and have to go to the ring more.

You can make a CB "calculator" but you need a few versions to take into account the varying performance of the E's depending on the P+E configuration.
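
A rough Python sketch of what such a calculator could look like, using only the points/GHz figures from the CB R24 test quoted earlier in the thread. It assumes those figures are per core, that the total score is simply the sum of the P and E contributions, and that the E-core figure depends only on how many P cores are active alongside 16 E cores. The names are hypothetical and configurations that were not measured are rejected rather than interpolated.

```python
P_POINTS_PER_GHZ = 22.6  # Raptor Cove figure assumed in the quoted test

# Measured E-core points/GHz with 16 E cores active, keyed by active P cores
E_POINTS_PER_GHZ = {1: 15.2, 2: 15.4, 4: 14.7, 6: 14.0, 8: 13.1}


def estimate_cb_r24_mt(p_cores: int, p_ghz: float, e_ghz: float, e_cores: int = 16) -> float:
    """Estimate a CB R24 MT score for a measured P+16E configuration."""
    if p_cores not in E_POINTS_PER_GHZ:
        raise ValueError(f"no E-core measurement for {p_cores} P cores")
    p_part = p_cores * p_ghz * P_POINTS_PER_GHZ
    e_part = e_cores * e_ghz * E_POINTS_PER_GHZ[p_cores]
    return p_part + e_part


def e_core_loss_vs_1p(p_cores: int) -> float:
    """Relative drop in E-core points/GHz compared with the 1P+16E case."""
    return 1 - E_POINTS_PER_GHZ[p_cores] / E_POINTS_PER_GHZ[1]


# Same clocks as the quoted test: 5 GHz on the P cores, 4 GHz on the E cores
for p in sorted(E_POINTS_PER_GHZ):
    score = estimate_cb_r24_mt(p, p_ghz=5.0, e_ghz=4.0)
    print(f"{p}P+16E: ~{score:.0f} points, E-core loss vs 1P: {e_core_loss_vs_1p(p):.1%}")
```

Swapping in a different lookup table per P+E configuration is the "few versions" part; the structure stays the same.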