Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads

Page 673

Tigerick

Senior member
Apr 1, 2022
911
829
106
Wildcat Lake (WCL) Preliminary Specs

Intel Wildcat Lake (WCL) is an upcoming mobile SoC replacing ADL-N. WCL consists of two tiles: a compute tile and a PCD tile. The compute tile is a true single die containing the CPU, GPU and NPU, fabbed on the 18A process. Last time I checked, the PCD tile is fabbed on TSMC's N6 process. The two are connected through UCIe rather than D2D, a first for Intel. Expect a launch around Q2/Computex 2026. In case people don't remember Alder Lake-N, I have created a table below comparing the detailed specs of ADL-N and WCL. Just for fun, I am throwing in LNL and the upcoming MediaTek D9500 SoC.

| | Intel Alder Lake-N | Intel Wildcat Lake | Intel Lunar Lake | MediaTek D9500 |
| Launch Date | Q1-2023 | Q2-2026 ? | Q3-2024 | Q3-2025 |
| Model | Intel N300 | ? | Core Ultra 7 268V | Dimensity 9500 5G |
| Dies | 2 | 2 | 2 | 1 |
| Node | Intel 7 + ? | Intel 18A + TSMC N6 | TSMC N3B + N6 | TSMC N3P |
| CPU | 8 E-cores | 2 P-cores + 4 LP E-cores | 4 P-cores + 4 LP E-cores | C1 1+3+4 |
| Threads | 8 | 6 | 8 | 8 |
| CPU Max Clock | 3.8 GHz | ? | 5 GHz | |
| L3 Cache | 6 MB | ? | 12 MB | |
| TDP | 7 W | Fanless ? | 17 W | Fanless |
| Memory | 64-bit LPDDR5-4800 | 64-bit LPDDR5-6800 ? | 128-bit LPDDR5X-8533 | 64-bit LPDDR5X-10667 |
| Memory Size | 16 GB | ? | 32 GB | 24 GB ? |
| Bandwidth | | ~55 GB/s | 136 GB/s | 85.6 GB/s |
| GPU | UHD Graphics | ? | Arc 140V | G1 Ultra |
| EU / Xe | 32 EU | 2 Xe | 8 Xe | 12 |
| GPU Max Clock | 1.25 GHz | ? | 2 GHz | |
| NPU | NA | 18 TOPS | 48 TOPS | 100 TOPS ? |
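The bandwidth row is just bus width times transfer rate. A quick sketch (using the memory configs from the table; WCL's LPDDR5-6800 is the rumored figure above, not confirmed) reproduces the listed numbers:

```python
# Minimal sketch: theoretical peak memory bandwidth = bus width (bytes) x transfer rate.
# Configs are taken from the table above; WCL's 6800 MT/s is the rumored value.

def peak_bandwidth_gbps(bus_bits: int, mtps: int) -> float:
    """Peak bandwidth in GB/s for an LPDDR interface of bus_bits width at mtps MT/s."""
    return bus_bits / 8 * mtps / 1000  # bytes per transfer x mega-transfers/s -> GB/s

configs = {
    "ADL-N  (64-bit LPDDR5-4800)":   (64, 4800),
    "WCL    (64-bit LPDDR5-6800 ?)": (64, 6800),
    "LNL    (128-bit LPDDR5X-8533)": (128, 8533),
    "D9500  (64-bit LPDDR5X-10667)": (64, 10667),
}

for name, (bits, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gbps(bits, rate):.1f} GB/s")
# -> ~38.4, ~54.4, ~136.5, ~85.3 GB/s, matching the ~55 / 136 / 85.6 figures in the table.
```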

PPT1.jpg
PPT2.jpg
PPT3.jpg

With Hot Chips 34 starting this week, Intel will unveil technical information on the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new generation of platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.



LNL-MX.png
 

Attachments

  • PantherLake.png
  • LNL.png
  • INTEL-CORE-100-ULTRA-METEOR-LAKE-OFFCIAL-SLIDE-2.jpg
  • Clockspeed.png
Last edited:

511

Diamond Member
Jul 12, 2024
5,031
4,533
106
Thanks, Swan, for initiating such a mid release, and Pat for doubling down on it
 

LightningZ71

Platinum Member
Mar 10, 2017
2,625
3,308
136
Also keep in mind the 288 core Clearwater Forest is 12 dies of 24 cores each, with 24 cores being composed of 6x quad core clusters.
I missed that detail. So, in a certain way, it's similar to a hypothetical 12CCD Epyc processor where each CCD is a 6 core CCX, but the cores are 'mont quads. Assuming that they get their I/O setup right, it should be broadly competitive.
 

Saylick

Diamond Member
Sep 10, 2012
4,097
9,576
136
I missed that detail. So, in a certain way, it's similar to a hypothetical 12CCD Epyc processor where each CCD is a 6 core CCX, but the cores are 'mont quads. Assuming that they get their I/O setup right, it should be broadly competitive.
Not quite as similar to EPYC, I think. CCD-to-CCD communication has to go through the IOD in Zen, while Intel uses a mesh interconnect, so each cluster talks to each other via the on-die network, where latency is based on the number of hops between nodes in the mesh. It was this way for Sapphire Rapids, Emerald Rapids, and Granite Rapids. I don’t see why it would change for what comes next, even if there are more compute tiles.

For P-core server products, there’s one network node per core. I’ll have to double check, but it would not surprise me if for E-core server products, there’s one network node per E-core cluster.
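To make the hop-count point concrete, here is a toy sketch. It assumes a plain 2D mesh with X-then-Y routing; the grid coordinates and per-hop cycle cost are made-up numbers for illustration, not Intel's actual fabric parameters:

```python
# Rough sketch of "latency scales with hops" on a simple 2D mesh with
# dimension-ordered (X then Y) routing. All numbers are illustrative only.

def hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Manhattan distance between two mesh stops, given as (col, row)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

PER_HOP_CYCLES = 4          # assumed cost per mesh hop (illustrative)
cluster_stop = (0, 0)       # e.g. an E-core cluster in one corner
imc_stop = (7, 9)           # e.g. a memory controller across the die

n = hops(cluster_stop, imc_stop)
print(f"{n} hops -> ~{n * PER_HOP_CYCLES} cycles of fabric latency (one way)")
```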
1731382988143.jpeg
 
  • Like
Reactions: Elfear

OneEng2

Senior member
Sep 19, 2022
951
1,163
106
How are you calculating per-core performance? If AVX-512 is included, then yes; if not, then no. For pure integer workloads, which many are, both will have similar integer performance per thread.

I agree on the VM part, but there are customers who disable SMT, so it will give them more physical cores to work with.
On a side note, not a single site benchmarks the accelerators in these chips. They are niche but have decent use cases.
Zen 5c can effectively operate on 1.4 threads at a time per core. Skymont can operate on 1 thread at a time. If the single-core IPC of Zen 5c were exactly equal to Skymont's, and they were clocked at the same speed, Zen 5c would still deliver 1.4 times the per-core throughput of Skymont. This would be the worst case in an MT application.

Additionally, if there are any AVX-512 instructions in the workload, Zen 5c gets another big boost.

That is where I got the 40-60% "guesstimate" or SWAG :).
Yeah, I don't know about that... Unless 18A is complete garbage (which is a possibility), 40-60% higher perf seems too optimistic outside niche HPC/AI workloads, and Clearwater has more cores. EPYC Zen 4c had 50% higher perf per core... against 8-channel Sierra, which had Crestmont and 100 MB of L3. CLF has Darkmont, equipped with more L3 and likely faster 12-channel DDR5. Crestmont was slower than Zen 4 in integer perf, let alone FP, but Skymont has already narrowed that gap.
Clearwater does have more cores, but each core can only operate on 1 thread, while Zen 5c can operate on 1.4 threads at a time in an MT workload. Add in any AVX-512 or FP tasks in the workload and it isn't hard to see each Zen 5c core doing 1.5 times the work of each Skymont core.

Someone show me where my math is off here. Seems like lots of people think I am off base (and I might be).
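For what it's worth, here is the arithmetic as stated, as a tiny sketch. The 1.4x SMT factor and the equal-IPC/equal-clock baseline are the post's own assumptions, and the 10% AVX-512 bump in the example call is a hypothetical number of mine, not a measurement:

```python
# A sketch of the per-core comparison as laid out above; not a benchmark.

def relative_percore_throughput(ipc_ratio=1.0, clock_ratio=1.0,
                                smt_factor=1.4, avx512_factor=1.0):
    """Zen 5c MT throughput per core relative to a 1-thread Skymont core."""
    return ipc_ratio * clock_ratio * smt_factor * avx512_factor

print(relative_percore_throughput())                    # integer MT baseline: ~1.4x
print(relative_percore_throughput(avx512_factor=1.1))   # with a hypothetical AVX-512 bump: ~1.54x
```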
My biggest concern for Intel's very high core count 'mont server processors has nothing to do with the cores themselves, save if they can have a competitive AVX-512 implementation and remain compact and efficient enough, but is far more about Intel's mesh fabric connecting them all. 288 cores in clusters of 4 is still 72 reservation stations. How will the mesh affect performance for them?
That is a valid concern as well. Feeding 288 cores is a marvel all on its own. I think we are definitely looking at more bandwidth (and socket power) for future DC processors.
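As a back-of-the-envelope on the feeding problem: assuming 12 channels of DDR5-6400 (the thread only says "likely faster 12ch DDR5", so the speed is a guess), the per-core share of peak DRAM bandwidth comes out to roughly 2 GB/s:

```python
# Back-of-envelope for feeding 288 cores: per-core share of peak DRAM bandwidth.
# 12 channels of DDR5-6400 is an assumption, not a confirmed spec.

CHANNELS = 12
BYTES_PER_TRANSFER = 8      # 64-bit DDR5 channel = 8 bytes per transfer
MTPS = 6400                 # assumed DDR5-6400
CORES = 288

peak_gbs = CHANNELS * BYTES_PER_TRANSFER * MTPS / 1000   # GB/s
print(f"~{peak_gbs:.0f} GB/s peak DRAM bandwidth")
print(f"~{peak_gbs / CORES:.1f} GB/s per core across {CORES} cores")
```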
 
  • Like
Reactions: Tlh97
Jul 27, 2020
28,173
19,203
146
No need to be an Intel beta tester.
That's actually a great argument in trying to get a free 285K.

Dear Intel,

With my two prior RMA requests X and Y and now a third one, I think I have demonstrated quite consistently that I'm the sort of user who is perfectly suited to testing your CPUs and stressing them in normal workloads without any sort of overclocking involved. I think it would be prudent to let me have the 285K so the respective product teams can learn how actual users work in real life using Intel processors, instead of running Cinebench on a constant loop for X amount of hours and declaring a processor fit for public consumption.

Yours truly,
The "three times successful smasher of Intel CPUs" Hulk
 

jdubs03

Golden Member
Oct 1, 2013
1,333
935
136
That's actually a great argument in trying to get a free 285K.

Dear Intel,

With my two prior RMA requests X and Y and now a third one, I think I have demonstrated quite consistently that I'm the sort of user who is perfectly suited to testing your CPUs and stressing them in normal workloads without any sort of overclocking involved. I think it would be prudent to let me have the 285K so the respective product teams can learn how actual users work in real life using Intel processors, instead of running Cinebench on a constant loop for X amount of hours and declaring a processor fit for public consumption.

Yours truly,
The "three times successful smasher of Intel CPUs" Hulk
Heh, might work except for maybe that last sentence there.
Put the pressure on Mr. Banner!
 

Hulk

Diamond Member
Oct 9, 1999
5,213
3,842
136
That's actually a great argument in trying to get a free 285K.

Dear Intel,

With my two prior RMA requests X and Y and now a third one, I think I have demonstrated quite consistently that I'm the sort of user who is perfectly suited to testing your CPUs and stressing them in normal workloads without any sort of overclocking involved. I think it would be prudent to let me have the 285K so the respective product teams can learn how actual users work in real life using Intel processors, instead of running Cinebench on a constant loop for X amount of hours and declaring a processor fit for public consumption.

Yours truly,
The "three times successful smasher of Intel CPUs" Hulk
I also cap frequency at 5.5GHz and power at 200W and I still seem to burn them out.

Funny thing is, around 20 years ago I used to do a lot of beta testing and actually wrote a few "how to" books on some of the software I was testing. Even got to go to NAB in Vegas twice and speak as an "expert" on video editing and get paid for the trip. Then YouTube came and wiped out that market.
 
Jul 27, 2020
28,173
19,203
146
I also cap frequency at 5.5GHz and power at 200W and I still seem to burn them out.
Hey, maybe the CPU tries to compensate for the lack of power by overperforming in ST with higher than normal boosts since it isn't allowed to flex its muscles MT-wise? Definitely seems like some sort of boosting algo designed to win benchmarks in the short term.
 

DavidC1

Platinum Member
Dec 29, 2023
2,006
3,153
96
Not quite as similar to EPYC, I think. CCD-to-CCD communication has to go through the IOD in Zen, while Intel uses a mesh interconnect, so each cluster talks to each other via the on-die network, where latency is based on the number of hops between nodes in the mesh. It was this way for Sapphire Rapids, Emerald Rapids, and Granite Rapids. I don’t see why it would change for what comes next, even if there are more compute tiles.
While it uses the mesh, Clearwater Forest potentially has one big advantage over current server chips, and it's that it can use Foveros Direct to communicate, whereas now it's using EMIB.

Foveros Direct is the most advanced version:
1731392883307.png

So potentially the connections between the clusters can be much faster and with less power penalties.
 

adroc_thurston

Diamond Member
Jul 2, 2023
7,812
10,530
106
While it uses the mesh, Clearwater Forest potentially has one big advantage over current server chips, and it's that it can use Foveros Direct to communicate, whereas now it's using EMIB.
Nope. The bases and the I/O caps are still chained over EMIB.
Hybrid bonding just allows them to put cores on top of cache, same as PVC/MI300/Granite Ridge-X/you-name-it.
Plus you're still dealing with a rather xboxhueg mesh in any case, which is slow.
 

DavidC1

Platinum Member
Dec 29, 2023
2,006
3,153
96
Skymont is a beast. Lion Cove is a total letdown. I am really curious how much larger Skymont would be if they scaled it up to hit the same clocks as Lion Cove and gave it a similar instruction set, e.g. AVX-512.
The better way is to make it as wide as possible and keep it at 5 GHz or below. Like 50% faster per clock.

Lion Cove and Skymont should also be able to put out a few percent more if the SoC itself didn't suck.

@adroc_thurston That's a bit of a disappointment. I guess it's V-cache with cache as a base tile then.
 
  • Like
Reactions: Tlh97

DavidC1

Platinum Member
Dec 29, 2023
2,006
3,153
96
Each Clearwater Forest 24-core tile seems to be about 90 mm². That means the quad-core cluster is a little under 15 mm².

The total across the 12 compute tiles is ~1400 mm². The base is Intel 3. They can put a LOT of cache underneath if they want to. I read it's only around 1/2 GB though? If they make it take up the space underneath, they could get 1 GB of SRAM under there.
 
  • Like
Reactions: Tlh97

cannedlake240

Senior member
Jul 4, 2024
247
138
76
Each Clearwater Forest 24-core tile seems to be about 90 mm². That means the quad-core cluster is a little under 15 mm².

The total across the 12 compute tiles is ~1400 mm². The base is Intel 3. They can put a LOT of cache underneath if they want to. I read it's only around 1/2 GB though? If they make it take up the space underneath, they could get 1 GB of SRAM under there.
Apparently the 4x3 CLF layout Intel has been showing isn't what the actual package looks like. Bionic on Twitter said a while ago that Clearwater has more than 2x the L3 of SRF, so more than twice the 216 MB of the 288C Sierra. The base tiles house the EMIB PHYs, memory controllers, and the mesh fabric.

One could also take the "doubling of L3" as 2x over the 144C SRF, which would just be baffling lol... Imagine a 288C CPU with so much SRAM real estate only having a little over 200 MB of L3. If that's the case, they should at least double the cluster L2 as well; an 8 MB cluster L2 has been talked about since the Tremont days.
 
Last edited:
  • Like
Reactions: Tlh97

Saylick

Diamond Member
Sep 10, 2012
4,097
9,576
136
Better way is to make it wide as possible and keep it at 5GHz or below. Like 50% faster per clock.
It would be interesting to see how well the clustered decode approach scales. Why not just add another cluster of 3-wide decode at this point and then widen everything else downstream?
 

511

Diamond Member
Jul 12, 2024
5,031
4,533
106
I also cap frequency at 5.5GHz and power at 200W and I still seem to burn them out.

Funny thing is, around 20 years ago I used to do a lot of beta testing and actually wrote a few "how to" books on some of the software I was testing. Even got to go to NAB in Vegas twice and speak as an "expert" on video editing and get paid for the trip. Then YouTube came and wiped out that market.
If you have had multiple defective CPUs, either you are constantly getting bad CPUs or something on the motherboard is causing issues. My advice would be to cap the IA voltage at around 1.45 V, or maybe 1.5 V depending on the desired frequency; it will prevent degradation altogether on a new CPU.
How is Zen 5 so fast in FP?
They significantly improved the FP performance: full-fat AVX-512 with more units to feed it.
 
Last edited:

511

Diamond Member
Jul 12, 2024
5,031
4,533
106
Each Clearwater Forest 24-core tile seems to be about 90 mm². That means the quad-core cluster is a little under 15 mm².

The total across the 12 compute tiles is ~1400 mm². The base is Intel 3. They can put a LOT of cache underneath if they want to. I read it's only around 1/2 GB though? If they make it take up the space underneath, they could get 1 GB of SRAM under there.
Do you have a die shot available?
 

DavidC1

Platinum Member
Dec 29, 2023
2,006
3,153
96
Would be interesting to see how well the clustered decode approach scales. Why not just add another cluster of 3-wide decode at this point and then widen everything else downstream.
It is exactly what it says in the optimization manual for Gracemont:
This overall approach to x86 instruction decoding provides a clear path forward to very wide designs without needing to cache post-decoded instructions.
Can't be more optimistic than that.
- It saves on complexity, meaning less time
- It saves on transistors, meaning less power and area
- It can scale easily, while going above 8-wide traditionally is questionable
- Each cluster is only 3-wide, so easier to fill
- It works on both branches and loops
- Further opportunities for improvement, not just in the decode section but coupled with changes elsewhere

There was an X post about Keller having worked on Intel's next architecture with 12-wide decode. This is likely Arctic Wolf.

And I doubt they're widening it by 33% to get 5-10% gains. That's not what they've been doing. Branch predictor on Skymont is 27% over Gracemont. FP is 20-30% more area for 30% extra performance.

The AnandTech article about Atom said the design goals within that team were 1% power for 2% performance, and keeping the core compact. You need to be very balanced to do that; you can't go spending too much in one area and skimping on another. I would not be surprised if they bring some more new ideas to deliver on it.
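A toy model of why clustered decode scales: hand fetch blocks (instructions between taken branches) out round-robin to N 3-wide clusters and see what aggregate decode rate falls out. This is purely illustrative, with invented block lengths, and it ignores the real front end's fetch, queueing, and misprediction behavior:

```python
# Toy model of clustered decode scaling; not how Skymont's front end actually schedules.
import math
from itertools import cycle

def sustained_decode_rate(block_lengths, clusters=3, width=3):
    """Average decoded instructions per cycle, assuming a deep backlog of blocks."""
    busy = [0] * clusters                      # cycles of work queued per cluster
    for blk, c in zip(block_lengths, cycle(range(clusters))):
        busy[c] += math.ceil(blk / width)      # a cluster decodes `width` insts per cycle
    return sum(block_lengths) / max(busy)      # limited by the busiest cluster

blocks = [7, 4, 12, 3, 9, 5, 6, 8] * 50        # made-up taken-branch block lengths
for n in (2, 3, 4):                            # Gracemont has 2 clusters, Skymont 3; 4 is hypothetical
    print(f"{n} clusters: ~{sustained_decode_rate(blocks, clusters=n):.1f} decoded insts/cycle")
```

Under this (very generous) backlog assumption the rate scales almost linearly with cluster count, which is the manual's point about a path to very wide designs.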
 
Last edited:

DavidC1

Platinum Member
Dec 29, 2023
2,006
3,153
96
Dou you have a die shot available?
You don't need a die shot; you only need a shot of the package. You know the size of the LGA7529 package. The actual shot shows three narrow dies. The narrowness is because the dies are right next to each other, like on Meteor Lake. Each narrow die is actually 4 of them. You find the size of the narrow geometry and divide by 4.

It's possible 4 of the dies are connected by Foveros, and the connection to the other quad-die groups is done using EMIB. Then again, SPR uses EMIB only and the dies are pretty close together.

Same with how I got the size of Turin Dense's Zen 5c: you get the package size and measure the die against it. There's a clear separation between core and L2 for AMD, so it's even easier to find the core size alone.
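The package-shot trick is just scaling pixels by a known dimension. A sketch of that arithmetic, where every number is a placeholder rather than a real measurement of CLF or the LGA7529 package:

```python
# Sketch of the measurement trick: scale pixel measurements from a package photo
# by a known physical dimension. All values below are placeholders, not real data.

KNOWN_PACKAGE_WIDTH_MM = 105.0       # placeholder: physical width of the package
package_width_px = 2100.0            # placeholder: same width measured in the photo
mm_per_px = KNOWN_PACKAGE_WIDTH_MM / package_width_px

die_width_px, die_height_px = 160.0, 720.0   # placeholder pixel measurements of one narrow die
die_w_mm = die_width_px * mm_per_px
die_h_mm = die_height_px * mm_per_px
die_area = die_w_mm * die_h_mm

print(f"narrow die: {die_w_mm:.1f} x {die_h_mm:.1f} mm = {die_area:.0f} mm^2")
print(f"per compute tile (narrow die / 4): ~{die_area / 4:.0f} mm^2")
```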
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
23,062
13,164
136
Is it likely that Intel put more resources into the development of Skymont because it does double duty, client and server? Whereas Lion Cove is pretty much just client.

Technically, Skymont won't see use in any server products. Darkmont will, though.

It is unfortunate that Intel doesn't have a Lion Cove-based server product and instead chose to use Redwood Cove in Granite Rapids.

Oh yeah! X3D didn’t. Turin didn’t. Same as GNR didn’t. And Lunar Lake didn’t.

Granite Rapids (assuming you're talking about that, and not Granite Ridge) isn't even using the same core as the 285k. It's Redwood Cove.

Lunar Lake doesn't share the same compute chiplet or even have the same package layout and is a niche product for 15W and below. It's good for what it is, but . . . not exactly the same thing!

Meanwhile, Turin, Granite Rapids, and Granite Rapids-X use the same CCDs. It's all the same product.

When Zen 5 desktop parts were launched it was a disaster

Why, because some reviewers didn't like the game performance on the 9950X? Please. It's the most lucrative disaster AMD ever had. And a few weeks later, X3D parts hit the streets and all was forgiven. Meanwhile, take a look at client market share for Q3 2024 and see what's really happening.
 
  • Like
Reactions: Tlh97 and misuspita