Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads


Tigerick

Senior member
Apr 1, 2022
Wildcat Lake (WCL) Specs

Intel Wildcat Lake (WCL) is an upcoming mobile SoC replacing Raptor Lake-U. WCL consists of two tiles: a compute tile and a PCD tile. The compute tile is a true single die integrating the CPU, GPU, and NPU, fabbed on the 18A process. Last time I checked, the PCD tile is fabbed on TSMC's N6 process. The tiles are connected through UCIe rather than Intel's D2D link, a first for Intel. Expecting a launch in Q1 2026.

                 | Intel Raptor Lake U     | Intel Wildcat Lake 15W?   | Intel Lunar Lake          | Intel Panther Lake 4+0+4
Launch Date      | Q1-2024                 | Q2-2026                   | Q3-2024                   | Q1-2026
Model            | Intel 150U              | Intel Core 7              | Core Ultra 7 268V         | Core Ultra 7 365
Dies             | 2                       | 2                         | 2                         | 3
Node             | Intel 7 + ?             | Intel 18A + TSMC N6       | TSMC N3B + N6             | Intel 18A + Intel 3 + TSMC N6
CPU              | 2 P-cores + 8 E-cores   | 2 P-cores + 4 LP E-cores  | 4 P-cores + 4 LP E-cores  | 4 P-cores + 4 LP E-cores
Threads          | 12                      | 6                         | 8                         | 8
CPU Max Clock    | 5.4 GHz                 | ?                         | 5 GHz                     | 4.8 GHz
L3 Cache         | 12 MB                   | ?                         | 12 MB                     | 12 MB
TDP              | 15-55 W                 | 15 W ?                    | 17-37 W                   | 25-55 W
Memory           | 128-bit LPDDR5-5200     | 64-bit LPDDR5             | 128-bit LPDDR5X-8533      | 128-bit LPDDR5X-7467
Max Memory       | 96 GB                   | ?                         | 32 GB                     | 128 GB
Bandwidth        | ?                       | ?                         | 136 GB/s                  | ?
GPU              | Intel Graphics          | Intel Graphics            | Arc 140V                  | Intel Graphics
RT               | No                      | No                        | Yes                       | Yes
EU / Xe Cores    | 96 EU                   | 2 Xe                      | 8 Xe                      | 4 Xe
GPU Max Clock    | 1.3 GHz                 | ?                         | 2 GHz                     | 2.5 GHz
NPU              | GNA 3.0                 | 18 TOPS                   | 48 TOPS                   | 49 TOPS






With Hot Chips 34 starting this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile built on the Intel 4 process, Intel's first node to use EUV lithography. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, branded RibbonFET.




TwistedAndy

Member
May 23, 2024
Apple is boring because their cores are already good. To the point where Apple’s P core is more efficient and more powerful clock per clock than Skymont and LionCove.

Apple M1 was really good, but if we compare M4 to M1, the clock-for-clock difference in Geekbench 5 is only about 8% (149.3% * 3.19 / 4.4 - 100% = 8.2%, link to results). If we use SPEC 2017, the IPC difference is in the same ballpark (~11%).
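A quick sketch of that clock normalization (the scores and clocks are the figures quoted above, not official numbers):

```python
# Clock-normalized (IPC) comparison of M4 vs M1, using the figures quoted above.
m1_clock_ghz = 3.19      # M1 P-core peak clock (as quoted)
m4_clock_ghz = 4.4       # M4 P-core peak clock
m4_vs_m1_score = 1.493   # M4 scores ~149.3% of M1 in Geekbench 5

ipc_gain = m4_vs_m1_score * (m1_clock_ghz / m4_clock_ghz) - 1
print(f"Clock-for-clock (IPC) gain: {ipc_gain:.1%}")  # ~8.2%
```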

If we consider the raw performance and power consumption, the P-core in M4 consumes more than twice as much power as the one in M1 in SPEC 2017 (7.21W vs 3.43W for INT and 8.95W vs 3.92W for FP) while being 50% faster. That's a pretty good tradeoff, but we are seeing a huge power consumption increase on the newer node (N3E vs N5). In the case of M4 Max, Apple has to throttle clocks to M3 Max levels to fit into the power package.

As for power efficiency, it's the most misleading metric out there because the performance/power curve is not linear. There's a point of maximum efficiency, but device manufacturers often push the power limits higher to achieve a slight performance increase. Technically, if Apple decided to run the M4 at 4 GHz instead of 4.4 GHz, the power consumption would nearly match the M3 (5-6W), but the performance would only be 7-8% higher than M3's in SPEC 2017.
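A back-of-the-envelope illustration of that trade-off, using the rough wattage and performance figures from this post (not measurements):

```python
# Rough perf/W at two hypothetical M4 operating points, using the approximate
# numbers from this post (not measured data): ~5.5 W at 4.0 GHz, ~8 W at 4.4 GHz,
# with the last ~10% of clock buying only ~8% performance.
points = {
    "M4 @ 4.0 GHz": {"perf": 1.00, "power_w": 5.5},
    "M4 @ 4.4 GHz": {"perf": 1.08, "power_w": 8.0},
}
for name, p in points.items():
    print(f"{name}: perf/W = {p['perf'] / p['power_w']:.3f}")
# The last few hundred MHz cost ~45% more power for a single-digit performance gain.
```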

As for Skymont, the "sweet spot" will be around 3-5W at 4.0-4.6 GHz. I don't think it makes sense for Intel to push it even higher, but who knows?

Skymont is great because Crestmont sucked and it can no longer be considered an Atom core. We have yet to see how Lunar Lake performs in benchmarks and applications.

Crestmont is a further improvement to Gracemont, which was a huge step compared to Tremont. Skymont is another huge step forward.

The P-core team needs to take more ideas from Stevens's E-core team to make the P-core even more area efficient, like Skymont.

Actually, they do take some ideas from the E-cores (split INT/FP scheduler, wider retire, etc.), but it's very hard to refactor a large core. Also, Intel may decide to test some ideas on the E-cores first before bringing them to the P-cores.
 

Nothingness

Diamond Member
Jul 3, 2013
From an architectural standpoint, Skymont was built as a P-core and is not so different from Apple's P-cores or ARM Cortex-X2.

It can decode 9 instructions per cycle using three non-blocking decoding clusters. If there's a complex instruction requiring microcode reading, the other decoding clusters are not blocked. Intel calls it "nanocode," and on paper, it looks nice.

Apple P-cores can decode from 8 (M1) to 10 (M4) instructions per cycle.

The backend was built to handle 8 micro-ops per cycle. It's the same amount as P-cores in Apple M1 and M2.

Also, Intel decided to increase the retirement capability to 16 micro-ops per cycle in Skymont. It allowed Intel to use various buffers, queues, and register files more efficiently and avoid increasing their size too much. For comparison, the P-cores in Apple M4 can retire 10 micro-ops per cycle.

Another interesting part is the number of execution ports. Skymont has 26 of them, including 8 ALU ports, 4 128b FP ports, 3 load/4 store AGU, 3 load, and 2 store ports.

For comparison, the P-core of Apple M4 allegedly has 8 ALU, 4 FP ports, 1 port for FMA, 3 load/2 store AGU, 3 load, and 2 store ports.

Yes, we can't directly compare architectures just by the width of the decoder, execution width, buffer sizes, and the number of ports, but it can give us a rough picture of the capabilities.

If all other parts of the Skymont were balanced well, we could get ISO performance similar to the P-cores in Apple M2.
Let's see when Skymont is out. As you wrote, width isn't the whole story and many "details" could make the width useless. But yeah from what we know it looks solid.

FWIW as far as I know Apple P cores can issue 4 128-bit operations per cycle including FMA.
 

TwistedAndy

Member
May 23, 2024
FWIW as far as I know Apple P cores can issue 4 128-bit operations per cycle including FMA.

Yes, M-series chips have four FP 128b ports, but I'm not sure if the FMA is supported on all of them or just one.

That's the same number as what we have in Skymont (with the FMA support on all of them).

As for Lion Cove, it also has four FP ports, but they are wider (256b).
 


Nothingness

Diamond Member
Jul 3, 2013
Yes, M-series chips have four FP 128b ports, but I'm not sure if the FMA is supported on all of them or just one.
Proof here that there are 4 FMA units: https://scalable.uni-jena.de/opt/sme/micro.html
M4 runs at 111 GFLOPS FP32: that is close to the theoretical 4.4 GHz * 2 ops per FMA * 4 FP32 lanes * 4 FMA ports = 140.8 GFLOPS.
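A small sketch of that peak-FP32 estimate (the numbers are the ones quoted here, not official specs):

```python
# Theoretical peak FP32 throughput from the figures quoted above.
clock_ghz = 4.4     # M4 P-core clock
fma_ports = 4       # 128-bit FMA-capable FP ports
lanes_fp32 = 4      # FP32 lanes per 128-bit port
ops_per_fma = 2     # multiply + add count as two ops

peak_gflops = clock_ghz * fma_ports * lanes_fp32 * ops_per_fma
measured_gflops = 111  # figure quoted above
print(f"Theoretical peak: {peak_gflops:.1f} GFLOPS")             # ~140.8
print(f"Measured / peak:  {measured_gflops / peak_gflops:.0%}")  # ~79%
```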

As for Lion Cove, it also has four FP ports, but they are wider (256b).
But only two of them support FMA, so the total FMA throughput is the same as M1 and later.
 

Magio

Senior member
May 13, 2024
Actually 20A is expected to be slightly better than N3 & 18A slightly better than N2.

It depends on what aspect of the process we're looking at. Both 20A and 18A have GAAFET and backside power delivery, while on TSMC's side only N2 has GAAFET and neither N3 nor N2 has BSPD; transistor density, however, is expected to remain better on the TSMC side.

Backside power and GAAFET are big innovations, but it's reasonable to expect that the first processes and products leveraging them won't fully exploit their potential right out of the gate, so TSMC's (likely) superior density on N2 might be worth more than Intel's BSPD on 18A, for example.
 

FlameTail

Diamond Member
Dec 15, 2021
Panther Lake = RTX 4050 sounds a bit too high. Maybe around 3050 Ti or a 3060 is my guess
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?
 

hemedans

Senior member
Jan 31, 2015
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?
Probably because they're starved; not enough bandwidth.
 

TwistedAndy

Member
May 23, 2024
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?

There are many reasons. The most obvious one is the memory bandwidth.

A CPU by itself does not require huge memory bandwidth; it mostly needs low latency.

For example, even in the Apple M3 Max, a single P-core can utilize ~120 GB/s, and all the cores together can use nearly 240 GB/s. It's a limitation of the fabric, and that's fine.

On the other hand, GPUs are not so sensitive to latency, but they need a lot of bandwidth. If we want to combine a CPU with a powerful GPU, we need to increase the number of memory channels. Even Nvidia RTX xx50 class performance requires at least a 256-bit memory bus, which does not make sense for most desktops and laptops.
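A rough sketch of that bandwidth gap; the bus widths and transfer rates below are illustrative assumptions, not any specific product's spec:

```python
# Peak bandwidth = bus width (bytes) * transfer rate (MT/s).
# The configurations below are illustrative, not specific products.
def peak_bw_gbs(bus_bits: int, mt_per_s: int) -> float:
    return bus_bits / 8 * mt_per_s / 1000  # GB/s

print(peak_bw_gbs(128, 8533))   # ~136.5 GB/s: 128-bit LPDDR5X-8533, typical iGPU setup
print(peak_bw_gbs(256, 8000))   # ~256 GB/s: a 256-bit LPDDR5X bus, Strix Halo style
print(peak_bw_gbs(128, 18000))  # ~288 GB/s: 128-bit GDDR6 at 18 Gbps, dGPU style
```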

For example, Strix Halo, with a pretty promising GPU, is expected to have a 256-bit bus with soldered memory.

Another problem is cost and flexibility. There are not many customers willing to pay much more for an xx50-class GPU on their CPU, because they plan to use something more powerful (xx70, xx80, or xx90) from another vendor.
 

Ghostsonplanets

Senior member
Mar 1, 2024
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?
Yes? You want a <200mm² SoC to match a dedicated 140mm² graphics die fabbed on a modern node with a modern uArch? One that is also very efficient, achieves high clocks, and is fed by dedicated memory with high amounts of bandwidth.

The SoC iGPU needs to share a limited power budget, memory, and bandwidth with the other parts of the SoC. The fact that we have such big and reasonably fast iGPUs is already quite a feat.

To match a modern dedicated dGPU, you need to dedicate a huge part of your SoC to graphics IP and widen the memory subsystem, which is what Strix Halo is doing.

But that's not a trade-off chipmakers will make, because the vast majority of clients don't need such high-speed graphics, and the ones who do are better served by a dGPU.
So you're only hitting a niche of budget gamers or the newfound local-LLM users. That's a very small subset of users, and they're best catered to (if such a niche is even worth appealing to) with a dedicated and expensive specialty SKU.

Therefore, iGPUs will basically become faster and wider by riding the logic and power improvements of newer nodes and the bandwidth increases of modern memory standards, and/or when necessity arises, such as Intel and AMD widening their iGPUs to take advantage of local low-power AI workloads.

If we look at modern upcoming iGPUs, Strix Point already uses 16 CUs/1024 ALUs at 2.9 GHz. Panther Lake-H is widening the GPU from MTL-H's 8 Xe cores/1024 ALUs at 2.3 GHz to 12 Xe3 cores/1536 ALUs at up to 3 GHz.

So in PTL-H's case, you're already seeing a modern iGPU matching the ALU count of the Switch 2 and Series S GPUs, which are dedicated consoles, but at much higher clock speeds, with generous amounts of cache, and with a faster LPDDR5X standard than Switch 2 (PTL-H at 8533 vs Switch 2 at 7500 MT/s). That should be good enough to surpass Switch 2 and Series S performance, and a laptop with PTL-H would easily be able to run any current-gen game, released or yet to be released. That's good enough imo.
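For a rough sense of that memory-speed gap, here's a quick sketch; it assumes a 128-bit bus on both parts, which is my assumption rather than a confirmed spec:

```python
# Rough peak-bandwidth comparison; assumes a 128-bit bus for both parts,
# which is an assumption here rather than a confirmed spec.
def peak_bw_gbs(bus_bits, mt_per_s):
    return bus_bits / 8 * mt_per_s / 1000  # GB/s

ptl_h   = peak_bw_gbs(128, 8533)  # ~136.5 GB/s
switch2 = peak_bw_gbs(128, 7500)  # ~120.0 GB/s
print(f"PTL-H: {ptl_h:.1f} GB/s vs Switch 2: {switch2:.1f} GB/s (+{ptl_h / switch2 - 1:.0%})")
```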
 

Doug S

Diamond Member
Feb 8, 2020
10nm was delayed by at least two years, and that's not counting the additional delays that Intel suffered even launching the aborted 10nm that went into Cannonlake. In reality the delay was more like 3-4 years before Intel even managed IceLake! There's simply no comparison.

It was worse than that. Intel roadmaps in 2013 showed 10nm chips being delivered in 2015, which later slipped to 2016. Then in summer 2015 they were pushed back to 2017, the start of a long line of delays until Intel finally shipped 10nm for real in Q4 2019. So it was delayed by at least four years, maybe a bit longer depending on exactly when in 2015 they originally roadmapped 10nm shipments.

To say it is nothing like TSMC's 3nm delays is a massive understatement.
 

poke01

Diamond Member
Mar 8, 2022
If we consider the raw performance and power consumption, the P-core in M4 consumes more than twice as much power as the one in M1 in SPEC 2017 (7.21W vs 3.43W for INT and 8.95W vs 3.92W for FP) while being 50% faster. That's a pretty good tradeoff, but we are seeing a huge power consumption increase on the newer node (N3E vs N5). In the case of M4 Max, Apple has to throttle clocks to M3 Max levels to fit into the power package.
It's because Apple, for the first time ever, is using HP libraries for CPU1 (the P-core cluster). This is FinFlex at work.

“TechInsights' analysis also revealed a hybrid library approach: UHD libraries for GPU and CPU2, and a new high-performance library for CPU1. This design optimizes for various computational demands within a unified architecture.”


This is what Intel also does to achieve high clocks, but it burns through much more power on 10nm.
 

Hulk

Diamond Member
Oct 9, 1999
Doesn't an increase in IPC in a CPU also have to rely on the parallelism that can be extracted from the code? Wouldn't that put a limit on IPC? Or at the very least make it a curve of diminishing returns that approaches a limit?

Perhaps the "battleground" for performance will eventually have to shift from hardware to software?

For example, if you have two video editors that are essentially equal in all things except performance, and one runs twice as fast as the other on the same hardware, then that is a "problem" for the software developers. Seems like hardware and software development must work hand-in-hand.

How much are you waiting on your computer these days?
I never wait for the following apps:
Chrome
Thunderbird
MS Office apps
Corel Draw

I am slowed down in my workflow by the following apps:
Vegas Video 21
Presonus Studio One 6
Photoshop
Topaz Photo AI (a little)
Topaz video AI (a LOT)
DxO PureRaw 3

Full disclosure, this is on my 14900K. On my Surface Laptop 2 many of these apps have me waiting quite a bit and some are unusable (Topaz AI).

But that is 8 cores of Raptor Cove at 5.5GHz vs 4 Skylake cores at less than 3GHz. Lunar Lake can't come fast enough for me.
 

DavidC1

Platinum Member
Dec 29, 2023
I'm not sure where those numbers for pipeline stages are coming from. I have huge doubts about the Apple M2's 9-cycle latency.

As for the memory subsystem latency, it's pretty similar for Apple, Intel, AMD, and others. L1 cache usually has 4-5 cycle latency, L2 - 16-20 cycles.

In general, there's no simple answer on how to achieve a better performance. The number of pipeline stages does not matter that much. Yes, I remember those debates around Pentium 3, Netburst, etc., but now the CPUs are much more complex.
This is nonsense. The focus on clocks is what kills efficiency on the P-cores (aside from horrible execution). Pentium 4 amply demonstrated that extra pipeline stages need more transistors than originally expected.

By aiming for lower clocks you can have cheaper branch mispredicts and caches with fewer cycles of latency. And you need fewer transistors, which means better efficiency.

The memory subsystem is vastly superior due to better engineering and the focus on lower clocks. The A12 and its successors have a 192KB + 128KB L1 with 3-cycle latency, which completely blows the competition away. Those massive, low-latency caches are another reason why it's so power efficient.*

You should actually read proper articles describing fundamental CPU architecture rather than just guessing. And the claim that there is no simple answer is laughable: one company has clearly lapped the others for many years now, with a lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and decide by random ballot what to improve?

*Keeping data accesses as close to the core as possible is what saves power. It is that simple. Apple is just executing on common-sense logic; engineers have long said SRAM is the lowest power per bit.

These are the high-level basics of the best architecture:
- An 8-10 stage pipeline, no more; it cuts area and transistors and improves performance by lowering branch mispredict penalties.
- Lower clocks, which will increase over time with better processes.
- Lower clocks allow large caches with relatively low latency.
- All the decisions above also lower power consumption. Large SRAM reduces pJ/bit and thus requires less power per unit of compute.
- Pair all of that with excellent management and brilliant engineers.

Apple has had all of this for the longest time; that's why they are successful. It has nothing to do with being a fanboy or whatever; that's just stupid bias. It's merely recognizing good work where it is.
 