Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads

Tigerick · Aug 22, 2022

Wildcat Lake (WCL) Specs

Intel Wildcat Lake (WCL) is upcoming mobile SoC replacing Raptor Lake-U. WCL consists of 2 tiles: compute tile and PCD tile. It is true single die consists of CPU, GPU and NPU that is fabbed by 18-A process. Last time I checked, PCD tile is fabbed by TSMC N6 process. They are connected through UCIe, not D2D; a first from Intel. Expecting launching in Q1 2026.

	Intel Raptor Lake U	Intel Wildcat Lake 15W?	Intel Lunar Lake	Intel Panther Lake 4+0+4
Launch Date	Q1-2024	Q2-2026	Q3-2024	Q1-2026
Model	Intel 150U	Intel Core 7	Core Ultra 7 268V	Core Ultra 7 365
Dies	2	2	2	3
Node	Intel 7 + ?	Intel 18-A + TSMC N6	TSMC N3B + N6	Intel 18-A + Intel 3 + TSMC N6

CPU	2 P-core + 8 E-cores	2 P-core + 4 LP E-cores	4 P-core + 4 LP E-cores	4 P-core + 4 LP E-cores
Threads	12	6	8	8
Max Clock	5.4 GHz	?	5 GHz	4.8 GHz
L3 Cache	12 MB		12 MB	12 MB
TDP	15 - 55 W	15 W ?	17 - 37 W	25 - 55 W

Memory	128-bit LPDDR5-5200	64-bit LPDDR5	128-bit LPDDR5x-8533	128-bit LPDDR5x-7467
Size	96 GB		32 GB	128 GB
Bandwidth			136 GB/s

GPU	Intel Graphics	Intel Graphics	Arc 140V	Intel Graphics
RT	No	No	YES	YES
EU / Xe	96 EU	2 Xe	8 Xe	4 Xe
Max Clock	1.3 GHz	?	2 GHz	2.5 GHz

NPU	GNA 3.0	18 TOPS	48 TOPS	49 TOPS

As Hot Chips 34 starting this week, Intel will unveil technical information of upcoming Meteor Lake (MTL) and Arrow Lake (ARL), new generation platform after Raptor Lake. Both MTL and ARL represent new direction which Intel will move to multiple chiplets and combine as one SoC platform.

MTL also represents new compute tile that based on Intel 4 process which is based on EUV lithography, a first from Intel. Intel expects to ship MTL mobile SoC in 2023.

ARL will come after MTL so Intel should be shipping it in 2024, that is what Intel roadmap is telling us. ARL compute tile will be manufactured by Intel 20A process, a first from Intel to use GAA transistors called RibbonFET.

Saylick · Jun 16, 2024

Henry swagger said:
Crestmont is better than bergamo. So you are wrong lol

Lol, no.

DrMrLordX · Jun 16, 2024

lightisgood said:
Intel's 10nm formerly was targeted for 2016-2017, however, transferred to 2017-18 by Krzanich afterward.

That is what is known as a "delay".

Anyway let's hope Intel doesn't "transfer" any more nodes into the distant future.

TwistedAndy · Jun 16, 2024

poke01 said:
Apple is boring because their cores are already good. To the point where Apple’s P core is more efficient and more powerful clock per clock than Skymont and LionCove.

Apple M1 was really good, but if we compare M4 to M1, the difference is nearly 8% on the same clock in Geekbench 5 (149.3% * 3.19 / 4.4 - 100% = 8.2%, link to results). If we use SPEC 2017, the IPC difference will be nearly the same (~11%).

If we consider the raw performance and power consumption, the P-core in M4 consumes more than twice as much power as the one in M1 in SPEC 2017 (7.21W vs 3.43W for INT and 8.95W vs 3.92W for FP) while being 50% faster. That's a pretty good tradeoff, but we are seeing a huge power consumption increase on the newer node (N3E vs N5). In the case of M4 Max, Apple has to throttle clocks to M3 Max levels to fit into the power package.

As for power efficiency, it's the most misleading metric out there because the performance/power curve is not linear. There's a point of maximum efficiency, but device manufacturers often push the power limits higher to achieve a slight performance increase. Technically, if Apple decides to run M4 on 4 GHz instead of 4.4GHz, the power consumption will nearly match M3 (5-6W), but the performance will be only 7-8% higher in SPEC 2017.

As for Skymont, the "sweet spot" will be nearly 3-5W at 4.0-4.6 GHz. I don't think it makes sense for Intel to push it even higher, but who knows?

poke01 said:
Skymont is great because Crestmont sucked and it can no longer be considered an Atom core. We have yet to see how Lunar Lake performs in benchmarks and applications.

Crestmont is a further improvement to Gracemont, which was a huge step compared to Tremont. Skymont is another huge step forward.

Henry swagger said:
P core team need to take more ideas from Stevens's e core team.. to make the p core even more area efficient like skymont

Actually, they take some ideas from the E-cores (split INT/FP scheduler, wider retire, etc.), but it's very hard to refactor a large core. Also, Intel may decide to test some ideas on E-cores first before bringing them on P-cores.

Kosusko · Jun 16, 2024

poke01 said:
skymont has 26 execution ports more than M4-P core and X925. This is no longer a "e-core" but a "'middle" core.

CPU CORE MICROARCHITECTURE

• LITTLE ATOM (MONT) | Architecture All Access: Skymont E-core Microarchitecture Explained

• .big CORE (COVE) | Architecture All Access: Lion Cove P-core Microarchitecture Explained

Nothingness · Jun 16, 2024

TwistedAndy said:
From an architectural standpoint, Skymont was built as a P-core and is not so different from Apple's P-cores or ARM Cortex-X2.

It can decode 9 instructions per cycle using three non-blocking decoding clusters. If there's a complex instruction requiring microcode reading, the other decoding clusters are not blocked. Intel calls it "nanocode," and on paper, it looks nice.

Apple P-cores can decode from 8 (M1) to 10 (M4) instructions per cycle.

The backend was built to handle 8 micro-ops per cycle. It's the same amount as P-cores in Apple M1 and M2.

Also, Intel decided to increase the retirement capability to 16 micro-ops per cycle in Skymont. It allowed Intel to use various buffers, queues, and register files more efficiently and avoid increasing their size too much. For comparison, the P-cores in Apple M4 can retire 10 micro-ops per cycle.

Another interesting part is the number of execution ports. Skymont has 26 of them, including 8 ALU ports, 4 128b FP ports, 3 load/4 store AGU, 3 load, and 2 store ports.

For comparison, the P-core of Apple M4 allegedly has 8 ALU, 4 FP ports, 1 port for FMA, 3 load/2 store AGU, 3 load, and 2 store ports.

Yes, we can't directly compare architectures just by the width of the decoder, execution width, buffer sizes, and the number of ports, but it can give us a rough picture of the capabilities.

If all other parts of the Skymont were balanced well, we could get ISO performance similar to the P-cores in Apple M2.

Let's see when Skymont is out. As you wrote, width isn't the whole story and many "details" could make the width useless. But yeah from what we know it looks solid.

FWIW as far as I know Apple P cores can issue 4 128-bit operations per cycle including FMA.

TwistedAndy · Jun 16, 2024

Nothingness said:
WIW as far as I know Apple P cores can issue 4 128-bit operations per cycle including FMA.

Yes, M-series chips have four FP 128b ports, but I'm not sure if the FMA is supported on all of them or just one.

That's the same number as what we have in Skymont (with the FMA support on all of them).

As for Lion Cove, it also has four FP ports, but they are wider (256b).

Nothingness · Jun 16, 2024

TwistedAndy said:
Yes, M-series chips have four FP 128b ports, but I'm not sure if the FMA is supported on all of them or just one.

Proof here that there are 4 FMA: https://scalable.uni-jena.de/opt/sme/micro.html
M4 runs at 111 GFLOPS FP32: that is close the theoretical 4.4 GHz * 2 ops per FMA * 4 FP32 * 4 FMA = 141.

TwistedAndy said:
As for Lion Cove, it also has four FP ports, but they are wider (256b).

But only two FMA so that's the same FMA bandwidth as M1+.

SiliconFly · Jun 16, 2024

SiliconFly · Jun 16, 2024

SiliconFly · Jun 16, 2024

Magio · Jun 16, 2024

SiliconFly said:
Actually 20A is expected to be slightly better than N3 & 18A slightly better than N2.

It depends on what aspect of the process we're looking at, both 20A and 18A have GAAFET and backside power delivery, while only N2 has GAAFET and neither N3 nor N2 have BSPD, but transistor density is expected to remain better on the TSMC side.

Backside power and GAAFET are big innovations, but it would be reasonable to expect the first processes and chipmakers to ship products leveraging them to not fully exploit their potential right out of the gate, so TSMC's (likely) superior density on N2 might be worth more than Intel's BSPD on 18A for example.

SiliconFly · Jun 16, 2024

FlameTail · Jun 16, 2024

SiliconFly said:
Panther Lake = RTX 4050 sounds a bit too high. Maybe around 3050 Ti or a 3060 is my guess

Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation.Is that too much to ask?

hemedans · Jun 16, 2024

FlameTail said:
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation.Is that too much to ask?

Probably because they starve, no enough bandwidth.

TwistedAndy · Jun 16, 2024

FlameTail said:
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation.Is that too much to ask?

There are many reasons. The most obvious one is the memory bandwidth.

CPU by itself does not require huge memory bandwidth. It should be good enough with a small latency.

For example, even in the Apple M3 Max, a single P-core can utilize ~120GB/s. All the cores can use nearly 240GB/s in total. It's a limitation of fabric, and that's fine.

On the other side, GPUs are not so sensitive to latency, but they need a lot of bandwidth. When we want to combine a CPU with a powerful GPU, we need to increase the number of memory channels. Even nVidia RTX xx50 performance requires at least a 256-bit memory bus. It does not make sense for most desktops and laptops.

For example, Strix Halo, with a pretty promising GPU, is expected to have a 256-bit bus with soldered memory.

Another problem is the cost and flexibility. There are not so many customers willing to pay much more to have xx50 class GPU on their CPU because they plan to use something more powerful (xx70, xx80, and xx90) from another vendor.

SiliconFly · Jun 16, 2024

SiliconFly · Jun 16, 2024

Ghostsonplanets · Jun 16, 2024

FlameTail said:
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel). In my humble opinion, I think the iGPU of the current generation should match the RTX xx50 of the previous generation.Is that too much to ask?

Yes? You want a <200mm² SoC part to match a dedicated 140mm² Graphics die fabbed on modern node, with modern uArch? Which also being very efficient, achieving high clocks and being supplied by dedicated memory and high amounts of bandwidth.

The SoC iGPU needs to share limited power budget, memory and bandwidth with the others parts of the SoC. The fact we have such big and reasonable fast iGPUs is already quite the feat.

To match a modern dedicated dGPU part, you need either to dedicated huge parts of your SoC to Graphics IP and change the memory subsystem to be wider. Which is what Strix Halo is doing.

But that's not a trade-off chipmakers will do. Because the vast majority of clients doesn't need such high speed graphics and the ones who need it are better supplied with a dGPU.
So you're only hitting a niche of budget gamers or the new found local LLM training users niche. That's a very small subset of users and they're best catered (if such niche is even worth appealing to) with a dedicated and expensive specialty SKU.

Therefore, iGPUs will basically become faster and wider by riding the logic and power improvements of newer nodes, increase in bandwidth due to modern memory standards and/or when necessity arises. Such as Intel and AMD widening the iGPU for them to make advantage of local low power AI workloads.

If we look at modern upcoming iGPUs, Strix Point is already using a 16 CU/1024 ALUs@2.9GHz. Panther Lake H is widening the GPU from MTL-H 8 Xe Cores/1024 ALUs@2.3 GHz to 12 Xe³ Cores/1536 ALUs@<=3GHz.

So in PTL-H case, you're already seeing a modern iGPU matching ALU count of Switch 2 and Series S GPUs, which are dedicated consoles. But at much higher clockspeeds and having generous amounts of cache and using a faster LPDDR5x standard than Switch 2 (PTL-H 8533 x Switch 2 7500MT/s). That should be good enough to surpass Switch 2 and Series S performance and a laptop with PTL-H would easily be able to run any current gen game released and yet to be released. That's good enough imo.

Doug S · Jun 16, 2024

DrMrLordX said:
10nm was delayed by at least two years, and that's not counting the additional delays that Intel suffered even launching the aborted 10nm that went into Cannonlake. In reality the delay was more like 3-4 years before Intel even managed IceLake! There's simply no comparison.

It was worse than that. Intel roadmaps in 2013 showed 10nm chips being delivered in 2015, which later slipped to 2016. Then in summer 2015 they were pushed back to 2017, the start of a long line of delays until Intel finally shipped 10nm for real in Q4 2019. So it was delayed by at least four years, maybe a bit longer depending on exactly when in 2015 they originally roadmapped 10nm shipments.

To say it is nothing like TSMC's 3nm delays is a massive understatement.

SiliconFly · Jun 16, 2024

Nothingness · Jun 16, 2024

Intel started having process issues at 14nm with bad yields and it took them time to make it work as expected.

Intel’s 14nm Technology in Detail

www.anandtech.com

poke01 · Jun 16, 2024

TwistedAndy said:
As for Skymont, the "sweet spot" will be nearly 3-5W at 4.0-4.6 GHz. I don't think it makes sense for Intel to push it even higher, but who knows?

Show me this or is this your imagination?

poke01 · Jun 16, 2024

TwistedAndy said:
If we consider the raw performance and power consumption, the P-core in M4 consumes more than twice as much power as the one in M1 in SPEC 2017 (7.21W vs 3.43W for INT and 8.95W vs 3.92W for FP) while being 50% faster. That's a pretty good tradeoff, but we are seeing a huge power consumption increase on the newer node (N3E vs N5). In the case of M4 Max, Apple has to throttle clocks to M3 Max levels to fit into the power package.

It’s because Apple for the first time ever is using HP libraries for their CPU1. This is FinFlex at work.

“TechInsights' analysis also revealed a hybrid library approach: UHD libraries for GPU and CPU2, and a new high-performance library for CPU1. This design optimizes for various computational demands within a unified architecture.”

Introducing TSMC N3E: The Power Behind Apple's M4 SoC | TechInsights

In a recent teardown of the Apple iPad Pro 11-inch, TechInsights revealed details of Apple's latest silicon: the Apple M4 SoC, codenamed TMRV93, built on TSMC's advanced N3E process. This surprise release demonstrates Apple's agility in adopting cutting-edge semiconductor technologies ahead of...

www.techinsights.com

This is what Intel also uses to achieve high clocks buts it burns thru power much much more on 10nm.

Hulk · Jun 16, 2024

Doesn't an increase in IPC in a CPU also have to rely on the parallelism that can be extracted from the code? Wouldn't that put a limit on IPC? Or at the very least an exponential curve that has a limit?

Perhaps the "battleground" for performance will eventually have to shift from hardware to software eventually?

For example, if you have two video editors that are essentially equal in all things except performance and once runs twice as fast as the other on the same hardware then that is a "problem" for the software developers. Seems like hardware and software development must work hand-in-hand.

How much are you waiting on your computer these days?
I never wait for the following apps:
Chrome
Thunderbird
MS Office apps
Corel Draw

I am slowed down in my workflow by the following apps:
Vegas Video 21
Presonus Studio One 6
Photoshop
Topaz Photo AI (a little)
Topaz video AI (a LOT)
DxO PureRaw 3

Full disclosure, this is on my 14900K. On my Surface Laptop 2 many of these apps have me waiting quite a bit and some are unusable (Topaz AI).

But that is 8 cores of Raptor Cove at 5.5GHz vs 4 Skylake cores at less than 3GHz. Lunar Lake can't come fast enough for me.

DavidC1 · Jun 16, 2024

TwistedAndy said:
I'm not sure where those numbers for pipeline stages are coming from. I have huge doubts about the Apple M2's 9-cycle latency.

As for the memory subsystem latency, it's pretty similar for Apple, Intel, AMD, and others. L1 cache usually has 4-5 cycle latency, L2 - 16-20 cycles.

In general, there's no simple answer on how to achieve a better performance. The number of pipeline stages does not matter that much. Yes, I remember those debates around Pentium 3, Netburst, etc., but now the CPUs are much more complex.

This is nonsense. The focus on clocks is what kills efficiency on the P cores(aside from horrible execution). Pentium 4 has amply demonstrated that pipeline stages need more transistors than originally expected.

By aiming for lower clocks you can have less branch mispredicts and faster cycle caches. And you need less transistors, meaning more efficient.

Memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192KB + 128KB L1 for the A12 and successors with 3 cycle latency. It completely blows the competition away. It's massive caches with low latency is another reason why it's so power efficient.*

You should actually read into proper articles describing fundamental CPU architecture rather than just guessing. And the claims that there is no simple answer is laughable - one company clearly laps around the others for many years now, the lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by a ballot on what to improve?

*Having data accesses closest to the chip is what saves power. It is that simple. Apple is just executing on common sense logic. Engineers have long said SRAM is the lowest power per bit.

This is the high level basics of the best architecture
-8-10 stage pipeline, no more, cuts on area, transistors, and improves performance by lowering branch mispredicts.
-Lower clocks, which will increase over time with better process.
-Lower clocks allows making large caches with relatively low latency.
-All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit thus requires less power per compute.
-Pair with excellent management and brilliant engineers.

It is Apple that had it for the longest time, that's why they are successful. Has nothing to do with being a fanboy or whatever. That's just stupid bias. It's merely recognizing good work where it is.

Discussion Intel Meteor, Arrow, Lunar & Panther Lakes + WCL Discussion Threads

Senior member

Attachments

Diamond Member

Lifer

Member

Senior member

Diamond Member

Member

Attachments

Diamond Member

Golden Member

Golden Member

Golden Member

Senior member

Golden Member

Diamond Member

Senior member

Member

Golden Member

Golden Member

Senior member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Platinum Member