Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022



As Hot Chips 34 starts this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, Intel's first to use EUV lithography. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.




Intel Core Ultra 100 - Meteor Lake


As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)




Fjodor2001

Diamond Member
Feb 6, 2010
... and yes, in highly threaded DC applications, AMD's SMT gives it ~40% uplift.
So you mean that's the maximum MT uplift SMT can give, in a cherry-picked use case?

But what's more interesting is not the maximum for a cherry-picked use case, but the average across various MT use cases. And there it was only 18%, according to the tests I linked to previously.
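(For what it's worth, the "average" in roundups like that is usually a geometric mean of per-benchmark speedup ratios, which keeps one big winner from dragging the number up. A minimal sketch; the per-benchmark ratios below are made up purely for illustration:)

```python
import math

def geomean_uplift(ratios):
    """Geometric mean of per-benchmark speedup ratios (SMT on / SMT off)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-benchmark ratios, not real measurements.
ratios = [1.35, 1.02, 1.22, 1.10, 1.25]
uplift = (geomean_uplift(ratios) - 1) * 100
print(f"average SMT uplift: {uplift:.1f}%")
```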
 

DavidC1

Golden Member
Dec 29, 2023
So you mean that's the maximum MT uplift SMT can give, in a cherry-picked use case?
AMD's engineering being solid is what makes it better. That's why their theoretically inferior chiplet strategy works too.

Sandy Bridge was the last time Intel did well. Actually, if you look at the details of each SMT implementation, Intel always focused more on higher ST performance and on losing less when fewer than the maximum threads were in use. Hence "AMD gains more from SMT".

But since Sandy Bridge they have basically been losing that advantage, because after that they lost focus trying to get into mobile, and then came the 10nm debacle. A lot of us were disappointed by Ivy Bridge because it was such a small gain. The only chip that benefited from 22nm as hyped was the Silvermont Atom. And they hyped their 22nm as the best, the one that would let them address the mobile market.
 

511

Platinum Member
Jul 12, 2024
AMD's engineering being solid is what makes it better. That's why their theoretically inferior chiplet strategy works too.

Sandy Bridge was the last time Intel did well. Actually, if you look at the details of each SMT implementation, Intel always focused more on higher ST performance and on losing less when fewer than the maximum threads were in use. Hence "AMD gains more from SMT".

But since Sandy Bridge they have basically been losing that advantage, because after that they lost focus trying to get into mobile, and then came the 10nm debacle. A lot of us were disappointed by Ivy Bridge because it was such a small gain. The only chip that benefited from 22nm as hyped was the Silvermont Atom. And they hyped their 22nm as the best, the one that would let them address the mobile market.
Their 22nm was the best at the time, and that is what made their designs look better; it's not that the core team's designs were good after that.
 

DavidC1

Golden Member
Dec 29, 2023
Their 22nm was the best at the time, and that is what made their designs look better; it's not that the core team's designs were good after that.
They probably had to refocus on higher HP performance somewhere in the late Haswell generation, because they would have quickly realized they wouldn't take significant mobile market share anytime soon; and even if they did, they would lose money for many years, while sacrificing HP transistors would immediately hit their bread-and-butter line.

But Ivy Bridge was overall a disappointment. Later, some Intel insiders said they didn't care about pushing the uarch because "it wasn't necessary" while AMD was pushing out Bulldozers.
 

Thunder 57

Diamond Member
Aug 19, 2007
They probably had to refocus on higher HP performance somewhere in the late Haswell generation, because they would have quickly realized they wouldn't take significant mobile market share anytime soon; and even if they did, they would lose money for many years, while sacrificing HP transistors would immediately hit their bread-and-butter line.

But Ivy Bridge was overall a disappointment. Later, some Intel insiders said they didn't care about pushing the uarch because "it wasn't necessary" while AMD was pushing out Bulldozers.

The Ivy Bridge platform was what really improved: PCIe 3.0, USB 3.0, and a Z-series chipset at launch.
 

deasd

Senior member
Dec 31, 2013
Also, does HT really add 40% MT perf?
For HPC and many parallel workloads, yes.
Here's one test for Zen3 which says 18% MT perf increase on average for Zen3 with HT on vs off:

Here's another one for Ryzen AI 9 HX 370 that also says 18% geomean:

Please be aware that some of these benches are either not optimized enough for multi-core, or badly restricted by Amdahl's Law, or aren't even pure CPU benchmarks.

Which means we need a single-core test with SMT on and off, to factor out the lack of multi-core optimization and Amdahl's Law and see what SMT alone contributes.

Here's a SPEC multi-threaded test below, taken from someone I forget who, maybe DavidHuang's Twitter? It tests both Golden Cove and Zen 4 with a single core, SMT on vs. off:


[attached charts: per-workload SPEC results for Golden Cove and Zen 4, single core with SMT on vs. off]

520_omnetpp_r is an exceptional case for both: its power usage doesn't go up with SMT on.

Golden Cove's SMT provides an 18.82% uplift on average, while Zen 4's provides 26.30%.
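The Amdahl's Law point above can be made concrete: any serial fraction caps chip-wide MT scaling, so a full-chip SMT on/off delta gets compressed by the same ceiling, which is why a single-core comparison isolates SMT better. A quick sketch (the 90% parallel fraction is an arbitrary example, not measured):

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: best-case speedup on n threads when only
    parallel_fraction of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even 90%-parallel code tops out near 7.8x on 32 threads, far below 32x,
# so benchmark suites with serial phases understate per-core SMT benefit.
print(round(amdahl_speedup(0.9, 32), 2))   # ~7.8
print(round(amdahl_speedup(1.0, 32), 2))   # 32.0 in the ideal case
```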
 

OneEng2

Senior member
Sep 19, 2022
IIRC Skymont is 1.1mm² for the core+L1, and Zen 5c is 1.9mm² on N3E
Agree. Skymont is on N3B, which is ~8-10% denser than N3E, IIRC. That gives you about 1.2mm² vs 1.9mm² normalized to N3E.

If you add in the SMT and AVX-512 advantages in server workloads, a single Zen 5c core is about 1.5 times as potent in MT as a single Skymont, which puts them more or less on par. It's fuzzy math for sure, though, as I don't have any benchmarks on actual Skymont server parts.

I think it is fair to assume that a 256-core Venice Zen 6c could be more than a match for a 288-core Darkmont, though.
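The normalization above is easy to sanity-check. This takes the post's own figures as given (1.1mm² Skymont core+L1 on N3B, 1.9mm² Zen 5c on N3E, N3B ~8-10% denser than N3E); they are forum estimates, not vetted data:

```python
skymont_n3b_mm2 = 1.1   # Skymont core + L1 on N3B, per the quote above
zen5c_n3e_mm2 = 1.9     # Zen 5c on N3E, per the quote above

# "Port" Skymont to the looser N3E process: area grows by the density gap.
for density_gap in (1.08, 1.10):   # the ~8-10% N3B-over-N3E figure
    skymont_n3e_mm2 = skymont_n3b_mm2 * density_gap
    print(f"{density_gap - 1:.0%} denser: Skymont ≈ {skymont_n3e_mm2:.2f} mm², "
          f"Zen 5c is {zen5c_n3e_mm2 / skymont_n3e_mm2:.2f}x larger")
```

Either way the area ratio lands near 1.6x, which is roughly where the ~1.5x MT-potency estimate puts them "more or less on par."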
 

OneEng2

Senior member
Sep 19, 2022
Seems like all modern x86 processors have a pretty wide decode. Skymont has a very elaborate front end and can decode 9 instructions per cycle. Zen 5 can decode 8 per cycle, but only with SMT in "dual-core" mode; for a single thread, I believe Zen 5 is limited to 4 instructions per cycle... which seems just fine, as I don't think its execution path can sustain more than that anyway (my Zen 5 architecture knowledge is rusty ;) ).

The real question I see is: how does Skymont take advantage of 8 instructions per cycle without SMT?

It has been a rule of thumb for some time that ILP over 2 is the minority case, over 3 is infrequent, and over 4 almost never happens. Those extra resources are then used for SMT instead.

Has that changed since I dug deep into it decades ago?
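That rule of thumb is essentially a statement about dependency chains: with unlimited execution units, sustainable ILP is bounded by instruction count divided by the critical-path depth of the dependency graph. A toy illustration (the instruction streams are made up, not from any real trace):

```python
def max_ilp(deps):
    """deps[i] lists the earlier instructions that instruction i depends on.
    Returns instruction count / critical-path depth: the ILP ceiling
    assuming unlimited execution units."""
    depth = []
    for srcs in deps:
        # An instruction executes one cycle after its deepest producer.
        depth.append(1 + max((depth[s] for s in srcs), default=0))
    return len(deps) / max(depth)

serial = [[], [0], [1], [2], [3], [4]]        # one long chain
two_chains = [[], [0], [1], [], [3], [4]]     # two independent chains
print(max_ilp(serial), max_ilp(two_chains))   # 1.0 2.0
```

Real code mixes chains of varying depth, which is why average ILP hovers near 2 even on very wide machines unless branch prediction and memory latency cooperate.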
 

Doug S

Diamond Member
Feb 8, 2020
Seems like all modern x86 processors have a pretty wide decode. Skymont has a very elaborate front end and can decode 9 instructions per cycle. Zen 5 can decode 8 per cycle, but only with SMT in "dual-core" mode; for a single thread, I believe Zen 5 is limited to 4 instructions per cycle... which seems just fine, as I don't think its execution path can sustain more than that anyway (my Zen 5 architecture knowledge is rusty ;) ).

The real question I see is: how does Skymont take advantage of 8 instructions per cycle without SMT?

It has been a rule of thumb for some time that ILP over 2 is the minority case, over 3 is infrequent, and over 4 almost never happens. Those extra resources are then used for SMT instead.

Has that changed since I dug deep into it decades ago?

Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
 

Io Magnesso

Member
Jun 12, 2025
Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
It doesn't help much if you can't make good use of that width.
However, being wide is not a bad thing in itself.
 

511

Platinum Member
Jul 12, 2024
Agree. Skymont is on N3B, which is ~8-10% denser than N3E, IIRC. That gives you about 1.2mm² vs 1.9mm² normalized to N3E.
4% denser
If you add in the SMT and AVX-512 advantages in server workloads, a single Zen 5c core is about 1.5 times as potent in MT as a single Skymont, which puts them more or less on par. It's fuzzy math for sure, though, as I don't have any benchmarks on actual Skymont server parts.
Normalized, Zen 5c is ~1.67x the area of Skymont 😂 (1.9/1.14); Skymont wins just by a little bit
I think it is fair to assume that a 256-core Venice Zen 6c could be more than a match for a 288-core Darkmont, though.
Yup
 

511

Platinum Member
Jul 12, 2024
Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
AMD and Intel are also spending a large part of core area on wider vector units.
 

OneEng2

Senior member
Sep 19, 2022
Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
Good point; however, I also think these processors are nearly as big as a pie pan. Someone correct me if I am wrong.

SMT is all about efficient use of die area to maximize PPA in MT applications.

I also think that monolithic designs are a thing of the past.
AMD and Intel are also spending a large part of core area on wider vector units.
Very true, and their die size is still well below the M4's, IIRC.
 

CouncilorIrissa

Senior member
Jul 28, 2023
Very true, and their die size is still well below the M4's, IIRC.
No? Accounting for core-private caches only, the M4 P-core is at ~3mm² and LNC is at 4.6mm².

I don't know where this myth of enormous Apple cores comes from. They have large structures and they extract great perf/W from them, but they're not that large in terms of area. They're just *that good*.
 