Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022



As Hot Chips 34 starts this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, Intel's first to use EUV lithography. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.




Intel Core Ultra 100 - Meteor Lake


As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)




Fjodor2001

Diamond Member
Feb 6, 2010
... and yes, in highly threaded DC applications, AMD's SMT gives it ~40% uplift.
So you mean that's the maximum MT uplift SMT can give, in a cherry-picked use case?

But what's more interesting is not the maximum for a cherry-picked use case, but the average across various MT use cases. And there it was only 18%, according to the tests I linked to previously.
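(For what it's worth, the "average" in roundups like that is usually a geometric mean of per-benchmark speedup ratios, which keeps one big winner from dragging the number up. A minimal sketch; the per-benchmark ratios below are made up purely for illustration:)

```python
import math

def geomean_uplift(ratios):
    """Geometric mean of per-benchmark speedup ratios (SMT on / SMT off)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-benchmark ratios, not real measurements.
ratios = [1.35, 1.02, 1.22, 1.10, 1.25]
uplift = (geomean_uplift(ratios) - 1) * 100
print(f"average SMT uplift: {uplift:.1f}%")
```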
 

DavidC1

Golden Member
Dec 29, 2023
So you mean that's the maximum MT uplift SMT can give, in a cherry-picked use case?
AMD's engineering being solid is what makes it better. That's why their theoretically inferior chiplet strategy works too.

Sandy Bridge was the last time Intel did well. Actually, if you look at the details of each SMT implementation, Intel always focused more on higher ST performance and on losing less when fewer than the maximum threads were in use. Hence "AMD gains more from SMT".

But since Sandy Bridge they have basically been losing that advantage, because after that they lost focus trying to get into mobile, and then came the 10nm debacle. A lot of us were disappointed by Ivy Bridge because it was such a small gain. The only chip that benefited from 22nm as hyped was the Silvermont Atom. And they hyped their 22nm as the best, the one that would let them address the mobile market.
 

511

Platinum Member
Jul 12, 2024
AMD's engineering being solid is what makes it better. That's why their theoretically inferior chiplet strategy works too.

Sandy Bridge was the last time Intel did well. Actually, if you look at the details of each SMT implementation, Intel always focused more on higher ST performance and on losing less when fewer than the maximum threads were in use. Hence "AMD gains more from SMT".

But since Sandy Bridge they have basically been losing that advantage, because after that they lost focus trying to get into mobile, and then came the 10nm debacle. A lot of us were disappointed by Ivy Bridge because it was such a small gain. The only chip that benefited from 22nm as hyped was the Silvermont Atom. And they hyped their 22nm as the best, the one that would let them address the mobile market.
Their 22nm was the best at the time, and that is what made their designs look better; it's not that the core team's designs were good after that.
 

DavidC1

Golden Member
Dec 29, 2023
Their 22nm was the best at the time, and that is what made their designs look better; it's not that the core team's designs were good after that.
They probably had to refocus on higher HP performance somewhere in the late Haswell generation, because they would have quickly realized they wouldn't take significant mobile market share anytime soon; and even if they did, they would lose money for many years, while sacrificing HP transistors would immediately hit their bread-and-butter line.

But Ivy Bridge was overall a disappointment. Later, some Intel insiders said they didn't care about pushing the uarch because "it wasn't necessary" while AMD was pushing out Bulldozers.
 

Thunder 57

Diamond Member
Aug 19, 2007
They probably had to refocus on higher HP performance somewhere in the late Haswell generation, because they would have quickly realized they wouldn't take significant mobile market share anytime soon; and even if they did, they would lose money for many years, while sacrificing HP transistors would immediately hit their bread-and-butter line.

But Ivy Bridge was overall a disappointment. Later, some Intel insiders said they didn't care about pushing the uarch because "it wasn't necessary" while AMD was pushing out Bulldozers.

The Ivy Bridge platform was what really improved: PCIe 3.0, USB 3.0, and a Z-series chipset at launch.
 

deasd

Senior member
Dec 31, 2013
Also, does HT really add 40% MT perf?
For HPC and many parallel workloads, yes.
Here's one test for Zen3 which says 18% MT perf increase on average for Zen3 with HT on vs off:

Here's another one for Ryzen AI 9 HX 370 that also says 18% geomean:

Please be aware that some of these benches are either not optimized enough for multi-core, or badly restricted by Amdahl's Law, or aren't even pure CPU benchmarks.

Which means we need a single-core test with SMT on and off, to factor out the lack of multi-core optimization and Amdahl's Law and see what SMT alone contributes.

Here's a SPEC multi-threaded test below, taken from someone I forget who, maybe DavidHuang's Twitter? It tests both Golden Cove and Zen 4 with a single core, SMT on vs. off:


[attached charts: per-workload SPEC results for Golden Cove and Zen 4, single core with SMT on vs. off]

520_omnetpp_r is an exceptional case for both: its power usage doesn't go up with SMT on.

Golden Cove's SMT provides an 18.82% uplift on average, while Zen 4's provides 26.30%.
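The Amdahl's Law point above can be made concrete: any serial fraction caps chip-wide MT scaling, so a full-chip SMT on/off delta gets compressed by the same ceiling, which is why a single-core comparison isolates SMT better. A quick sketch (the 90% parallel fraction is an arbitrary example, not measured):

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's Law: best-case speedup on n threads when only
    parallel_fraction of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even 90%-parallel code tops out near 7.8x on 32 threads, far below 32x,
# so benchmark suites with serial phases understate per-core SMT benefit.
print(round(amdahl_speedup(0.9, 32), 2))   # ~7.8
print(round(amdahl_speedup(1.0, 32), 2))   # 32.0 in the ideal case
```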
 

OneEng2

Senior member
Sep 19, 2022
IIRC Skymont is 1.1mm² for the core+L1, and Zen 5c is 1.9mm² on N3E
Agree. Skymont is on N3B, which is ~8-10% denser than N3E, IIRC. That gives you about 1.2mm² vs 1.9mm² normalized to N3E.

If you add in the SMT and AVX-512 advantages in server workloads, a single Zen 5c core is about 1.5 times as potent in MT as a single Skymont, which puts them more or less on par. It's fuzzy math for sure, though, as I don't have any benchmarks on actual Skymont server parts.

I think it is fair to assume that a 256-core Venice Zen 6c could be more than a match for a 288-core Darkmont, though.
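The normalization above is easy to sanity-check. This takes the post's own figures as given (1.1mm² Skymont core+L1 on N3B, 1.9mm² Zen 5c on N3E, N3B ~8-10% denser than N3E); they are forum estimates, not vetted data:

```python
skymont_n3b_mm2 = 1.1   # Skymont core + L1 on N3B, per the quote above
zen5c_n3e_mm2 = 1.9     # Zen 5c on N3E, per the quote above

# "Port" Skymont to the looser N3E process: area grows by the density gap.
for density_gap in (1.08, 1.10):   # the ~8-10% N3B-over-N3E figure
    skymont_n3e_mm2 = skymont_n3b_mm2 * density_gap
    print(f"{density_gap - 1:.0%} denser: Skymont ≈ {skymont_n3e_mm2:.2f} mm², "
          f"Zen 5c is {zen5c_n3e_mm2 / skymont_n3e_mm2:.2f}x larger")
```

Either way the area ratio lands near 1.6x, which is roughly where the ~1.5x MT-potency estimate puts them "more or less on par."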
 

OneEng2

Senior member
Sep 19, 2022
Seems like all modern x86 processors have a pretty wide decode. Skymont has a very elaborate front end and can decode 9 instructions per cycle. Zen 5 can decode 8 per cycle, but only with SMT in "dual-core" mode; for a single thread, I believe Zen 5 is limited to 4 instructions per cycle... which seems just fine, as I don't think its execution path can sustain more than that anyway (my Zen 5 architecture knowledge is rusty ;) ).

The real question I see is: how does Skymont take advantage of 8 instructions per cycle without SMT?

It has been a rule of thumb for some time that ILP over 2 is the minority case, over 3 is infrequent, and over 4 almost never happens. Those extra resources are then used for SMT instead.

Has that changed since I dug deep into it decades ago?
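That rule of thumb is essentially a statement about dependency chains: with unlimited execution units, sustainable ILP is bounded by instruction count divided by the critical-path depth of the dependency graph. A toy illustration (the instruction streams are made up, not from any real trace):

```python
def max_ilp(deps):
    """deps[i] lists the earlier instructions that instruction i depends on.
    Returns instruction count / critical-path depth: the ILP ceiling
    assuming unlimited execution units."""
    depth = []
    for srcs in deps:
        # An instruction executes one cycle after its deepest producer.
        depth.append(1 + max((depth[s] for s in srcs), default=0))
    return len(deps) / max(depth)

serial = [[], [0], [1], [2], [3], [4]]        # one long chain
two_chains = [[], [0], [1], [], [3], [4]]     # two independent chains
print(max_ilp(serial), max_ilp(two_chains))   # 1.0 2.0
```

Real code mixes chains of varying depth, which is why average ILP hovers near 2 even on very wide machines unless branch prediction and memory latency cooperate.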
 

Doug S

Diamond Member
Feb 8, 2020
Seems like all modern x86 processors have a pretty wide decode. Skymont has a very elaborate front end and can decode 9 instructions per cycle. Zen 5 can decode 8 per cycle, but only with SMT in "dual-core" mode; for a single thread, I believe Zen 5 is limited to 4 instructions per cycle... which seems just fine, as I don't think its execution path can sustain more than that anyway (my Zen 5 architecture knowledge is rusty ;) ).

The real question I see is: how does Skymont take advantage of 8 instructions per cycle without SMT?

It has been a rule of thumb for some time that ILP over 2 is the minority case, over 3 is infrequent, and over 4 almost never happens. Those extra resources are then used for SMT instead.

Has that changed since I dug deep into it decades ago?

Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
 

Io Magnesso

Member
Jun 12, 2025
Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
It doesn't help much if you can't make good use of that width.
However, being wide is not a bad thing in itself.
 

511

Platinum Member
Jul 12, 2024
Agree. Skymont is on N3B, which is ~8-10% denser than N3E, IIRC. That gives you about 1.2mm² vs 1.9mm² normalized to N3E.
4% denser
If you add in the SMT and AVX-512 advantages in server workloads, a single Zen 5c core is about 1.5 times as potent in MT as a single Skymont, which puts them more or less on par. It's fuzzy math for sure, though, as I don't have any benchmarks on actual Skymont server parts.
Normalized, Zen 5c is ~1.67x the area of Skymont 😂 (1.9/1.14); Skymont wins just by a little bit
I think it is fair to assume that a 256-core Venice Zen 6c could be more than a match for a 288-core Darkmont, though.
Yup
 

511

Platinum Member
Jul 12, 2024
Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
AMD and Intel are also spending a large part of core area on wider vector units.
 

OneEng2

Senior member
Sep 19, 2022
Apple has a 10-wide decode (the M4 increased it to 10 from the M3's 9-wide and the M2's 8-wide), and they have no SMT. Clearly their designers see a reason to add that width, and given that the M4 is faster in raw performance than anything AMD or Intel ships, let alone when comparing by IPC, it sure looks like it's working to me.

So it seems pretty certain that, SMT or not, Intel could take advantage of an 8-wide decode. How Skymont actually performs is another matter, but if it doesn't improve IPC over its predecessor, it won't be because they went "too wide". It will be because they neglected some other part of the architecture that's necessary to take advantage of that added width when it is possible. Being 8 or 10 wide isn't about keeping 8 or 10 units busy every cycle; it is about that "when it is possible", and about making other changes to the architecture so that "when it is possible" happens more often.
Good point; however, I also think these processors are nearly as big as a pie pan. Someone correct me if I am wrong.

SMT is all about efficient use of die area to maximize PPA in MT applications.

I also think that monolithic designs are a thing of the past.
AMD and Intel are also spending a large part of core area on wider vector units.
Very true, and their die size is still well below the M4's, IIRC.
 

CouncilorIrissa

Senior member
Jul 28, 2023
Very true, and their die size is still well below the M4's, IIRC.
No? Accounting for core-private caches only, the M4 P-core is at ~3mm² and LNC is at 4.6mm².

I don't know where this myth of enormous Apple cores comes from. They have large structures and they extract great perf/W from them, but they're not that large in terms of area. They're just *that good*.
 