Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
782
750
106
[Attached slides: PPT1.jpg, PPT2.jpg, PPT3.jpg]

With Hot Chips 34 starting this week, Intel will unveil technical information about the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new generation of platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which is Intel's first to use EUV lithography. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, called RibbonFET.



[Attached: LNL-MX.png]

Intel Core Ultra 100 - Meteor Lake

[Official slide: INTEL-CORE-100-ULTRA-METEOR-LAKE-OFFCIAL-SLIDE-2.jpg]

As mentioned by Tomshardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)



[Attached: Clockspeed.png]
 

Attachments

  • PantherLake.png (283.5 KB)
  • LNL.png (881.8 KB)

gdansk

Diamond Member
Feb 8, 2011
4,330
7,255
136
I see no problem with Intel's client roadmap; to be honest, it's better than AMD's, especially for laptops.
AMD's client roadmap is especially lazy. But so far, it seems likely they'll execute it. With Intel, I do wonder if more will be left on the cutting room floor with these layoffs and LBT's promise to reduce the number of SKUs.
 

511

Diamond Member
Jul 12, 2024
3,240
3,176
106
I see Panther Lake being even better than ARL-H for battery-related tasks. Can't wait; not everything is about IPC.

The fact that PTL doesn't use Arrow Lake's awful uncore and is similar to Lunar Lake is a massive plus.
Its iGPU is damn well better as well, but 12 Xe3 cores are going to starve on the limited bandwidth of a 128-bit LPDDR5X-8533 memory controller.
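For a rough sense of that ceiling, the peak bandwidth of a 128-bit LPDDR5X-8533 interface works out to the following back-of-the-envelope figure (ignoring efficiency losses, and assuming the 8533 refers to LPDDR5X-8533 MT/s):

$8533\ \text{MT/s} \times \frac{128\ \text{bit}}{8\ \text{bit/byte}} \approx 136.5\ \text{GB/s}$

That is what the 12 Xe3 cores and the CPU cores would all have to share.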
 

511

Diamond Member
Jul 12, 2024
3,240
3,176
106
When will the first PTL GB listing arrive? It's strange it hasn't happened by now, unless I missed it.
I am expecting jaykhin to publish data. He has preliminary data, and I asked if he could share it, but he said the data is not finalized, so he can't share it yet; he is waiting for concrete numbers. So we can expect the leaks somewhere around the end of August or September.
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
I was honestly going to suggest the same thing. If you're going to do it, go all the way with it and support SMT4 or even SMT8 like IBM. The additional cost to "go bigger" is less than the cost to do it at all.
That's because IBM actually puts a lot of effort into gaining more from SMT, unlike AMD and Intel. AMD's and Intel's implementations are basically barebones, adding barely 5% transistors to a core, never mind the whole chip.

IBM added 25% extra transistors to the core just for SMT, and those are careful, targeted improvements that require a lot of thought and planning. They said that without those improvements the gains would be a fraction of what they are: without them the average uplift is really only about 15%, whereas the targeted work raised it to ~40%.

IBM's chips are very different, though. They are sold in big iron with high margins, so IBM can afford to do that, plus very fancy things such as MCM packaging and very large eDRAM caches even 10 years ago. All of it is optimized for enterprise performance.
 

MS_AT

Senior member
Jul 15, 2024
776
1,561
96
That's because IBM actually puts a lot of effort into gaining more from SMT, unlike AMD and Intel. AMD's and Intel's implementations are basically barebones, adding barely 5% transistors to a core, never mind the whole chip.

IBM added 25% extra transistors to the core just for SMT, and those are careful, targeted improvements.
Do you have a link to an article or a series of articles that explains what these improvements were? Do you know if they were motivated by the specific workloads IBM is facing?
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
Do you have a link to an article or a series of articles that explains what these improvements were? Do you know if they were motivated by the specific workloads IBM is facing?
I updated my post.

Yes, they are purely sold in big iron, where they have very high margins. Everything they do is entirely optimized for enterprise. It's not realistic for a PC chip that spans from a 5 W tablet to a 250 W, 6 GHz enthusiast desktop.
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
For some reason, AMD gets a much bigger boost from SMT than Intel... which is ironic, since Intel introduced it to x86 long before AMD got it.

SMT4 for Intel, anyone?
It is minor differences in the SMT implementations that caused the difference originally. If you compare the Sandy Bridge generation, it has a lot more shared resources, versus Ryzen's version of SMT, where more resources are duplicated or statically partitioned per thread. Intel shared more because it wanted 2-3% better ST performance.

Intel's P-core architectures also share ports between execution units, whereas AMD goes for a distributed scheduler approach, which reduces contention under SMT, where more sharing is going on.

Chips and Cheese's analysis that Zen 5's clustered decoder is essentially a return to Bulldozer's CMT is not entirely off the mark. Bringing parts of CMT into the Zen architecture probably eased development and reduced execution risk, hence why the decoders don't combine for a single thread the way they do in post-Tremont Atom cores.

In fact, there were rumors that somewhere in the Skylake timeframe Intel would have had its own CMT architecture, but a much wider one, so the ST performance would actually be good.

An SMT focus is the wrong idea, especially because it increases validation time and risk in every generation that has it, and in the long term that is a loss. But CEOs are bound by other things, such as keeping face.
 

OneEng2

Senior member
Sep 19, 2022
725
974
106
It is minor differences in the SMT implementations that caused the difference originally. If you compare the Sandy Bridge generation, it has a lot more shared resources, versus Ryzen's version of SMT, where more resources are duplicated or statically partitioned per thread. Intel shared more because it wanted 2-3% better ST performance.

Intel's P-core architectures also share ports between execution units, whereas AMD goes for a distributed scheduler approach, which reduces contention under SMT, where more sharing is going on.

Chips and Cheese's analysis that Zen 5's clustered decoder is essentially a return to Bulldozer's CMT is not entirely off the mark. Bringing parts of CMT into the Zen architecture probably eased development and reduced execution risk, hence why the decoders don't combine for a single thread the way they do in post-Tremont Atom cores.

In fact, there were rumors that somewhere in the Skylake timeframe Intel would have had its own CMT architecture, but a much wider one, so the ST performance would actually be good.

An SMT focus is the wrong idea, especially because it increases validation time and risk in every generation that has it, and in the long term that is a loss. But CEOs are bound by other things, such as keeping face.
Interesting perspective, that the more CMT-like behavior makes SMT more effective in AMD's designs.

I'll have to agree to disagree on your final point, though. SMT provides excellent PPA. Maximizing ST performance over MT is only important in a few high-performance desktop applications and gaming. For gaming, lower-latency memory architectures are proving to be far more important than wider and deeper CPU designs.

For financial success, DC is critical. Intel should adopt AMD's "server first" approach to its central CPU design and let strategies like 3D memory, or other tricks that lower memory latency, boost ST performance.

I will say that the LPE concept is a really good one for laptop designs where battery life is key. Intel still has some really good ideas; I am just not so sure that the big/little concept was one of them, or that dropping SMT was.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,370
2,991
136
I read somewhere that IBM's SMT focus was to identify every region in the core that had resource contention, other than the main execution pipelines that are explicitly pooled and shared, and try to statically duplicate or partition them to reduce potential execution delays. As above, Intel shares a lot, AMD has a high percentage dedicated to partitioning and duplication, and IBM takes it to the extreme. Their designs are rather wide already, so execution contention on the back-end is more about maximizing resource utilization anyway. Remember that, for some applications, mainframes will use thread duplication/mirroring to assure that there are zero execution time errors. The end result is that, for some systems, even though they may have 64 threads available in a processor complex, only 32 of them are unique. They HAVE to go very wide to maintain acceptable performance with that strategy.
 

reb0rn

Senior member
Dec 31, 2009
310
115
116
To be fair, maybe SMT only works for cloud loads where each user mostly runs non-optimized code... but for someone like me, who mostly uses one app to load 12+ cores, there is no benefit.
The same goes for home users on Windows, where most apps are very badly optimized; even with many cores now, I do not see a big benefit.

For the cloud, if it's used for VPS, then sure, more threads versus plain cores will mostly be a benefit, but not so for optimized loads such as encoding or AI.
 

MS_AT

Senior member
Jul 15, 2024
776
1,561
96
To be fair, maybe SMT only works for cloud loads where each user mostly runs non-optimized code... but for someone like me, who mostly uses one app to load 12+ cores, there is no benefit.
The same goes for home users on Windows, where most apps are very badly optimized; even with many cores now, I do not see a big benefit.

For the cloud, if it's used for VPS, then sure, more threads versus plain cores will mostly be a benefit, but not so for optimized loads such as encoding or AI.
Well, HT/SMT exists to find ways to ensure backend resources are not idling. If somebody writes code with full backend utilisation in mind and knows what he/she is doing, then of course HT won't provide a benefit for that task, but code tuned to this degree is rarely encountered in the wild. From a 1T workload's point of view, it's probably better to have a magical frontend like Apple's M cores, which gives you a consistently high backend utilisation ratio, but SMT looks like the next best thing.

At least one common workload where SMT provides benefits on high-core-count machines is code compilation, and I wouldn't call compilers badly optimised.
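A minimal sketch of how one might observe this on Linux, assuming GCC and that logical CPUs 0 and 1 are SMT siblings of the same physical core (check lscpu --extended; the sibling numbering is an assumption and varies by CPU). A latency-bound pointer chase leaves the backend mostly idle, so a second copy on the sibling thread costs almost nothing:

/* smt_probe.c - crude SMT throughput probe (Linux; gcc -O2 -pthread smt_probe.c)
 * Each thread does a serially dependent pointer chase, so the core's backend
 * is mostly idle waiting on loads. If CPUs 0 and 1 really are SMT siblings
 * and combined throughput still nearly doubles, SMT recovered idle resources. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)              /* 4M nodes * 8 B = 32 MB, well past L2 */
#define STEPS 50000000L

static size_t *chain;

static void *worker(void *arg) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(*(int *)arg, &set);  /* pin this thread to one logical CPU */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    volatile size_t idx = 0;
    for (long i = 0; i < STEPS; i++)
        idx = chain[idx];        /* each load depends on the previous one */
    return NULL;
}

static double run(int n, int *cpus) {
    struct timespec t0, t1;
    pthread_t th[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) pthread_create(&th[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < n; i++) pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    chain = malloc(N * sizeof *chain);
    for (size_t i = 0; i < N; i++)            /* full-period LCG permutation, */
        chain[i] = (i * 2654435761u + 1) % N; /* so the prefetcher can't help */
    int one[] = {0}, two[] = {0, 1};   /* ASSUMPTION: 0 and 1 are siblings */
    double ta = run(1, one), tb = run(2, two);
    printf("1 thread %.2fs, 2 SMT threads %.2fs -> %.0f%% more throughput\n",
           ta, tb, 100.0 * (2.0 * ta / tb - 1.0));
    free(chain);
    return 0;
}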
 

OneEng2

Senior member
Sep 19, 2022
725
974
106
Well, HT/SMT exists to find ways to ensure backend resources are not idling. If somebody writes code with full backend utilisation in mind and knows what he/she is doing, then of course HT won't provide a benefit for that task, but code tuned to this degree is rarely encountered in the wild. From a 1T workload's point of view, it's probably better to have a magical frontend like Apple's M cores, which gives you a consistently high backend utilisation ratio, but SMT looks like the next best thing.

At least one common workload where SMT provides benefits on high-core-count machines is code compilation, and I wouldn't call compilers badly optimised.
Exactly this. Once superscalar designs hit the market, there were all these execution engines just sitting around most of the time, waiting for those moments of maximum instruction-level parallelism to come along and use them.

For the other 90% of the time, they would be left unused.

SMT took care of this by letting another thread use them when they would otherwise sit idle.
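A tiny single-threaded illustration of those idle engines (a sketch; exact ratios vary by core): both loops below execute the same number of multiply-adds, but the first is one serial dependency chain, while the second keeps four independent chains in flight for the backend to overlap.

/* ilp_demo.c - same op count, different instruction-level parallelism
 * (gcc -O2 ilp_demo.c; optimization keeps the accumulators in registers) */
#include <stdio.h>
#include <time.h>

#define ITERS 1000000000L

static double secs(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1, t2;
    volatile long sink;

    long a = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        a = a * 3 + 1;            /* every op waits on the previous result */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = a;

    long w = 1, x = 2, y = 3, z = 4;
    for (long i = 0; i < ITERS / 4; i++) {
        w = w * 3 + 1;            /* four independent chains: the core can */
        x = x * 3 + 1;            /* overlap them on separate execution    */
        y = y * 3 + 1;            /* units in the same cycle               */
        z = z * 3 + 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);
    sink = w + x + y + z;
    (void)sink;

    printf("serial chain:    %.2fs\n", secs(t0, t1));
    printf("4 indep. chains: %.2fs (same total op count)\n", secs(t1, t2));
    return 0;
}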
 

Thunder 57

Diamond Member
Aug 19, 2007
3,835
6,478
136
Thanks!

I have been using a very old estimate of 15% added size to the core. Zen 4's and Zen 5's SMT adds only 5% to the core.

Pretty darned good PPA, getting 1.3-1.4x the performance from a 5% die-space increase!

Not sure WHAT Intel was thinking removing this?

That's been the estimate since the beginning.

At the same time, we think that the 32% boost we have seen is probably the upper limit for multithreaded applications with the current implementation of Hyperthreading. A while ago, we found out that a second CPU (Athlon MP in this case) can push Kribi performance up to 81% higher, quite a bit higher. But it must be said that Hyperthreading's 20-32% boost is incredibly high, considering that it cost only 5% extra die space.

Source
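Spelling out the PPA arithmetic behind the figures quoted above (a 1.3-1.4x MT uplift for ~5% extra core area):

$1.30 / 1.05 \approx 1.24 \quad\text{and}\quad 1.40 / 1.05 \approx 1.33$

So even after charging SMT for its die area, the core delivers roughly 24-33% more multithreaded throughput per unit of area.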
 
Jul 27, 2020
26,461
18,190
146
P4 HT. Oh, what have you done to my fluttering heart!

My theory on why SMT4 would've been perfect for P4.

Its worst problem was branch mispredictions leading to pipeline stalls, right?

Well, if four threads are in various stages of execution, you can mispredict four times as much and still have at least one thread make it all the way to the end. And the high frequency would mean that not only did you mispredict more often, but it was also easier to keep the entire core busy with four threads in flight, preventing a complete start from scratch upon a pipeline flush. SMT4 would've been the perfect companion to Tejas. And a hypothetical 15 GHz P4 could even properly utilize SMT8!
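For anyone who wants to see how expensive those flushes are on a modern pipeline, here is a minimal sketch (Linux and GCC assumed; the P4's far longer pipeline made each flush much worse than on today's cores). The same loop runs over the same bytes; only the predictability of the branch changes. One caveat: if the compiler if-converts the branch into a conditional move, the gap disappears; GCC's -fno-if-conversion can help restore the branch.

/* branch_demo.c - cost of branch mispredictions (gcc -O2 branch_demo.c) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M bytes */

static long sum_if_big(const unsigned char *v) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        if (v[i] >= 128)            /* random data: ~50% mispredict rate;   */
            sum += v[i];            /* sorted data: near-perfect prediction */
    return sum;
}

static int cmp(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static double timed(const unsigned char *v, long *out) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *out = sum_if_big(v);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    unsigned char *v = malloc(N);
    long s1, s2;
    for (size_t i = 0; i < N; i++) v[i] = rand() & 0xFF;

    double t_rand = timed(v, &s1);
    qsort(v, N, 1, cmp);            /* same bytes, now in predictable order */
    double t_sort = timed(v, &s2);

    printf("random %.3fs, sorted %.3fs (sums %ld == %ld)\n",
           t_rand, t_sort, s1, s2);
    free(v);
    return 0;
}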
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
P4 HT. Oh, what have you done to my fluttering heart!

My theory on why SMT4 would've been perfect for P4.

Its worst problem was branch mispredictions leading to pipeline stalls, right?

Well, if four threads are in various stages of execution, you can mispredict four times as much and still have at least one thread make it all the way to the end. And the high frequency would mean that not only did you mispredict more often, but it was also easier to keep the entire core busy with four threads in flight, preventing a complete start from scratch upon a pipeline flush.
The Pentium 4 did not have enough execution resources to take full advantage of SMT, never mind SMT4, which requires IBM's level of focus to get an advantage from it.

It's a 1-wide core helped by a Trace Cache that was nowhere near big enough to make up for the lack of issue width, with double-pumped but simple ALUs that couldn't run all instructions.

It isn't high frequency itself that causes the problem, but the steps taken to reach that frequency, such as simpler units and extra pipeline stages. The Pentium 4 was so focused on clock speed that it had a drive stage just to propagate signals farther, and a replay feature that tried to make up for misprediction losses but bloated the core and further starved its already scarce resources.

When HT was reintroduced in Nehalem, most of the losses in lightly threaded applications that existed in the P4 generations went to zero.
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
I'll have to agree to disagree on your final point, though. SMT provides excellent PPA. Maximizing ST performance over MT is only important in a few high-performance desktop applications and gaming. For gaming, lower-latency memory architectures are proving to be far more important than wider and deeper CPU designs.
You forget that SMT adds validation complexity. That increased difficulty not only causes potential delays, but also leaves less focus for other parts of the core.

If you need an extra month per generation, then over 10 generations that's nearly a year of delay. Never mind increasingly sophisticated attacks and vulnerabilities, which would further worsen this.

The Atom team was able to add out-of-order execution, along with other improvements, at the same size, and they directly said that this came from not having SMT as the previous generation did. So they went from a ~30% gain in occasional multi-threaded applications to a ~50% improvement in everything.