Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
782
750
106
[Attached slides: PPT1.jpg, PPT2.jpg, PPT3.jpg]

With Hot Chips 34 starting this week, Intel will unveil technical information about the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new generation of platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which is Intel's first to use EUV lithography. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, called RibbonFET.



[Attached: LNL-MX.png]

Intel Core Ultra 100 - Meteor Lake

[Official slide: INTEL-CORE-100-ULTRA-METEOR-LAKE-OFFCIAL-SLIDE-2.jpg]

As mentioned by Tomshardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)



[Attached: Clockspeed.png]
 

Attachments

  • PantherLake.png (283.5 KB)
  • LNL.png (881.8 KB)

gdansk

Diamond Member
Feb 8, 2011
4,330
7,255
136
I see no problem with Intel's client roadmap; to be honest, it's better than AMD's, especially for laptops.
AMD's client roadmap is especially lazy. But so far, it seems likely they'll execute it. With Intel, I do wonder if more will be left on the cutting room floor with these layoffs and LBT's promise to reduce the number of SKUs.
 

511

Diamond Member
Jul 12, 2024
3,240
3,176
106
I see Panther Lake being even better than ARL-H for battery-related tasks. Can't wait; not everything is about IPC.

The fact that PTL doesn't use Arrow Lake's awful uncore and is similar to Lunar Lake is a massive plus.
Its iGPU is damn well better as well, but 12 Xe3 cores are going to starve on the limited bandwidth of a 128-bit LPDDR5X-8533 memory controller.
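For a rough sense of that ceiling, the peak bandwidth of a 128-bit LPDDR5X-8533 interface works out to the following back-of-the-envelope figure (ignoring efficiency losses, and assuming the 8533 refers to LPDDR5X-8533 MT/s):

$8533\ \text{MT/s} \times \frac{128\ \text{bit}}{8\ \text{bit/byte}} \approx 136.5\ \text{GB/s}$

That is what the 12 Xe3 cores and the CPU cores would all have to share.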
 

511

Diamond Member
Jul 12, 2024
3,240
3,176
106
When will the first PTL GB listing arrive? It's strange it hasn't happened by now, unless I missed it.
I am expecting jaykhin to publish data. He has preliminary data, and I asked if he could share it, but he said the data is not finalized, so he can't share it yet; he is waiting for concrete numbers. So we can expect the leaks somewhere around the end of August or September.
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
I was honestly going to suggest the same thing. If you're going to do it, go all the way with it and support SMT4 or even SMT8 like IBM. The additional cost to "go bigger" is less than the cost to do it at all.
That's because IBM actually puts a lot of effort into gaining more from SMT, unlike AMD and Intel. AMD's and Intel's implementations are basically barebones, adding barely 5% transistors to a core, never mind the whole chip.

IBM added 25% extra transistors to the core just for SMT, and those are careful, targeted improvements that require a lot of thought and planning. They said that without those improvements the gains would be a fraction of what they are: without them the average uplift is really only about 15%, whereas the targeted work raised it to ~40%.

IBM's chips are very different, though. They are sold in big iron with high margins, so IBM can afford to do that, plus very fancy things such as MCM packaging and very large eDRAM caches even 10 years ago. All of it is optimized for enterprise performance.
 

MS_AT

Senior member
Jul 15, 2024
776
1,561
96
That's because IBM actually puts a lot of effort into gaining more from SMT, unlike AMD and Intel. AMD's and Intel's implementations are basically barebones, adding barely 5% transistors to a core, never mind the whole chip.

IBM added 25% extra transistors to the core just for SMT, and those are careful, targeted improvements.
Do you have a link to an article or a series of articles that explains what these improvements were? Do you know if they were motivated by the specific workloads IBM is facing?
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
Do you have a link to an article or a series of articles that explains what these improvements were? Do you know if they were motivated by the specific workloads IBM is facing?
I updated my post.

Yes, they are purely sold in big iron, where they have very high margins. Everything they do is entirely optimized for enterprise. It's not realistic for a PC chip that spans from a 5 W tablet to a 250 W, 6 GHz enthusiast desktop.
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
For some reason, AMD gets a much bigger boost from SMT than Intel... which is ironic, since Intel introduced it to x86 long before AMD got it.

SMT4 for Intel, anyone?
It is minor differences in the SMT implementations that caused the difference originally. If you compare the Sandy Bridge generation, it has a lot more shared resources, versus Ryzen's version of SMT, where more resources are duplicated or statically partitioned per thread. Intel shared more because it wanted 2-3% better ST performance.

Intel's P-core architectures also share ports between execution units, whereas AMD goes for a distributed scheduler approach, which reduces contention under SMT, where more sharing is going on.

Chips and Cheese's analysis that Zen 5's clustered decoder is essentially a return to Bulldozer's CMT is not entirely off the mark. Bringing parts of CMT into the Zen architecture probably eased development and reduced execution risk, hence why the decoders don't combine for a single thread the way they do in post-Tremont Atom cores.

In fact, there were rumors that somewhere in the Skylake timeframe Intel would have had its own CMT architecture, but a much wider one, so the ST performance would actually be good.

An SMT focus is the wrong idea, especially because it increases validation time and risk in every generation that has it, and in the long term that is a loss. But CEOs are bound by other things, such as keeping face.
 

OneEng2

Senior member
Sep 19, 2022
725
974
106
It is minor differences in the SMT implementations that caused the difference originally. If you compare the Sandy Bridge generation, it has a lot more shared resources, versus Ryzen's version of SMT, where more resources are duplicated or statically partitioned per thread. Intel shared more because it wanted 2-3% better ST performance.

Intel's P-core architectures also share ports between execution units, whereas AMD goes for a distributed scheduler approach, which reduces contention under SMT, where more sharing is going on.

Chips and Cheese's analysis that Zen 5's clustered decoder is essentially a return to Bulldozer's CMT is not entirely off the mark. Bringing parts of CMT into the Zen architecture probably eased development and reduced execution risk, hence why the decoders don't combine for a single thread the way they do in post-Tremont Atom cores.

In fact, there were rumors that somewhere in the Skylake timeframe Intel would have had its own CMT architecture, but a much wider one, so the ST performance would actually be good.

An SMT focus is the wrong idea, especially because it increases validation time and risk in every generation that has it, and in the long term that is a loss. But CEOs are bound by other things, such as keeping face.
Interesting perspective, that the more CMT-like behavior makes SMT more effective in AMD's designs.

I'll have to agree to disagree on your final point, though. SMT provides excellent PPA. Maximizing ST performance over MT is only important in a few high-performance desktop applications and gaming. For gaming, lower-latency memory architectures are proving to be far more important than wider and deeper CPU designs.

For financial success, DC is critical. Intel should adopt AMD's "server first" approach to its central CPU design and let strategies like 3D memory, or other tricks that lower memory latency, boost ST performance.

I will say that the LPE concept is a really good one for laptop designs where battery life is key. Intel still has some really good ideas; I am just not so sure that the big/little concept was one of them, or that dropping SMT was.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,370
2,991
136
I read somewhere that IBM's SMT focus was to identify every region in the core that had resource contention, other than the main execution pipelines that are explicitly pooled and shared, and try to statically duplicate or partition them to reduce potential execution delays. As above, Intel shares a lot, AMD has a high percentage dedicated to partitioning and duplication, and IBM takes it to the extreme. Their designs are rather wide already, so execution contention on the back-end is more about maximizing resource utilization anyway. Remember that, for some applications, mainframes will use thread duplication/mirroring to assure that there are zero execution time errors. The end result is that, for some systems, even though they may have 64 threads available in a processor complex, only 32 of them are unique. They HAVE to go very wide to maintain acceptable performance with that strategy.
 

reb0rn

Senior member
Dec 31, 2009
310
115
116
To be fair, maybe SMT only works for cloud loads where each user mostly runs non-optimized code... but for someone like me, who mostly uses one app to load 12+ cores, there is no benefit.
The same goes for home users on Windows, where most apps are very badly optimized; even with many cores now, I do not see a big benefit.

For the cloud, if it's used for VPS, then sure, more threads versus plain cores will mostly be a benefit, but not so for optimized loads such as encoding or AI.
 

MS_AT

Senior member
Jul 15, 2024
776
1,561
96
To be fair, maybe SMT only works for cloud loads where each user mostly runs non-optimized code... but for someone like me, who mostly uses one app to load 12+ cores, there is no benefit.
The same goes for home users on Windows, where most apps are very badly optimized; even with many cores now, I do not see a big benefit.

For the cloud, if it's used for VPS, then sure, more threads versus plain cores will mostly be a benefit, but not so for optimized loads such as encoding or AI.
Well, HT/SMT exists to find ways to ensure backend resources are not idling. If somebody writes code with full backend utilisation in mind and knows what he/she is doing, then of course HT won't provide a benefit for that task, but code tuned to this degree is rarely encountered in the wild. From a 1T workload's point of view, it's probably better to have a magical frontend like Apple's M cores, which gives you a consistently high backend utilisation ratio, but SMT looks like the next best thing.

At least one common workload where SMT provides benefits on high-core-count machines is code compilation, and I wouldn't call compilers badly optimised.
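A minimal sketch of how one might observe this on Linux, assuming GCC and that logical CPUs 0 and 1 are SMT siblings of the same physical core (check lscpu --extended; the sibling numbering is an assumption and varies by CPU). A latency-bound pointer chase leaves the backend mostly idle, so a second copy on the sibling thread costs almost nothing:

/* smt_probe.c - crude SMT throughput probe (Linux; gcc -O2 -pthread smt_probe.c)
 * Each thread does a serially dependent pointer chase, so the core's backend
 * is mostly idle waiting on loads. If CPUs 0 and 1 really are SMT siblings
 * and combined throughput still nearly doubles, SMT recovered idle resources. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)              /* 4M nodes * 8 B = 32 MB, well past L2 */
#define STEPS 50000000L

static size_t *chain;

static void *worker(void *arg) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(*(int *)arg, &set);  /* pin this thread to one logical CPU */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    volatile size_t idx = 0;
    for (long i = 0; i < STEPS; i++)
        idx = chain[idx];        /* each load depends on the previous one */
    return NULL;
}

static double run(int n, int *cpus) {
    struct timespec t0, t1;
    pthread_t th[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) pthread_create(&th[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < n; i++) pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    chain = malloc(N * sizeof *chain);
    for (size_t i = 0; i < N; i++)            /* full-period LCG permutation, */
        chain[i] = (i * 2654435761u + 1) % N; /* so the prefetcher can't help */
    int one[] = {0}, two[] = {0, 1};   /* ASSUMPTION: 0 and 1 are siblings */
    double ta = run(1, one), tb = run(2, two);
    printf("1 thread %.2fs, 2 SMT threads %.2fs -> %.0f%% more throughput\n",
           ta, tb, 100.0 * (2.0 * ta / tb - 1.0));
    free(chain);
    return 0;
}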
 

OneEng2

Senior member
Sep 19, 2022
725
974
106
Well, HT/SMT exists to find ways to ensure backend resources are not idling. If somebody writes code with full backend utilisation in mind and knows what he/she is doing, then of course HT won't provide a benefit for that task, but code tuned to this degree is rarely encountered in the wild. From a 1T workload's point of view, it's probably better to have a magical frontend like Apple's M cores, which gives you a consistently high backend utilisation ratio, but SMT looks like the next best thing.

At least one common workload where SMT provides benefits on high-core-count machines is code compilation, and I wouldn't call compilers badly optimised.
Exactly this. Once superscalar designs hit the market, there were all these execution engines just sitting around most of the time, waiting for those moments of maximum instruction-level parallelism to come along and use them.

For the other 90% of the time, they would be left unused.

SMT took care of this by letting another thread use them when they would otherwise sit idle.
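A tiny single-threaded illustration of those idle engines (a sketch; exact ratios vary by core): both loops below execute the same number of multiply-adds, but the first is one serial dependency chain, while the second keeps four independent chains in flight for the backend to overlap.

/* ilp_demo.c - same op count, different instruction-level parallelism
 * (gcc -O2 ilp_demo.c; optimization keeps the accumulators in registers) */
#include <stdio.h>
#include <time.h>

#define ITERS 1000000000L

static double secs(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1, t2;
    volatile long sink;

    long a = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        a = a * 3 + 1;            /* every op waits on the previous result */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = a;

    long w = 1, x = 2, y = 3, z = 4;
    for (long i = 0; i < ITERS / 4; i++) {
        w = w * 3 + 1;            /* four independent chains: the core can */
        x = x * 3 + 1;            /* overlap them on separate execution    */
        y = y * 3 + 1;            /* units in the same cycle               */
        z = z * 3 + 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);
    sink = w + x + y + z;
    (void)sink;

    printf("serial chain:    %.2fs\n", secs(t0, t1));
    printf("4 indep. chains: %.2fs (same total op count)\n", secs(t1, t2));
    return 0;
}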
 

Thunder 57

Diamond Member
Aug 19, 2007
3,835
6,478
136
Thanks!

I have been using a very old estimate of 15% added size to the core. Zen 4's and Zen 5's SMT adds only 5% to the core.

Pretty darned good PPA, getting 1.3-1.4x the performance from a 5% die-space increase!

Not sure WHAT Intel was thinking removing this?

That's been the estimate since the beginning.

At the same time, we think that the 32% boost we have seen is probably the upper limit for multithreaded applications with the current implementation of Hyperthreading. A while ago, we found out that a second CPU (Athlon MP in this case) can push Kribi performance up to 81% higher, quite a bit higher. But it must be said that Hyperthreading's 20-32% boost is incredibly high, considering that it cost only 5% extra die space.

Source
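Spelling out the PPA arithmetic behind the figures quoted above (a 1.3-1.4x MT uplift for ~5% extra core area):

$1.30 / 1.05 \approx 1.24 \quad\text{and}\quad 1.40 / 1.05 \approx 1.33$

So even after charging SMT for its die area, the core delivers roughly 24-33% more multithreaded throughput per unit of area.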
 
Jul 27, 2020
26,461
18,190
146
P4 HT. Oh, what have you done to my fluttering heart!

My theory on why SMT4 would've been perfect for P4.

Its worst problem was branch mispredictions leading to pipeline stalls, right?

Well, if four threads are in various stages of execution, you can mispredict four times as much and still have at least one thread make it all the way to the end. And the high frequency would mean that not only did you mispredict more often, but it was also easier to keep the entire core busy with four threads in flight, preventing a complete start from scratch upon a pipeline flush. SMT4 would've been the perfect companion to Tejas. And a hypothetical 15 GHz P4 could even properly utilize SMT8!
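For anyone who wants to see how expensive those flushes are on a modern pipeline, here is a minimal sketch (Linux and GCC assumed; the P4's far longer pipeline made each flush much worse than on today's cores). The same loop runs over the same bytes; only the predictability of the branch changes. One caveat: if the compiler if-converts the branch into a conditional move, the gap disappears; GCC's -fno-if-conversion can help restore the branch.

/* branch_demo.c - cost of branch mispredictions (gcc -O2 branch_demo.c) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M bytes */

static long sum_if_big(const unsigned char *v) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        if (v[i] >= 128)            /* random data: ~50% mispredict rate;   */
            sum += v[i];            /* sorted data: near-perfect prediction */
    return sum;
}

static int cmp(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static double timed(const unsigned char *v, long *out) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *out = sum_if_big(v);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    unsigned char *v = malloc(N);
    long s1, s2;
    for (size_t i = 0; i < N; i++) v[i] = rand() & 0xFF;

    double t_rand = timed(v, &s1);
    qsort(v, N, 1, cmp);            /* same bytes, now in predictable order */
    double t_sort = timed(v, &s2);

    printf("random %.3fs, sorted %.3fs (sums %ld == %ld)\n",
           t_rand, t_sort, s1, s2);
    free(v);
    return 0;
}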
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
P4 HT. Oh, what have you done to my fluttering heart!

My theory on why SMT4 would've been perfect for P4.

Its worst problem was branch mispredictions leading to pipeline stalls, right?

Well, if four threads are in various stages of execution, you can mispredict four times as much and still have at least one thread make it all the way to the end. And the high frequency would mean that not only did you mispredict more often, but it was also easier to keep the entire core busy with four threads in flight, preventing a complete start from scratch upon a pipeline flush.
The Pentium 4 did not have enough execution resources to take full advantage of SMT, never mind SMT4, which requires IBM's level of focus to get an advantage from it.

It's a 1-wide core helped by a Trace Cache that was nowhere near big enough to make up for the lack of issue width, with double-pumped but simple ALUs that couldn't run all instructions.

It isn't high frequency itself that causes the problem, but the steps taken to reach that frequency, such as simpler units and extra pipeline stages. The Pentium 4 was so focused on clock speed that it had a drive stage just to propagate signals farther, and a replay feature that tried to make up for misprediction losses but bloated the core and further starved its already scarce resources.

When HT was reintroduced in Nehalem, most of the losses in lightly threaded applications that existed in the P4 generations went to zero.
 

DavidC1

Golden Member
Dec 29, 2023
1,683
2,771
96
I'll have to agree to disagree on your final point, though. SMT provides excellent PPA. Maximizing ST performance over MT is only important in a few high-performance desktop applications and gaming. For gaming, lower-latency memory architectures are proving to be far more important than wider and deeper CPU designs.
You forget that SMT adds validation complexity. That increased difficulty not only causes potential delays, but also leaves less focus for other parts of the core.

If you need an extra month per generation, then over 10 generations that's nearly a year of delay. Never mind increasingly sophisticated attacks and vulnerabilities, which would further worsen this.

The Atom team was able to add out-of-order execution, along with other improvements, at the same size, and they directly said that this came from not having SMT as the previous generation did. So they went from a ~30% gain in occasional multi-threaded applications to a ~50% improvement in everything.