Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
216
238
76
PantherLake.png

LNL.png

With Hot Chips 34 starting this week, Intel will unveil technical information about the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile built on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.

MzSCXS6wm7kBMF9nXvBGRY.png

unmSFahCFp39WUfEjyuk7a.jpg

As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)
 

AMDK11

Member
Jul 15, 2019
122
78
101
RedwoodCove Core:
Larger instruction cache: 64KB 16-way (Golden/RaptorCove L1-I 32KB 8-way), improved code prefetching, smarter code prefetching engine, improved branch prediction accuracy and reduced prediction miss penalty, microqueue options 192 for ST or 2x96 for SMT (Golden/RaptorCove 144 ST/2x72 SMT), better performance, lower FPU instruction latency up to 3 cycles instead of 4-5 (+25-40 more instructions), AMX supports TF32 and FP16, new L2 design, more preferred LLC (L3) prefetch engine and higher throughput.
 

Hulk

Diamond Member
Oct 9, 1999
3,913
1,587
136
Could GoldenCove be so close to perfectly exploiting the limits of 6-way x86 decoding? Even Zen 3 and Zen 4 with 4-way x86 decoding have largely comparable capabilities to GoldenCove. According to rumors, Zen 5 will also gain 6-way x86 decoding. Isn't it possible that RedwoodCove, with the already known improvements, can gain even about 15% higher IPC?

Intel has a history (since Conroe) of improving either the front end or back end and then on the next architecture opening up the other end.

WARNING! Wild speculation ahead...

While there is obviously more to architecture than decoders and execution ports, they are a good indication of what Intel is thinking as to balancing core throughput. Increases/improvements in other structures must also occur to keep the front/back end "fed."

Haswell/Broadwell increased execution ports from 6 to 8.
Skylake added a decoder, going from 4 to 5, thus opening up the front end.
Sunny/Willow Cove added two more execution ports, increasing the total to 10.
Golden Cove opened up both ends, adding two exe ports for a total of 12 and another decoder to bring that total to 6.

The fact that Intel increased the number of both execution ports and decoders in one architecture could mean that they feel as though the architecture is balanced, meaning meaningful gains will only come from making the core significantly wider or making some as yet unknown enhancements.

Or the core is plenty wide but currently held back by other structures such as buffers, registers, prefetch logic, micro-op cache, cache sizes, etc... or a combination thus fully utilizing the current Golden Cove front/back end.
 

Geddagod

Senior member
Dec 28, 2021
877
695
96
RedwoodCove Core:
Larger instruction cache: 64KB 16-way (Golden/RaptorCove L1-I 32KB 8-way), improved code prefetching, smarter code prefetching engine, improved branch prediction accuracy and reduced prediction miss penalty, microqueue options 192 for ST or 2x96 for SMT (Golden/RaptorCove 144 ST/2x72 SMT), better performance, lower FPU instruction latency up to 3 cycles instead of 4-5 (+25-40 more instructions), AMX supports TF32 and FP16, new L2 design, more preferred LLC (L3) prefetch engine and higher throughput.
It appears GB6 didn't read your post, because it didn't see much of an IPC increase at all.
Larger instruction cache: 64KB 16-way (Golden/RaptorCove L1-I 32KB 8-way)
Prob the most interesting part of the new RWC arch
improved code prefetching, smarter code prefetching engine, improved branch prediction accuracy
I mean essentially every new microarch claims this lol. In fact, Raptor Cove claimed a "significantly optimized prefetching algorithm" as well.
microqueue options 192 for ST or 2x96 for SMT (Golden/RaptorCove 144 ST/2x72 SMT)
Do not think this is any major structure that drastically increases IPC. GLC vs SNC, for example, barely increased it at all.
Better performance
I mean I hope so? lol
lower FPU instruction latency up to 3 cycles instead of 4-5 (+25-40 more instructions)
Could be big in some specific workloads ig, but this is only for FP Mult IIRC.
AMX supports TF32 and FP16
That's server only
new L2 design
Not gonna bring much IPC
more preferred LLC (L3) prefetch engine and higher throughput.
L3 isn't considered part of the "core" IPC
Isn't it possible that RedwoodCove, with the already known improvements, can gain even about 15% higher IPC?
Just saw this edit, and it's possible, but we already have GB6 scores of RWC, and it doesn't look like it brings an IPC improvement.
 

AMDK11

Member
Jul 15, 2019
122
78
101
Haswell/Broadwell increased execution ports from 6 to 8.
Skylake added a decoder, going from 4 to 5, thus opening up the front end.
Sunny/Willow Cove added two more execution ports, increasing the total to 10.
Golden Cove opened up both ends, adding two exe ports for a total of 12 and another decoder to bring that total to 6.
Skylake and SunnyCove still have a 4-way decoder, and only GoldenCove got a 6-way one.
Intel claims that GoldenCove is a transition from 4-way to 6-way decoding.

There is no question of any 5-way decoding in SunnyCove or even Skylake. Many websites give incorrect data because Intel never specified it until GoldenCove.
 

AMDK11

Member
Jul 15, 2019
122
78
101
I mean essentially every new microarch claims this lol. In fact, Raptor Cove claimed a "significantly optimized prefetching algorithm" as well.
Isn't it related to the move from L2 1.25MB to 2MB?

EDIT:
Are we sure that GB6 shows the real and constant clock speed of RedwoodCove? Are there no bugs in the MeteorLake microcode, like before the launch of RocketLake, whose results were very poor? At the current level, what does GB6 tell us about RedwoodCove's IPC? Is it lower? Comparable to GoldenCove?
 

Geddagod

Senior member
Dec 28, 2021
877
695
96
Isn't it related to the move from L2 1.25MB to 2MB?
Yup.
Are we sure that GB6 shows the real and constant clock speed of RedwoodCove?
I hacked into GB6's mainframes, yes.
I mean what do you want me to say? Lol.
Are there no bugs in the MeteorLake microcode, like before the launch of RocketLake, whose results were very poor?
A microcode update did not magically fix rocket lake.
At the current level, what does GB6 tell us about RedwoodCove's IPC? Is it lower? Comparable to GoldenCove?
Essentially the same.

I'm just baffled at the people who are so dead set against believing RWC's low IPC gain. It makes complete sense for RWC to be essentially a ported GLC; Intel always ports an old arch over to a new node. One might say Intel is trying to change that design schedule recently, but RWC and MTL were both designed before those changes started happening (as RWC/MTL was originally supposed to launch all the way back in late 2021/2022). It's not a hard-to-believe rumor.
 

Hulk

Diamond Member
Oct 9, 1999
3,913
1,587
136
Skylake and SunnyCove still have a 4-way decoder, and only GoldenCove got a 6-way one.
Intel claims that GoldenCove is a transition from 4-way to 6-way decoding.

There is no question of any 5-way decoding in SunnyCove or even Skylake. Many websites give incorrect data because Intel never specified it until GoldenCove.
Did Anandtech specify Skylake decoders as 4+1 back in the original review? I think that's where I got that idea.
 

AMDK11

Member
Jul 15, 2019
122
78
101
A microcode update did not magically fix rocket lake.
It fixed it. I remember screenshots from RocketLake's AIDA64 test before the pre-release patch showed that CypressCove's 8 cores were even slower than Skylake's 6 cores, which was absurd. After the release of the microcode patch, subsequent leaks from the AIDA64 test more or less showed what CypressCove ultimately was and a corresponding increase in IPC.


"Did Anandtech specify Skylake decoders as 4+1 back in the original review? I think that's where I got that idea."

No wonder, because diagrams of the GoldenCove core are still circulating on the Internet showing 1+6 decoding, which is not true :D Intel dispelled the doubts already at the premiere, when it provided details about GoldenCove.

SkyLake 4-Way decode x86
Sunny/CypressCove 4-Way decode x86
GoldenCove 6-Way decode x86
 

Henry swagger

Senior member
Feb 9, 2022
285
181
76
It fixed it. I remember screenshots from RocketLake's AIDA64 test before the pre-release patch showed that CypressCove's 8 cores were even slower than Skylake's 6 cores, which was absurd. After the release of the microcode patch, subsequent leaks from the AIDA64 test more or less showed what CypressCove ultimately was and a corresponding increase in IPC.


"Did Anandtech specify Skylake decoders as 4+1 back in the original review? I think that's where I got that idea."

No wonder, because diagrams of the GoldenCove core are still circulating on the Internet showing 1+6 decoding, which is not true :D Intel dispelled the doubts already at the premiere, when it provided details about GoldenCove.

SkyLake 4-Way decode x86
Sunny/CypressCove 4-Way decode x86
GoldenCove 6-Way decode x86
Golden Cove has 5+1 decode; the extra 1 is a complex decoder.
 

Abwx

Lifer
Apr 2, 2011
10,405
2,945
136
Moving from 3-way to 4-way x86 decoding represents a 33% increase. Intel had 4-way decoding from Conroe (Core 2 - 2006) to SunnyCove (CypressCove - 2021), and that's about 15 years of development and a very large increase in IPC. Going from 4-way to 6-way x86 decoding in GoldenCove represents a 50% increase. Why can't we assume that a lot of IPC can be achieved with 6-way decoding?

Increasing the decoding bandwidth in isolation will change nothing in a CPU's IPC throughput; it's the whole pipeline that must be enhanced accordingly.

For instance, games typically yield barely 1 IPC, which is well below the decoders' capabilities, so for such apps the limitations are not in the decoder.
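
The arithmetic in this exchange can be sketched in a few lines of Python. The decode widths come from the quoted post and the roughly 1 IPC figure for games from the reply; the function names are made up for illustration, and treating one instruction as one decode slot per cycle is a simplifying assumption that ignores the uop cache, fusion and stalls.

```python
def width_increase_pct(old_width: int, new_width: int) -> float:
    """Relative growth in decode width, in percent."""
    return (new_width - old_width) / old_width * 100.0


def decode_utilization(sustained_ipc: float, decode_width: int) -> float:
    """Fraction of decode slots a workload would use at a given IPC.

    Assumes one instruction occupies one decode slot per cycle
    (a deliberate simplification, not a model of a real front end).
    """
    return sustained_ipc / decode_width


if __name__ == "__main__":
    print(f"3-way -> 4-way: +{width_increase_pct(3, 4):.0f}%")
    print(f"4-way -> 6-way: +{width_increase_pct(4, 6):.0f}%")
    # A game sustaining roughly 1 IPC on a 6-wide decoder
    print(f"Decode slots used at ~1 IPC: {decode_utilization(1.0, 6):.0%}")
```

Under these assumptions a workload at about 1 IPC would leave most of a 6-wide decoder idle, which is the point being made: the limit for such code sits elsewhere in the pipeline.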
 

Geddagod

Senior member
Dec 28, 2021
877
695
96
It fixed it. I remember screenshots from RocketLake's AIDA64 test before the pre-release patch showed that CypressCove's 8 cores were even slower than Skylake's 6 cores, which was absurd. After the release of the microcode patch, subsequent leaks from the AIDA64 test more or less showed what CypressCove ultimately was and a corresponding increase in IPC.
The hopium of the microcode update for RKL was improving the gaming performance of the chip, which it really didn't, at least not by any significant amount. People wanted it to at least be marginally ahead of comet lake in gaming perf, which ye, it didn't fix. RKL was still a mediocre product both before and after any microcode patches.

But if you want to believe another microcode update will bring RWC IPC 15% higher than GLC in GB6, despite there being no evidence or even logical speculation suggesting it to be so, go ahead, be my guest. I don't have a horse in this race lol, and who knows maybe you're right. We will see in literally 2 days ¯\_(ツ)_/¯
 

AMDK11

Member
Jul 15, 2019
122
78
101
Do not think this is any major structure that drastically increases IPC. GLC vs SNC, for example, barely increased it at all.
"microqueue options 192 for ST or 2x96 for SMT (Golden/RaptorCove 144 ST/2x72 SMT)"

uop Queue:
Conroe
Single Thread 7+

Nehalem
Single Thread 28
SMT 2x 28

SandyBridge
Single Thread 28
SMT 2x 28

Haswell
Single Thread 56
SMT 2x 56

Skylake
Single Thread 64 (+14.3%)
SMT 2x 64 (+14.3%)

SunnyCove
Single Thread 70 (+9.3%)
SMT 2x 70 (+9.3%)

GoldenCove
Single Thread 144 (+105%)!
SMT 2x 72 (+2.8%)

RedwoodCove
SingleThread 192 (+33%)
SMT 2x 96 (+33%)
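
For anyone who wants to recheck the generation-over-generation percentages, here is a minimal Python sketch. It uses only the single-thread queue sizes listed in the post above; Conroe is left out because the post gives it only as "7+", and the helper name is hypothetical. Rounded to one decimal, the results can differ by a tenth from the figures in the post, which look truncated rather than rounded.

```python
# Single-thread uop queue entries as listed in the post above.
# Conroe is omitted because the post gives it only as "7+".
UOP_QUEUE_ST = [
    ("Nehalem", 28),
    ("SandyBridge", 28),
    ("Haswell", 56),
    ("Skylake", 64),
    ("SunnyCove", 70),
    ("GoldenCove", 144),
    ("RedwoodCove", 192),
]


def generational_growth(entries):
    """Return (name, size, percent change vs. the previous entry) tuples."""
    rows = []
    previous = None
    for name, size in entries:
        change = None if previous is None else (size - previous) / previous * 100.0
        rows.append((name, size, change))
        previous = size
    return rows


if __name__ == "__main__":
    for name, size, change in generational_growth(UOP_QUEUE_ST):
        delta = "n/a" if change is None else f"{change:+.1f}%"
        print(f"{name:<12} {size:>4} {delta:>8}")
```

The same function can be fed the per-thread SMT sizes from the post to compare those separately.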
 

AMDK11

Member
Jul 15, 2019
122
78
101
Golden cove has 5+1 decode the extra 1 is a complex decoder
I wrote in general terms, without going into details, that Golden has a 6-way x86 decoder. In the GoldenCove diagram, Intel presents the decoder as 6-way, and it is difficult to determine whether it is 6 simple or 6 complex. But maybe you're right and it's 1+5, just like in previous generations.
 

AMDK11

Member
Jul 15, 2019
122
78
101
Increasing the decoding bandwidth in isolation will change nothing in a CPU's IPC throughput; it's the whole pipeline that must be enhanced accordingly.

For instance, games typically yield barely 1 IPC, which is well below the decoders' capabilities, so for such apps the limitations are not in the decoder.
I used a bit of a shortcut and approached it in general terms without being specific. I thought it was obvious that expanding the core with a wider decoder would mean changes to the rest of the core logic. As for me, I am perfectly aware of this.
 

AMDK11

Member
Jul 15, 2019
122
78
101
The hopium of the microcode update for RKL was improving the gaming performance of the chip, which it really didn't, at least not by any significant amount. People wanted it to at least be marginally ahead of comet lake in gaming perf, which ye, it didn't fix. RKL was still a mediocre product both before and after any microcode patches.

But if you want to believe another microcode update will bring RWC IPC 15% higher than GLC in GB6, despite there being no evidence or even logical speculation suggesting it to be so, go ahead, be my guest. I don't have a horse in this race lol, and who knows maybe you're right. We will see in literally 2 days ¯\_(ツ)_/¯
Games in general seem to be sensitive to the cache subsystem, RAM controller, etc. In this respect, RocketLake did not shine even though the L1 subsystem is very good and achieves very high throughput. Moreover, there are fewer cores, so in this respect it performs poorly. However, CypressCove itself brings a significant increase in IPC, and I was more concerned about the leaks from AIDA64, in which RocketLake still had unfinished microcode and the individual subscores for 8 Cypress cores were weaker than for 6 Skylake cores, which should not have been the case. The official production microcode has already fixed this. What about games? Unfortunately, the 14nm process, smaller L3 cache and the trade-offs of much larger cores have made it perform unevenly, as well as or worse than its predecessor, in my opinion.
 

JoeRambo

Golden Member
Jun 13, 2013
1,767
1,989
136
Did Anandtech specify Skylake decoders as 4+1 back in the original review? I think that's where I got that idea.

With all respect to the glorious AT past, when they were more than marketing-material parrots, they mixed up two things:

Haswell was able to output 4 uOps per cycle from 4 decoders, and Skylake increased the output to "more than 5" uOps per cycle but kept the same 4 decoders, except AT misunderstood this, I guess?

The elephant in the room with instruction decoders is instruction fetch, which on x86 is no joke, as that's where predecode happens and instruction boundaries are found. Up to GLC it was 16 bytes, so even if the core had 10 complex decoders, they would have sat completely idle due to having just 16 bytes to work with.
Now with 32 bytes in GLC, we are back to having room for decode expansion, but even then it is more nuanced, as we allow just 5.33 bytes per instruction on average, probably already nearing reality.

And as others mentioned, all this focus on decode is not really relevant as long as decode is as wide as the other front-end parts => so after "flush/mispredict blabla" situations you don't bottleneck on decode and fill at those 6 uOps per cycle.
Where performance will really matter -> in some tight loop or hot code section, the LSD will handle small loops and the uOp cache will help with others.
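
The 5.33 bytes figure is just the fetch window divided by the decode width. A rough Python sketch of that budget follows; the function names and the 4-byte average instruction length in the second example are assumptions for illustration, not measurements.

```python
def bytes_per_instruction(fetch_bytes_per_cycle: int, decode_width: int) -> float:
    """Average instruction length (bytes) at which fetch exactly feeds decode."""
    return fetch_bytes_per_cycle / decode_width


def max_decodable(fetch_bytes_per_cycle: int, decode_width: int,
                  avg_instr_bytes: float) -> float:
    """Instructions per cycle the front end could decode from the fetch window.

    Limited by whichever is smaller: the number of decoders or the number of
    average-length instructions that fit in the fetch window.
    """
    return min(decode_width, fetch_bytes_per_cycle / avg_instr_bytes)


if __name__ == "__main__":
    # 32-byte fetch with 6 decoders (the GLC case from the post)
    print(f"{bytes_per_instruction(32, 6):.2f} bytes per instruction")
    # 16-byte fetch with a hypothetical 10 decoders and an assumed 4-byte
    # average instruction: fetch, not decode, sets the limit
    print(f"{max_decodable(16, 10, 4.0):.1f} instructions per cycle")
```

With a 16-byte window the extra decoders in the hypothetical 10-wide case never see work, which matches the argument above that fetch has to widen before decode can.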
 

JoeRambo

Golden Member
Jun 13, 2013
1,767
1,989
136
Unfortunately, the 14nm process, smaller L3 cache and the trade-offs of much larger cores have made it perform unevenly, as well as or worse than its predecessor, in my opinion.

The real reason early samples performed so badly was the memory subsystem taken from a mobile CPU. They ran DDR4 in the Gear 2 mode designed for LPDDR4 in mobile. So a disaster from a latency PoV. Early GB4 leaks looked like the core had potential in subtasks that were not touching memory, but were horrible where it was.
Intel somewhat fixed the 11900K by letting it run in Gear 1 mode, but the rest of the lineup was still bad:


Perfectly balanced steaming pile. Bad at stock, and for enthusiasts it was not able to run mem 1:1 above ~3600 speeds and was beaten by Comet Lake's 4000CL15 or so with ease.

Remember, performance nowadays is mostly about the memory subsystem, even taking the same architecture with the same memory speed:

2MB of L2 vs 1.25MB of L2 and a fixed L3 cache and uncore does that.

For Meteor Lake the jury is still out; the leaks we see are generated on some lovely LPDDR5 6400CL666 with 200ns of latency, so they might not indicate what the core could do on desktop or in the hands of enthusiasts (but obviously I don't expect more than 5% IPC from the changes that were revealed).
 

Hulk

Diamond Member
Oct 9, 1999
3,913
1,587
136
Haswell was able to output 4 uOps per cycle from 4 decoders, and Skylake increased the output to "more than 5" uOps per cycle but kept the same 4 decoders, except AT misunderstood this, I guess?
I can't remember where I read this, but I had read that the only difference between Haswell and Skylake regarding decoders is that Skylake can decode some patterns that Haswell cannot. Is that info legit?
 

JoeRambo

Golden Member
Jun 13, 2013
1,767
1,989
136
I can't remember where I read this, but I had read that the only difference between Haswell and Skylake regarding decoders is that Skylake can decode some patterns that Haswell cannot. Is that info legit?

Both have 3 + 1 decoders, 3 simple and 1 complex. Complex ones are able to decode instructions that result in more than 1 uOp (it's more complex than this), so with all 4 working in a cycle there can be more than 4 uOps from the decoders in a cycle => bottlenecked. Skylake relaxed this by allowing all work from the decoders to be forwarded to the uOp queue / LSD.
It's a minor change, but performance increases are hard to come by :) Of historic note, a much bigger deal for performance was the fact that Intel had to disable the LSD in Skylake due to bugs.
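
To make the bottleneck concrete, here is a toy Python model of a 3 simple + 1 complex decoder group with a per-cycle cap on uOps forwarded to the queue. It is only a sketch: the function name is hypothetical, the complex decoder's limit of 4 uOps is an assumption, and the caps of 4 and 5 are illustrative parameters rather than confirmed figures for any specific core.

```python
def cycles_to_decode(uop_counts, delivery_cap, n_decoders=4, complex_max_uops=4):
    """Count cycles a toy 1-complex + (n-1)-simple decoder group needs.

    uop_counts: uOps produced by each instruction, in program order.
    delivery_cap: max uOps forwarded to the uOp queue per cycle (assumed).
    Slot 0 is the complex decoder; all other slots handle 1-uOp instructions.
    """
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        start = i
        delivered = 0
        for slot in range(n_decoders):
            if i >= len(uop_counts):
                break
            uops = uop_counts[i]
            slot_limit = complex_max_uops if slot == 0 else 1
            if uops > slot_limit or delivered + uops > delivery_cap:
                break  # this instruction waits for the next cycle
            delivered += uops
            i += 1
        if i == start:
            raise ValueError("instruction cannot be decoded in this model")
    return cycles


if __name__ == "__main__":
    # Repeating pattern: one 2-uOp instruction followed by three 1-uOp ones
    stream = [2, 1, 1, 1] * 100
    print("cap 4:", cycles_to_decode(stream, delivery_cap=4), "cycles")
    print("cap 5:", cycles_to_decode(stream, delivery_cap=5), "cycles")
```

In this toy model the example stream needs twice as many cycles with a cap of 4 as with a cap of 5, because the fourth instruction of each group keeps spilling into the next cycle. Real decoders have more rules than this, as the post itself notes.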
 

DrMrLordX

Lifer
Apr 27, 2000
21,096
10,264
136
Remember, performance nowadays is mostly about the memory subsystem, even taking the same architecture with the same memory speed:

Starfield is a weird outlier, similar to other Gamebryo/Creation Engine games from Bethesda Softworks. We haven't seen that kind of performance delta between Alder Lake and Raptor Lake in other titles.
 

JoeRambo

Golden Member
Jun 13, 2013
1,767
1,989
136
We haven't seen that kind of performance delta between Alder Lake and Raptor Lake in other titles.

It was found that the problem was partially caused by Windows messing up scheduling and using HT siblings on the 12900K and not on Raptor. So the perf delta between them shrank. And team "I disable HT since Comet Lake" rejoiced once more.
 

FangBLade

Member
Apr 13, 2022
190
373
96
Are there any more detailed pieces of information about the A.I. accelerator? And how does it compare to Phoenix, which the competition is using? I'm sure both companies have been working with Windows 12 in mind, which would require this for full functionality.
 

JoeRambo

Golden Member
Jun 13, 2013
1,767
1,989
136
14900KF is 15% faster than 7950X in ST & 20% faster in MT.

Caveat emptor warning: the test was done on WSL2 with the Linux version of GB6, on a system that ran flat out at 6 GHz with who knows what else tuned. Stock perf might not be representative at all. Customer beware.
The same performance can be achieved today with a 13900KS with manual tuning; it's not really news that such a system is beating the 7950X. The question is what voltage, cooling, chip binning and tuning investment such a chip will require.
IF each 14900KF hits 6 GHz with ease, then awesome.
 
