Discussion Intel current and future Lakes & Rapids thread


Geddagod

Golden Member
Dec 28, 2021
1,524
1,620
106
Sounds like Intel is still having challenges with Chiplets. EMR is said to contain just 2: https://www.semianalysis.com/p/intel-emerald-rapids-backtracks-on

I find this to be interesting. It would explain a LOT of the rumors of the troubles surrounding Sapphire Rapids prior to launch and also the current rumors with Meteor Lake difficulties.

I guess I'm just curious why Intel is having such a hard time doing something that AMD has seemingly done so easily.
Pretty sure what Intel is trying to pull off with chiplets is harder than what AMD is doing, since they need to maintain very low latency overheads to make the whole chip act like one giant monolithic die. That means they also have to work out the logistics of their giant mesh. IMO what AMD does is better, but some people in this forum disagree. Either way, I think Intel's method requires more design work.
I'm also pretty sure that some rumors were claiming SPR had trouble with getting their EMIB working properly.
EMR having just 2 chiplets was a bit shocking to me, I guess, but in the end Intel already makes massive dies for its monolithic SPR models too, so why not package them into one giant chip if they're able to, right? That's my guess about why Intel went for it... plus I may have underestimated how expensive EMIB is. Idk.
MTL is doing MCM in a different way than SPR did, but Intel is supposedly facing issues there too.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
but my understanding is that two units form and process the split data, and another portion of the data goes to the other two units
My understanding is that they use one 256b unit and perform the op over two cycles. Fusing together two units is what Intel has historically done.
my very likely dated understanding is that AMD's implementation is better than Intel's because they manage to keep clocks higher and temps lower when processing data that calls for AVX512 execution
Not really. Looking at the AVX execution in a vacuum, their solution is slower than the native width one Intel's been using. And there hasn't really been a clock penalty since Ice Lake years ago.

AMD's solution is area efficient, but it's not optimized for AVX speed or power.
What for? What the hell is the point? Up until Rocket Lake, the only time I'd heard of AVX512 was for machine learning, and you'd be better off with GPUs for that. So now all of a sudden every Jack, Terry, and Nigel wants AVX512 in their system
I do agree that it's being overhyped by a handful of people right now, coincidentally now that AMD has it and Intel does not, but AVX512 masking ops make it genuinely useful for a lot of things. As for AI, it's useful for low-latency, batch size == 1 inference that isn't worth dispatching to a GPU or other accelerator.
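
To make the masking point concrete, here's a minimal sketch (my own example, not from any AMD or Intel doc): the loop remainder is handled in-register with a mask instead of a scalar cleanup loop.

#include <immintrin.h>
#include <stdio.h>

/* Scale n floats by k. The tail (n % 16 elements) is handled with a
 * mask register: masked-off lanes are neither loaded nor stored, so
 * there is no fault past the end of the arrays and no scalar tail loop.
 * Needs AVX512F; compile with e.g. gcc -O2 -mavx512f. */
static void scale(float *dst, const float *src, size_t n, float k) {
    __m512 vk = _mm512_set1_ps(k);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(_mm512_loadu_ps(src + i), vk));
    if (i < n) {
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);  /* low (n-i) lanes on */
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);    /* masked lanes read as 0 */
        _mm512_mask_storeu_ps(dst + i, m, _mm512_mul_ps(v, vk));
    }
}

int main(void) {
    float in[19], out[19];
    for (int i = 0; i < 19; i++) in[i] = (float)i;
    scale(out, in, 19, 2.0f);
    printf("%g %g\n", out[0], out[18]); /* expect 0 and 36 */
    return 0;
}

Pre-AVX512 you'd write the same tail as a scalar loop or with hand-rolled blend tricks; the mask registers make the predication explicit and cheap.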
 

SiliconFly

Golden Member
Mar 10, 2023
1,924
1,284
106
Personally I believe him a lot, but regardless of my personal opinion, you should not compare these two guys. Tom is a computer hobbyist who took his hobby to a podcast, while Nenni held high positions in several semiconductor firms and took that to the forum/podcast as a consultant to semi firms.

The two have totally different intentions: MLID wants to reach a large audience, while Nenni wants to moderate the next semiconductor symposium. One gets paid through game-developer ads and Patreon donations, the other through semiconductor B2B ads.

Therefore, one of them has to make bold and outrageous claims to attract the masses, and the other has to say the things that attract industry insiders.

The difference is huge!
But MLID spreads too much fake news about NVIDIA & Intel, while spending way too much energy showcasing AMD in a good light, even when that falls flat on its face at times!

He bleeds for team red. MLID is too biased.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
My understanding is that they use one 256b unit and perform the op over two cycles. Fusing together two units is what Intel has historically done.

Not really. Looking at the AVX execution in a vacuum, their solution is slower than the native width one Intel's been using. And there hasn't really been a clock penalty since Ice Lake years ago.

AMD's solution is area efficient, but it's not optimized for AVX speed or power.

I do agree that it's being overhyped by a handful of people right now, coincidentally now that AMD has it and Intel does not, but AVX512 masking ops make it genuinely useful for a lot of things. As for AI, it's useful for low-latency, batch size == 1 inference that isn't worth dispatching to a GPU or other accelerator.

AVX512 has lane-crossing instructions, which pretty much make it impossible to just split a 512-bit instruction into two 256-bit ones. Zen4 does not do that either; it has full 512-bit vector registers, just like Intel. Zen3 had a totally different FPU, with separate 256-bit register files per FP pipeline, which is the reason it could not support AVX512. The older definition of a CPU's "bitness" was the size of its register files, and by that definition both Zen4's and Intel's AVX512 FPUs are fully 512-bit.

And fusing two units into an AVX-512 unit is pretty much undoable; it's the other way around: Intel's 512-bit FPU pipelines can be split to execute two independent 256-bit instructions. Zen4 instead has independent 256-bit pipelines on both sides of its FP register file.
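
To illustrate the lane-crossing point with a small sketch (my own example): VPERMPS over a full 512-bit register can pull any of the 16 source elements into any destination lane, so neither 256-bit half of the result can be produced without data from the other half.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* A lane-crossing AVX-512 op: _mm512_permutexvar_ps (VPERMPS) routes
     * any of the 16 source floats to any destination lane. Reversing the
     * vector makes every output lane depend on the opposite 256-bit half,
     * which is why a naive split into two independent 256-bit halves
     * cannot produce the result. Compile with e.g. gcc -O2 -mavx512f. */
    float src[16], out[16];
    for (int i = 0; i < 16; i++) src[i] = (float)i;
    __m512i idx = _mm512_setr_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                    7, 6, 5, 4, 3, 2, 1, 0);
    __m512 r = _mm512_permutexvar_ps(idx, _mm512_loadu_ps(src));
    _mm512_storeu_ps(out, r);
    printf("%g %g\n", out[0], out[15]); /* 15 0: element 15 crossed into lane 0 */
    return 0;
}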
 

SiliconFly

Golden Member
Mar 10, 2023
1,924
1,284
106
I thought Intel 4 was MTL only and Intel 3 was full stack. In which case Intel would be manufacturing client CPU(SOCs) on their N3 node. Maybe we should start using IN3 or something to make it easier to distinguish from TSMC :D.
You're right. Intel 4 has only HP libraries, designed specifically for MTL to reach higher frequencies at the cost of more power. MTL may run a bit hotter and draw more power, like RPL, if it manages to reach higher frequencies (say > 5.5 GHz).

And yes, Intel 3 is full stack. But since it's the first iteration of IFS, it's not expected to be a huge money maker. All bets are on Intel 18A, their next full stack!
 

SiliconFly

Golden Member
Mar 10, 2023
1,924
1,284
106
¯\_(ツ)_/¯ I'm not saying you have to believe it lol I'm just saying that it's Intel's roadmaps not anything I created.

Just curious, anyone know if the ARL 3nm rumors had them on N3 or the N3E process?
Also, the LNL development timeline (tape-out Q4 2022) puts it at an end-of-2024 launch.
It's a shame Intel didn't make any announcement about ARL taping out, though they did for MTL and LNL.
Though when Pat mentioned already having Intel 20A silicon tape-outs in early 2023, he could have been talking about very, very early versions of ARL internally in the foundry. Intel not announcing ARL power-on yet isn't exactly the best of signs, especially for a 1H 2024 launch, and if they haven't announced ARL powering on by the Q2 earnings report, that should pretty much confirm a 1H 2024 launch is impossible and it's 2H.

As for Intel not announcing this stuff, with their finances looking the way they are, Intel is announcing pretty much every win they can get recently, even if it's internal development stuff.
I think MTL & the RPL refresh are set to launch this Q3 (hopefully). Even the RPL refresh won't launch before Q3 this year, I guess.

Since there is no news about an ARL tape-out, even a Q3 2024 launch seems a bit doubtful. So let's forget Q1 2024 for now.

Actually, launching a product based on Intel 20A in Q1 2024 is very unlikely, since Intel itself announced that only PDK 0.5 has been released for ARL. They (both the design team & the node) are going to need at least 6 months to finish tweaking the libraries, and after that ARL needs at least a few steppings after tape-out and power-on to go into manufacturing. So even Q3 2024 is difficult for ARL.

At best, ARL is a Q3 2024 product, with a chance of slipping into early 2025.

Not much is known about the health of 18A, but I'm guessing LNL will need at least a few more months after the ARL launch, considering it's a brand-new architecture on a brand-new node and it hasn't taped out yet. That puts it in Q1 2025 at best (and even that may be too optimistic given the lack of info).
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
AVX512 has lane-crossing instructions, which pretty much make it impossible to just split a 512-bit instruction into two 256-bit ones. Zen4 does not do that either; it has full 512-bit vector registers, just like Intel. Zen3 had a totally different FPU, with separate 256-bit register files per FP pipeline, which is the reason it could not support AVX512. The older definition of a CPU's "bitness" was the size of its register files, and by that definition both Zen4's and Intel's AVX512 FPUs are fully 512-bit.

And fusing two units into an AVX-512 unit is pretty much undoable; it's the other way around: Intel's 512-bit FPU pipelines can be split to execute two independent 256-bit instructions. Zen4 instead has independent 256-bit pipelines on both sides of its FP register file.
The information I see says that AMD splits the 512b ops into two 256b ops issued to the same execution unit across two different cycles, which is not the same as running each half in parallel. They do seem to have a native 512b shuffle unit. Is that what you're thinking of?
 
  • Like
Reactions: Schmide

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Sounds like Intel is still having challenges with Chiplets. EMR is said to contain just 2: https://www.semianalysis.com/p/intel-emerald-rapids-backtracks-on

I find this to be interesting. It would explain a LOT of the rumors of the troubles surrounding Sapphire Rapids prior to launch and also the current rumors with Meteor Lake difficulties.

I guess I'm just curious why Intel is having such a hard time doing something that AMD has seemingly done so easily.
I think that take is mostly spin from the author. EMIB has a lot of overhead, both in power (vs monolithic) and die area. EMR is likely to fit in more "useful" silicon per wafer than SPR, even with the bigger die size. Plus, a year or so to further improve yields.
 

SiliconFly

Golden Member
Mar 10, 2023
1,924
1,284
106
I think that take is mostly spin from the author. EMIB has a lot of overhead, both in power (vs monolithic) and die area. EMR is likely to fit in more "useful" silicon per wafer than SPR, even with the bigger die size. Plus, a year or so to further improve yields.
It appears Intel has started to ditch EMIB in favor of Foveros... finally! EMR gained significant die area by ditching many of the EMIB controllers, and they've reused the reclaimed die space to increase the L3 slice from 1.875 MB to 5 MB per core, which is huge! A fantastic move.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
The information I see says that AMD splits the 512b ops into two 256b ops issued to the same execution unit across two different cycles, which is not the same as running each half in parallel. They do seem to have a native 512b shuffle unit. Is that what you're thinking of?

Zen4 does not split 512-bit ops. It has full 512-bit vector registers, so there's no need to split them; a 512-bit op is executed through the 256-bit execution pipelines over two clock cycles.
 
Jul 27, 2020
28,034
19,139
146
Excellent explanation by C&C: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/

It's pretty complicated: AMD made different decisions at different points to save on transistor cost, unlike Intel, who just forges ahead no matter what it costs.

In summary, AMD’s AVX-512 implementation focuses on better feeding their existing execution capacity, and only using additional die area and power where it’ll make the most impact. The most expensive change probably had to do with extending the vector register file to make each register 512-bits wide. AMD also had to add a mask register file, and other logic throughout the pipeline to handle the new instructions. Like Intel’s client implementations, AMD avoided adding extra floating point execution units, which would have been expensive. Unlike Intel, AMD also left L1D and L2 bandwidth unchanged, and split 512-bit stores into two operations.


The result is a very credible first round AVX-512 implementation. Compared to Intel, AMD still falls short in a few key areas, and is especially at a disadvantage if AVX-512 code demands a lot of load/store bandwidth and fits within core-private caches. But while Zen 4 doesn’t aim as high as Intel does, it still benefits from AVX-512 in many of the same ways that client Intel architectures do. AMD’s support for 512-bit vectors is also stronger than their initial support for 128-bit vectors in K8 Athlons, or 256-bit vectors from Bulldozer to Zen 1. Zen 4 should see clear benefits in applications that can take advantage of AVX-512, without spending a lot of power or die area.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
A 512-bit op is executed through the 256-bit execution pipelines over two clock cycles.
That would, by necessity, mean splitting it...

But I think we're both in agreement on the fundamental reasons why this strategy wouldn't work for Intel's current Atom uarchs. There is no way for them to add in AVX512 support without substantial hardware investment.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
That would, by necessity, mean splitting it...

But I think we're both in agreement on the fundamental reasons why this strategy wouldn't work for Intel's current Atom uarchs. There is no way for them to add in AVX512 support without substantial hardware investment.

Absolutely not. Ops keep track of which instruction is executed to which registers. If you split a 512-bit instruction into two 256-bit instructions, you need two ops to track those two instructions. Instead, if the execution hardware is not as wide as the full register, execution is just double-pumped. It's not a new thing; the Z80 did it, as did the Pentium 4.
 
Jul 27, 2020
28,034
19,139
146
There is no way for them to add in AVX512 support without substantial hardware investment.
Meanwhile, AMD's AVX512 work on Zen 4 has laid the groundwork for future AVX-512 support in their baby cores. Two birds with one stone. Their strategic planning is just more future-oriented, less shortsighted, and more efficient.
 
Jul 27, 2020
28,034
19,139
146
If you split a 512-bit instruction into two 256-bit instructions, you need two ops to track those two instructions.
Quoting from C&C:

Zen 4 partially breaks this tradition, by keeping instructions that work on 512-bit vectors as one micro-op throughout most of the pipeline. Each AVX-512 instruction thus only consumes one entry in the relevant out-of-order execution buffers. I assume they’re broken up into two operations after they enter a 256-bit execution pipe, meaning that the instruction is split into two 256-bit halves as late as possible in the pipeline. I also assume that’s what AMD’s “double pumping” is referring to. Compared to Bulldozer and K8’s approach, this is a huge advantage.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
If you split a 512-bit instruction into two 256-bit instructions, you need two ops to track those two instructions.
Yes, and by all indications, that's what they're doing.
It's not a new thing; the Z80 did it, as did the Pentium 4.
This is not true "double pumping". As originally used, that term meant running part of the pipeline at twice the frequency of the rest. AMD is not doing that here. They are simply cracking a 512b op into two 256b components. And yes, there are certainly complications for the cross-lane interactions, but they're not unsolvable for a 2:1 split.
 
  • Like
Reactions: Schmide

Saylick

Diamond Member
Sep 10, 2012
4,036
9,456
136
I guess I'm just curious why Intel is having such a hard time doing something that AMD has seemingly done so easily.
My theory is that their chiplet woes are the result of the same cultural behavior as their 10nm woes: when Intel falls behind, they think they can just go super aggressive, introduce a bunch of features they have never used simultaneously before, and engineer their way out. They gamble big but constantly come up short, because they need to realize that in the world of semiconductors there are no shortcuts. The only way to win is to play the long game with calculated, extremely measured small steps, and that requires a roadmap that introduces new technology at the right time in the right amount. Without that roadmap, Intel is just constantly chasing a moving target, and every time they fall behind they think they can soothe investor sentiment by just moving the goalposts and slapping on more promises for a later date.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Meanwhile, AMD's AVX512 work on Zen 4 has laid the groundwork for future AVX-512 support in their baby cores. Two birds with one stone. Their strategic planning is just more future-oriented, less shortsighted, and more efficient.
AMD's small cores use the same uarch as their big ones, and most importantly for the Atom comparison, do not scale down as small as Atom does. It's an area efficient solution for the big core, but a liability for the small one. There's also the fact that the biggest market for small cores (cloud) has much less demand for strong vec compute.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
My theory is that their chiplet woes are the result of the same cultural behavior as their 10nm woes: when Intel falls behind, they think they can just go super aggressive, introduce a bunch of features they have never used simultaneously before, and engineer their way out. They gamble big but constantly come up short, because they need to realize that in the world of semiconductors there are no shortcuts. The only way to win is to play the long game with calculated, extremely measured small steps, and that requires a roadmap that introduces new technology at the right time in the right amount. Without that roadmap, Intel is just constantly chasing a moving target, and every time they fall behind they think they can soothe investor sentiment by just moving the goalposts and slapping on more promises for a later date.
I feel like that theory works for many things Intel does, but SPR's specific use of EMIB does not seem to be one of them. They already had the tech working with FPGAs, and we really have no indication that EMIB has given them trouble beyond the inherent power/latency overhead. I felt like we already did the math on this for SPR's monolithic MCC. EMIB takes up a lot of die area, and if your yields can afford it, you'd be better off merging chips together and filling in that area with more cores.
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,073
3,897
136
Yes, and by all indications, that's what they're doing.

This is not true "double pumping". As originally used, that term meant running part of the pipeline at twice the frequency as the rest. AMD is not doing that here. They are simply cracking a 512b op into two 256b components. And yes, there are certainly complications for the cross-lane interactions, but they're not unsolvable for a 2:1 split.
You're wrong and naukkis is right. All the terms you're using (splitting, cracking, etc.) mean something, and that is not what AMD is doing.

This is why the performance of execution units is measured in throughput over cycles.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
You're wrong and naukkis is right. All the terms you're using (splitting, cracking, etc.) mean something, and that is not what AMD is doing.

This is why the performance of execution units is measured in throughput over cycles.
By all indications, AMD is taking two cycles to execute a 512b op, and they are not "double pumping" as Netburst did. If you have a source saying that AMD splits a 512b op across two separate execution units in the same cycle, please post it, because that contradicts everything I've heard from them and from reviews (e.g. the C&C piece above).
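
If anyone wants to sanity-check that themselves, here's a rough probe (my own sketch, not from any review): with enough independent add chains to be throughput-bound rather than latency-bound, a 512b op that occupies a 256b pipe for two cycles should make the ZMM loop take roughly twice as long as the YMM loop.

#include <immintrin.h>
#include <x86intrin.h>
#include <stdio.h>

#define N 50000000ULL

int main(void) {
    /* Eight independent accumulator chains keep each loop throughput-bound.
     * __rdtsc() counts reference cycles, not core clocks, so only the
     * zmm/ymm ratio is meaningful, not the absolute numbers.
     * Compile with e.g. gcc -O2 -mavx512f; run pinned to one core. */
    __m512 b = _mm512_set1_ps(0.5f);
    __m512 z0 = b, z1 = b, z2 = b, z3 = b, z4 = b, z5 = b, z6 = b, z7 = b;
    unsigned long long t0 = __rdtsc();
    for (unsigned long long i = 0; i < N; i++) {
        z0 = _mm512_add_ps(z0, b); z1 = _mm512_add_ps(z1, b);
        z2 = _mm512_add_ps(z2, b); z3 = _mm512_add_ps(z3, b);
        z4 = _mm512_add_ps(z4, b); z5 = _mm512_add_ps(z5, b);
        z6 = _mm512_add_ps(z6, b); z7 = _mm512_add_ps(z7, b);
    }
    unsigned long long zmm = __rdtsc() - t0;

    __m256 d = _mm256_set1_ps(0.5f);
    __m256 y0 = d, y1 = d, y2 = d, y3 = d, y4 = d, y5 = d, y6 = d, y7 = d;
    t0 = __rdtsc();
    for (unsigned long long i = 0; i < N; i++) {
        y0 = _mm256_add_ps(y0, d); y1 = _mm256_add_ps(y1, d);
        y2 = _mm256_add_ps(y2, d); y3 = _mm256_add_ps(y3, d);
        y4 = _mm256_add_ps(y4, d); y5 = _mm256_add_ps(y5, d);
        y6 = _mm256_add_ps(y6, d); y7 = _mm256_add_ps(y7, d);
    }
    unsigned long long ymm = __rdtsc() - t0;

    /* Sink every chain so the compiler can't delete the loops. */
    __m512 zs = _mm512_add_ps(_mm512_add_ps(z0, z1), _mm512_add_ps(z2, z3));
    zs = _mm512_add_ps(zs, _mm512_add_ps(_mm512_add_ps(z4, z5), _mm512_add_ps(z6, z7)));
    __m256 ys = _mm256_add_ps(_mm256_add_ps(y0, y1), _mm256_add_ps(y2, y3));
    ys = _mm256_add_ps(ys, _mm256_add_ps(_mm256_add_ps(y4, y5), _mm256_add_ps(y6, y7)));
    printf("zmm/ymm cycle ratio: %.2f (sink %g)\n",
           (double)zmm / (double)ymm,
           _mm512_reduce_add_ps(zs) + _mm256_cvtss_f32(ys));
    return 0;
}

A ratio near 2 is consistent with the 512b op occupying a 256b pipe for two cycles; a ratio near 1 would point to full-width hardware.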
 

Saylick

Diamond Member
Sep 10, 2012
4,036
9,456
136
I feel like that theory works for many things Intel does, but SPR's specific use of EMIB does not seem to be one of them. They already had the tech working with FPGAs, and we really have no indication that EMIB has given them trouble beyond the inherent power/latency overhead. I felt like we already did the math on this for SPR's monolithic MCC. EMIB takes up a lot of die area, and if your yields can afford it, you'd be better off merging chips together and filling in that area with more cores.
I didn't mean to imply that EMIB as a technology was problematic, more so that Intel's first commercial approach to server CPUs using chiplets or MCMs was heavy-handed; or, said in a way that aligns better with my previous post, the amount of EMIB used was aggressive. Ponte Vecchio is another good example of Intel's heavy-handedness. They saw AMD running with a bunch of small dies making up a big die and basically tried to take it to the next level, but it all seems like a case of "Oh, if the competition is doing X, I'll just make sure I do 2X! That'll get me ahead!" without really making sure doing X is even the right approach for Intel, their products, or their architecture.