It's about dependencies, not total execution time. When the lower 16 bits of an ALU operation are completed, they can be forwarded to the next dependent ALU operation within the same clock cycle, so the dependency latency drops to half: throughput-wise, two dependent ALU operations can complete per clock cycle. It's well explained in the "low latency integer ALU" part of that Intel document.
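To make "dependent ALU operations" concrete, here is a minimal sketch of such a chain in C (my own example, not from the Intel document); every operation needs the previous result, so the pace is set by the ALU-to-ALU forwarding latency, not by issue width:

    /* dep_chain.c - a serial dependency chain: each op consumes the previous
       result, so extra issue width does not help. On a double-pumped ALU that
       forwards the low 16 bits after half a cycle, two such dependent ops can
       complete per clock instead of one. */
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        (void)argv;
        uint32_t x = (uint32_t)argc;          /* not a compile-time constant */
        for (uint32_t i = 0; i < 100000000u; i++) {
            x = x + i;                        /* depends on previous x */
            x = x ^ 0x9e3779b9u;              /* depends on the add above */
        }
        printf("%u\n", x);                    /* keep the chain observable */
        return 0;
    }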
I understand the concept, but overall it doesn't bring that much extra throughput. Even if it can cope with dependencies, those are still bound by other latencies, among them the load/store unit's.
And related to Zen 4: if you read that Pentium 4 Intel document, you'll find Intel describing how their FPU uses 128-bit registers and ports but 64-bit arithmetic hardware, completing a full 128-bit SSE operation in two clock cycles. It's essentially the same approach AMD uses for Zen 4, just with 512-bit registers and execution ports and 256-bit arithmetic units.
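For what it's worth, the split is invisible at the ISA level; a quick sketch with AVX-512 intrinsics (my own example, needs a CPU and compiler with AVX-512F, e.g. gcc -O2 -mavx512f):

    /* avx512_add.c - a single 512-bit add from software's point of view.
       On a core with 256-bit vector datapaths this one instruction is
       executed internally as two 256-bit halves; the architectural result
       is identical either way. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512i a = _mm512_set1_epi32(1);
        __m512i b = _mm512_set1_epi32(2);
        __m512i c = _mm512_add_epi32(a, b);   /* one 512-bit op at ISA level */

        int out[16];
        _mm512_storeu_si512(out, c);
        printf("%d\n", out[0]);               /* prints 3 */
        return 0;
    }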
I don't know all of AMD's approaches in detail, but the operands are indeed 64 bits whatever the vector width; 128b is just 2 x 64b in a row.
Whether to put in enough arithmetic units to execute everything in one cycle, or to reuse fewer units and execute the ops in several passes, is up to the designers. The latter can be efficient enough for mixed code; I guess AMD uses this approach to save some complexity, and the power and added silicon that come with it. So far this seems to work well for AVX-512 compared to Intel's wider units, at least from a perf/watt point of view.
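A back-of-envelope comparison of the two design points, with purely illustrative numbers (2 vector pipes, 512-bit ops on 32-bit elements), not vendor specs:

    /* peak.c - rough peak throughput when a vector op is executed in
       vector_bits/datapath_bits passes through each pipe. */
    #include <stdio.h>

    static double peak_elems_per_cycle(int vector_bits, int datapath_bits,
                                       int pipes, int elem_bits) {
        /* instructions per cycle across all pipes, then elements per instr */
        double instr_per_cycle = (double)pipes * datapath_bits / vector_bits;
        return instr_per_cycle * (vector_bits / elem_bits);
    }

    int main(void) {
        printf("2 x 256-bit pipes: %.0f elems/cycle\n",
               peak_elems_per_cycle(512, 256, 2, 32));   /* 16 */
        printf("2 x 512-bit pipes: %.0f elems/cycle\n",
               peak_elems_per_cycle(512, 512, 2, 32));   /* 32 */
        return 0;
    }

Half the peak on paper, but whether that gap shows up in real code depends on how often the wider units can actually be kept fed, which is where the perf/watt argument comes in.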