Discussion Intel current and future Lakes & Rapids thread


moinmoin

Diamond Member
Jun 1, 2017
5,242
8,456
136
AMD's small cores use the same uarch as their big ones, and most importantly for the Atom comparison, do not scale down as small as Atom does. It's an area efficient solution for the big core, but a liability for the small one. There's also the fact that the biggest market for small cores (cloud) has much less demand for strong vec compute.
Do you happen to know the market share of Atom servers?
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Do you happen to know the market share of Atom servers?
For the market that Sierra Forest and Bergamo will be competing in, nothing right now. Though obviously Intel has a few networking chips that I suppose could count.

For a comparison point, best to look at the ARM vendors, because they're realistically why these small core solutions exist in the first place. Graviton has been a pretty big driver for Amazon, even if they're still majority x86.
 
  • Like
Reactions: A///

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
Did Amazon release any white papers on their in-house processor design? Does anyone know what comes after Granite Rapids for W-class Xeons?
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Maybe the VNNI instruction being used extensively for AI workloads?
I would assume the small-batch inferencing that you'd want a CPU for has a significant affinity with higher perf/thread (i.e. the big cores). And certainly the smaller cache sizes would hurt a lot, given how memory-bandwidth intensive most heavy AVX use cases are.

I'm not going to say that a niche for strong vector compute on the small cores doesn't exist, but all in all, it doesn't seem like the ideal tradeoff for how AMD or Intel are positioning their small core offerings.
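For context, the VNNI op in question is the int8 multiply-accumulate that quantized inference kernels are built around. A minimal sketch (assuming an AVX512_VNNI-capable CPU; the wrapper name is made up for illustration):

```c
#include <immintrin.h>  // compile with -mavx512vnni

// vpdpbusd: multiplies 64 unsigned bytes of a_u8 by 64 signed bytes of
// b_s8, sums each group of 4 products, and accumulates the 16 results
// into the int32 lanes of acc -- 64 MACs per instruction.
static inline __m512i vnni_dot_accum(__m512i acc, __m512i a_u8, __m512i b_s8)
{
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
```

Each such instruction chews through 128 bytes of operands, which is exactly why cache and memory bandwidth dominate these workloads.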
 

eek2121

Diamond Member
Aug 2, 2005
3,414
5,051
136
You're right. Intel 4 has only HP libraries, designed specifically for MTL to reach higher frequencies at the cost of more power. MTL may run a bit hotter and draw more power, like RPL, if it manages to reach higher frequencies (say > 5.5 GHz).

And yes, Intel 3 is full stack. But since it's the first iteration of IFS, it's not expected to be a huge money maker. All bets are on Intel 18A, their next full stack!
Unless Intel botches everything (likely), Meteor Lake appears to be targeting efficient, low-power designs. No 5.5 GHz to be found. The evidence I've seen indicates these chips will be in premium ultrabooks and such. Think 14 cores (6+8) with 10-20 hours of battery life in a sub-2 lb laptop.
I think MTL & the RPL refresh are set to launch this Q3 (hopefully). Even the RPL refresh won't launch before Q3 this year, I guess.

Since there is no news about an ARL tape-out, even a Q3 2024 launch seems a bit doubtful. So let's forget Q1 2024 for now.

Actually, launching a product based on Intel 20A in Q1 2024 is very unlikely, since Intel itself announced that only PDK 0.5 has been released for ARL. They (both the design team and the node) are going to need at least 6 months to finish tweaking the libraries, and after that ARL needs at least a few steppings after tape-out and power-on to go into manufacturing. So ARL even in Q3 2024 is difficult.

At best, ARL is a Q3 2024 product with a chance of slipping into early 2025.

Not much is known about the health of 18A, but I'm guessing LNL will need at least a few more months after the ARL launch, considering it's a brand new architecture on a brand new node and it hasn't taped out yet. That clearly puts it at Q1 2025 at best (but even that may be too optimistic considering the lack of info).

One need only look at Intel’s launch cadence to know Arrow Lake is launching Q4.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
By all indications, AMD is taking two cycles to execute a 512b op, and they are not "double pumping" as NetBurst did. If you have a source that AMD is splitting a 512b op across two separate execution units in the same cycle, please post it, because that contradicts everything I've heard from them and reviews (e.g. C&C above).

Don't confuse this with NetBurst's doubled integer ALU clock here. The Pentium 4 also double pumped both integer and FPU instructions: Willamette/Northwood P4 had only a 16-bit integer ALU and a 64-bit FPU ALU. Both 32-bit integer and 128-bit SSE instructions were executed by pumping the instruction twice through the execution units, just like what AMD is doing with Zen 4.
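As a toy model of that kind of pumping (illustrative only, not any vendor's actual datapath), a 128-bit vector add on 64-bit hardware is just two sequential passes through the same unit:

```c
#include <stdint.h>

typedef struct { uint64_t lane[2]; } vec128;  // one 128-bit register

// Toy model: a 128-bit vector add executed on a 64-bit-wide ALU by
// pumping the same instruction through it twice, one half per pass.
static vec128 add128_on_64bit_alu(vec128 a, vec128 b)
{
    vec128 r;
    for (int pass = 0; pass < 2; pass++)  // pass 0: low half, pass 1: high half
        r.lane[pass] = a.lane[pass] + b.lane[pass];
    return r;
}
```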
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Quoting from C&C:
Zen 4 partially breaks this tradition, by keeping instructions that work on 512-bit vectors as one micro-op throughout most of the pipeline. Each AVX-512 instruction thus only consumes one entry in the relevant out-of-order execution buffers. I assume they’re broken up into two operations after they enter a 256-bit execution pipe, meaning that the instruction is split into two 256-bit halves as late as possible in the pipeline. I also assume that’s what AMD’s “double pumping” is referring to. Compared to Bulldozer and K8’s approach, this is a huge advantage.


That's not fully right. Splitting an instruction is needed when a CPU has, for example, only 64-bit registers and needs to execute an instruction that uses 128-bit registers. When a CPU has registers that match the instruction's register length, there's no need to split anything. If the execution ALUs are not as wide as the register, the instruction is just looped through the ALU as many times as it takes to execute the whole instruction. There are plenty of CPU designs that use that approach.

But AVX-512 has a huge number of physical registers, and taking an existing design and just growing the FPU register file to AVX-512 size isn't area-efficient. AMD solved that problem by redesigning their FPU to use a single, centrally located register file instead of the private register files per FPU pipe that previous Zen designs used. By doing that, AMD's FPU takes a big step towards Intel's big-core designs: if they want to expand their FPU to 512-bit execution pipelines, that isn't a huge step anymore. But it pretty much looks like they got a better implementation by staying with 256-bit execution and a 256-bit load/store engine.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Yes, and by all indications, that's what they're doing.

This is not true "double pumping". As originally used, that term meant running part of the pipeline at twice the frequency of the rest. AMD is not doing that here. They are simply cracking a 512b op into two 256b components. And yes, there are certainly complications for the cross-lane interactions, but they're not unsolvable for a 2:1 split.

That was more marketing wizardry than anything else; since the pipeline is already working at max frequency, there's no way that there's a part that would "pump" the frequency by a factor of 2.

The term relates to a staggered ALU, actually 2 ALUs, that can execute two independent ops in one cycle; at this rate, FMA is also double pumped.


[Attached image: WSUM7.png]



And about the topic, some info from Computerbase, if this wasn't already posted:

[Attached image: 1-1080.7fcc62d9.jpg — Computerbase slide]


 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
That was more marketing wizardry than anything else; since the pipeline is already working at max frequency, there's no way that there's a part that would "pump" the frequency by a factor of 2.

The term relates to a staggered ALU, actually 2 ALUs, that can execute two independent ops in one cycle; at this rate, FMA is also double pumped.
Actually, as both of those ALUs need only half a clock cycle, two of them can be driven sequentially in one clock cycle. So ALU 1 calculates something and ALU 2 uses that calculated value in its inputs in the same clock cycle; there can be a dependency, unlike the case with two parallel ALUs. Willamette/NW P4s had a 0.5 clock cycle ALU latency for simple operations. They also used that double clocking in the AGUs, calculating the first 16 bits of an address in the same clock cycle in which it was used.
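A sketch of that staggering as described (toy code, not P4 circuitry): each ALU takes half a clock, so chaining two of them lets a dependent pair complete within one full cycle:

```c
#include <stdint.h>

// Toy model of the staggered ALU: each simple op takes half a clock, so
// two chained ALUs can execute a *dependent* pair within one full cycle.
static uint32_t half_cycle_alu(uint32_t a, uint32_t b) { return a + b; }

static uint32_t dependent_pair_in_one_cycle(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t t = half_cycle_alu(x, y);  // first half of the cycle (ALU 1)
    return half_cycle_alu(t, z);        // second half, consumes t (ALU 2)
}
```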
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Actually, as both of those ALUs need only half a clock cycle, two of them can be driven sequentially in one clock cycle. So ALU 1 calculates something and ALU 2 uses that calculated value in its inputs in the same clock cycle; there can be a dependency, unlike the case with two parallel ALUs. Willamette/NW P4s had a 0.5 clock cycle ALU latency for simple operations. They also used that double clocking in the AGUs, calculating the first 16 bits of an address in the same clock cycle in which it was used.


There are 2 ALUs on one port; this amounts to double the throughput per clock cycle if the ops are not dependent.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
There are 2 ALUs on one port; this amounts to double the throughput per clock cycle if the ops are not dependent.

It's also mentioned in that link you pasted: Willamette/NW can do 2 simple ALU operations in a single clock cycle even if they are dependent. Prescott did lose that double-clocking scheme.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
It's also mentioned in that link you pasted: Willamette/NW can do 2 simple ALU operations in a single clock cycle even if they are dependent. Prescott did lose that double-clocking scheme.

Methinks it's not pure dependency: if the dependency is chained, that is, if the result of one computation is needed to compute the following op, then there's no way it could be executed in a single cycle.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136

They say that 3 half cycles are required (3 fast clock cycles): 1 op per half cycle (so 2 half cycles for 2 ops), plus one half cycle to process the ALU flags, so 1.5 cycles to have a usable data output. Indeed, they state that throughput is not doubled, just significantly improved.

Notice that each ALU processes at a 16-bit width, so 1 full cycle is necessary to process a 32-bit value, plus the aforementioned 0.5 cycle to process the ALU flags.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
They say that 3 half cycles are required (3 fast clock cycles): 1 op per half cycle (so 2 half cycles for 2 ops), plus one half cycle to process the ALU flags, so 1.5 cycles to have a usable data output. Indeed, they state that throughput is not doubled, just significantly improved.

Notice that each ALU processes at a 16-bit width, so 1 full cycle is necessary to process a 32-bit value, plus the aforementioned 0.5 cycle to process the ALU flags.

It's about dependencies, not total execution time. When the lower 16 bits of an ALU operation are completed, they can be forwarded to the next dependent ALU operation within the same clock cycle, so the dependency latency drops by half: throughput-wise, two dependent ALU operations can be calculated per clock cycle. It's well explained in that Intel document, in the section on the low-latency integer ALU.
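A toy sketch of that staggering (not Intel's circuit): the low 16 bits of an add finish first and are forwardable before the high half and flags are done:

```c
#include <stdint.h>

typedef struct { uint16_t lo, hi; int carry_out; } staggered32;

// Toy model of the staggered 16-bit ALU: a 32-bit add is produced low
// half first, high half plus carry next, flags last. The low 16 bits
// can feed a dependent op's low half a half cycle early.
static staggered32 staggered_add(uint32_t a, uint32_t b)
{
    staggered32 r;
    uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);
    r.lo = (uint16_t)lo;                               // half cycle 1: forwardable
    uint32_t hi = (a >> 16) + (b >> 16) + (lo >> 16);  // add carry from low half
    r.hi = (uint16_t)hi;                               // half cycle 2
    r.carry_out = (int)(hi >> 16);                     // half cycle 3: flag generation
    return r;
}
```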


And related to Zen 4: if you read that Intel Pentium 4 document, you'll find Intel describing how their FPU uses 128-bit registers and ports but 64-bit arithmetic hardware, completing full 128-bit SSE operations in two clock cycles. It's absolutely the same approach that AMD uses for Zen 4, just with 512-bit registers and execution ports and 256-bit arithmetic units.
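To make the parallel concrete, here's the Zen 4 version of the same idea expressed in software (a rough analogy using AVX2 intrinsics; the real split happens inside the FP pipes):

```c
#include <immintrin.h>  // compile with -mavx2

typedef struct { __m256i half[2]; } v512;  // one 512-bit "register"

// Software analogy for Zen 4's AVX-512 handling: one 512-bit add tracked
// as a single op but executed as two 256-bit passes.
static v512 add512_as_two_256(v512 a, v512 b)
{
    v512 r;
    r.half[0] = _mm256_add_epi32(a.half[0], b.half[0]);  // pass 1: low 256 bits
    r.half[1] = _mm256_add_epi32(a.half[1], b.half[1]);  // pass 2: high 256 bits
    return r;
}
```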
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
It's about dependencies, not total execution time. When the lower 16 bits of an ALU operation are completed, they can be forwarded to the next dependent ALU operation within the same clock cycle, so the dependency latency drops by half: throughput-wise, two dependent ALU operations can be calculated per clock cycle. It's well explained in that Intel document, in the section on the low-latency integer ALU.
I understand the concept, but overall it doesn't bring that much better throughput even if it can cope with dependencies; they are still bound by latencies, among others that of the L/S unit.


And related to Zen 4: if you read that Intel Pentium 4 document, you'll find Intel describing how their FPU uses 128-bit registers and ports but 64-bit arithmetic hardware, completing full 128-bit SSE operations in two clock cycles. It's absolutely the same approach that AMD uses for Zen 4, just with 512-bit registers and execution ports and 256-bit arithmetic units.

I don't know all of AMD's approaches exactly, but operands are indeed 64 bits whatever the width; 128b is just 2 x 64b in a row.

Whether to put in enough arithmetic units to execute everything in one cycle, or to reuse fewer of them and execute the ops in several passes, is up to the designers. It can be efficient enough for mixed code; I guess AMD uses this approach to save some complexity, and the inherent power and added silicon. So far this seems to work well for AVX-512 compared to Intel's larger units, at least from a perf/watt point of view.
 

Triskain

Member
Sep 7, 2009
63
33
91
According to recently posted AGESA changelogs, there is something called "PHX2 AM5". I thought Phoenix 2 is the one with the 2x Zen 4 + 4x Zen 4c CCX setup? Any idea if/what APUs will come to the desktop?

(Excuse the off-topic post...)
 

SteinFG

Senior member
Dec 29, 2021
733
869
106
According to recently posted AGESA changelogs, there is something called "PHX2 AM5". I thought Phoenix 2 is the one with the 2x Zen 4 + 4x Zen 4c CCX setup? Any idea if/what APUs will come to the desktop?

(Excuse the off-topic post...)
There's already a thread for Phoenix.
 
  • Like
Reactions: coercitiv