Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

biostud · May 26, 2023

eek2121 said:
I save more than the $20 I pay by using it to automate some stuff I do, without getting into specifics, such as rewording content I have written.

Admit it, you use it to improve your posts here on the forum

DisEnchantment · May 28, 2023

I see some rumors floating around of a 96-core Zen 4 based CPU in the SH5 socket. I admit such a thing did not cross my mind before, but it seems like something that makes too much sense not to materialize. This should take technical computing one step above Genoa-X/Turin-X.
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3 --> V-Cache --> Infinity Cache --> HBM --> DDR5
As usual, Execufix caught wind of it some time ago.

https://twitter.com/x/status/1438564239933853698

Ajay · May 28, 2023

biostud said:
Admit it, you use it to improve your posts here on the forum

Hmmm, there’s a thought.

moinmoin · May 28, 2023

DisEnchantment said:
I see some rumors floating around of a 96-core Zen 4 based CPU in the SH5 socket. I admit such a thing did not cross my mind before, but it seems like something that makes too much sense not to materialize. This should take technical computing one step above Genoa-X/Turin-X.
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3 --> V-Cache --> Infinity Cache --> HBM --> DDR5
As usual, Execufix caught wind of it some time ago.

https://twitter.com/x/status/1438564239933853698

Yeah, had been mentioned in the RDNA4 + CDNA3 thread.

The memory layout will be really interesting with this thing.

jamescox · May 31, 2023

DisEnchantment said:
I see some rumors floating around of a 96-core Zen 4 based CPU in the SH5 socket. I admit such a thing did not cross my mind before, but it seems like something that makes too much sense not to materialize. This should take technical computing one step above Genoa-X/Turin-X.
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3 --> V-Cache --> Infinity Cache --> HBM --> DDR5
As usual, Execufix caught wind of it some time ago.

https://twitter.com/x/status/1438564239933853698

It has been speculated for a while. An SH5 article from Sept. 2021 on wccftech mentions the possibility of a Zen 4 CPU in SH5. The bandwidth would be ridiculous; I am not sure a Zen 4 chiplet can really make full use of it, so it might be overkill. Zen 5, if it has significant FP upgrades, may make better use of it. Since it will likely have a significantly higher max power limit, it may be able to run even a 96-core part at relatively high clocks, so it should be able to consume a lot of bandwidth, especially with AVX512 in use. If they exist, they are going to be very expensive and power-hungry, but I am not sure what would come close to competing with them.

I have been wondering for a long time about the “line” or “page” size for HBM as cache. CPU caches commonly use 64-byte cache lines. That is really tiny for a GPU though, so I have been assuming that the HBM cache probably works on 4K pages. Has anyone seen specifics anywhere?
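
The trade-off behind that question can be put in numbers. A minimal sketch, assuming a hypothetical 128 GiB of HBM used as cache and a hypothetical 4 bytes of tag/state per tracked entry (neither figure comes from AMD), comparing how many entries would have to be tracked at 64-byte versus 4 KiB granularity:

```python
# Metadata cost of a memory-side cache at different tracking granularities.
# CAPACITY and TAG_BYTES are illustrative assumptions, not product specs.

GIB = 1024 ** 3
CAPACITY = 128 * GIB   # hypothetical HBM capacity used as cache
TAG_BYTES = 4          # hypothetical tag + state bytes per tracked entry

def tag_overhead(capacity_bytes, granule_bytes, tag_bytes=TAG_BYTES):
    """Return (number of tracked entries, total metadata bytes)."""
    entries = capacity_bytes // granule_bytes
    return entries, entries * tag_bytes

for granule in (64, 4096):
    entries, meta = tag_overhead(CAPACITY, granule)
    print(f"{granule:>5} B granule: {entries:,} entries, {meta / GIB:.3f} GiB of tags")
```

With these placeholder numbers, 64-byte tracking needs 64 times as many entries as 4 KiB tracking. The price of the coarser granule is that each miss moves more data.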

Joe NYC · Jun 1, 2023

jamescox said:
It has been speculated for a while. An SH5 article from Sept. 2021 on wccftech mentions the possibility of a Zen 4 CPU in SH5. The bandwidth would be ridiculous; I am not sure a Zen 4 chiplet can really make full use of it, so it might be overkill. Zen 5, if it has significant FP upgrades, may make better use of it. Since it will likely have a significantly higher max power limit, it may be able to run even a 96-core part at relatively high clocks, so it should be able to consume a lot of bandwidth, especially with AVX512 in use. If they exist, they are going to be very expensive and power-hungry, but I am not sure what would come close to competing with them.

I have been wondering for a long time about the “line” or “page” size for HBM as cache. CPU caches commonly use 64-byte cache lines. That is really tiny for a GPU though, so I have been assuming that the HBM cache probably works on 4K pages. Has anyone seen specifics anywhere?

AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs. But that was with limited bandwidth into the chiplet. Who knows what happens if the bandwidth goes up by orders of magnitude.

Looking forward to the Zen 5 Turin Dense chiplet, which will likely have 16 cores on N3 and a minimized L3. Some of that old L3 bandwidth demand will move to the system-level cache on the base die. This would reserve the limited and expensive N3 silicon for the densest logic, with pretty much all of the low-density SRAM, IO and analog on the N6 base die.

soresu · Jun 1, 2023

Joe NYC said:
AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs

The fact that Zen4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.

Hopefully if that changes to 1 cycle per instruction in Zen5 it will be still efficient enough that it will not be a great burden to the TDP, as with Zen2 gaining full fat AVX2 long after Intel on a more efficient process.

A/// · Jun 1, 2023

Prior to Zen 2 it was 2x 128-bit registers, right? Was the original Zen the same setup, or did AMD further bastardise it?

Joe NYC · Jun 1, 2023

soresu said:
The fact that Zen4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.

Hopefully if that changes to 1 cycle per instruction in Zen5 it will be still efficient enough that it will not be a great burden to the TDP, as with Zen2 gaining full fat AVX2 long after Intel on a more efficient process.

One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute them every cycle, because all the other instructions around the AVX instructions then run at full speed.
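
That argument is easy to check with a toy model. A minimal sketch, assuming a strictly serial core that retires one instruction at a time (a gross simplification) and hypothetical instruction counts and clock; "half speed" is read here as running the whole core at half the clock:

```python
# Toy comparison of two ways to run 512-bit ops on a narrower power budget.
# Serial, one-instruction-at-a-time model; counts and clock are hypothetical.

def time_double_pumped(n_vec, n_other, freq_hz):
    """Full clock; each 512-bit op occupies two cycles, other ops one cycle."""
    return (2 * n_vec + n_other) / freq_hz

def time_half_clock(n_vec, n_other, freq_hz):
    """Half clock; every op, vector or not, takes one (slower) cycle."""
    return (n_vec + n_other) / (freq_hz / 2)

FREQ = 4.0e9          # hypothetical 4 GHz
N_VEC = 1_000_000     # hypothetical number of 512-bit instructions
N_OTHER = 3_000_000   # hypothetical number of surrounding non-vector instructions

a = time_double_pumped(N_VEC, N_OTHER, FREQ)
b = time_half_clock(N_VEC, N_OTHER, FREQ)
print(f"double-pumped at full clock: {a * 1e3:.3f} ms")
print(f"single-cycle at half clock:  {b * 1e3:.3f} ms")
```

In this model the two designs only tie when there is no non-vector work at all; any surrounding code tips the balance toward the full-clock design.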

soresu · Jun 2, 2023

Joe NYC said:
One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute them every cycle, because all the other instructions around the AVX instructions then run at full speed.

Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔
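
The comparison in the last sentence can be worked through directly. A minimal sketch, assuming a single 256-bit-wide execution pipe and ignoring front-end, register and memory effects entirely:

```python
# Data throughput vs instruction count for one 256-bit-wide execution pipe.
# Models only the execution width; front-end and memory effects are ignored.

def per_pipe_rates(op_bits, pipe_bits=256):
    """Return (instructions per cycle, data bits per cycle) for one pipe."""
    cycles_per_op = max(1, op_bits // pipe_bits)  # a 512-bit op takes two passes
    ipc = 1 / cycles_per_op
    return ipc, ipc * op_bits

def instructions_needed(total_bits, op_bits):
    """How many vector instructions it takes to process total_bits of data."""
    return -(-total_bits // op_bits)  # ceiling division

for width in (256, 512):
    ipc, bits = per_pipe_rates(width)
    n = instructions_needed(1 << 20, width)
    print(f"{width}-bit ops: {ipc} instr/cycle, {bits:.0f} bits/cycle, {n} instr per 1 Mibit")
```

Under these assumptions the data rate per pipe is identical for both widths; what changes is that the 512-bit encoding needs half as many instructions to be fetched, decoded and tracked.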

Markfw · Jun 2, 2023

soresu said:
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔

In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.

BorisTheBlade82 · Jun 2, 2023

Markfw said:
In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.

To be fair, they did not dump it in the DC world.

soresu · Jun 2, 2023

Markfw said:
In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.

Emulators seem to have adopted it quite a bit too, certainly in the PS3/X360 generation, not sure about the rest - one even used Intel's transactional memory extensions prior to finding out they are b0rked.

soresu · Jun 2, 2023

BorisTheBlade82 said:
To be fair, they did not dump it in the DC world.

To be fair, the DC/server world is the market segment that they have provided the fewest updates for since their fab node troubles began.

How long was it that Sapphire Rapids was in the oven before release?

Exist50 · Jun 2, 2023

soresu said:
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔

Nah, it's two cycles per op. It's cracked into two, 256b pieces.
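
As an illustration of what "cracked into two 256b pieces" means for a lane-wise operation, here is a scalar model. Vectors are plain Python lists of sixteen 32-bit integers; the function names are made up for this sketch, and the order in which the halves are processed is an assumption, not a statement about the hardware:

```python
# Emulate "cracking" a 512-bit lane-wise add into two 256-bit halves.
# Vectors are plain lists of sixteen 32-bit integers; this is a model, not SIMD.

MASK32 = 0xFFFFFFFF

def add_epi32(a, b):
    """Lane-wise wrapping 32-bit add on equal-length lists."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]

def add_512_in_two_passes(a, b):
    """Process a 16-lane add as two 8-lane (256-bit) passes."""
    assert len(a) == len(b) == 16
    low = add_epi32(a[:8], b[:8])    # first pass: elements 0..7
    high = add_epi32(a[8:], b[8:])   # second pass: elements 8..15
    return low + high

a = list(range(16))
b = [MASK32] * 16                    # adding 0xFFFFFFFF is -1 modulo 2**32

# The split result matches the single full-width operation.
assert add_512_in_two_passes(a, b) == add_epi32(a, b)
print(add_512_in_two_passes(a, b))
```

Because a lane-wise add never moves data between elements, splitting it changes nothing about the result; only operations that move data across the two halves need extra care.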

Markfw · Jun 2, 2023

BorisTheBlade82 said:
To be fair, they did not dump it in the DC world.

DC as in distributed computing. Of the desktop PCs most commonly used in DC, only Zen 4 has it, and I'm not sure how old an Intel chip has to be to have it.

soresu · Jun 2, 2023

Markfw said:
DC as in distributed computing. Of the desktop PCs most commonly used in DC, only Zen 4 has it, and I'm not sure how old an Intel chip has to be to have it.

Oh you mean like Folding@Home?

igor_kavinski · Jun 2, 2023

Markfw said:
DC as in distributed computing. Of the desktop PCs most commonly used in DC, only Zen 4 has it, and I'm not sure how old an Intel chip has to be to have it.

Rocket Lake (i9-11900K, i7-11700K, i5-11400 etc.) is the only mainstream desktop Intel CPU family to have AVX-512 so far. And it's just a single unit. Intel gives two AVX-512 units to their workstation/server CPUs.

Intel laptops with Ice Lake (i5-1035G1, i7-1065G7) and Tiger Lake (i5-1135G7, i7-1165G7) are the only laptop Intel CPU families to feature AVX-512.

Seems they were getting ready to promote AVX-512 for mainstream use when Zen 3 threw a wrench into their plans and they were forced to rely on E-cores just to be able to stay in the competition.

Markfw · Jun 2, 2023

soresu said:
Oh you mean like Folding@Home?

Well, yes, but PrimeGrid uses it more. I do that just to contribute to the team, but day-to-day it's F@H and WCG.

Edit: a lot of other scientific and arithmetic DC apps also use it.

Schmide · Jun 2, 2023

soresu said:
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔

These are OoO (out-of-order), deep-pipeline processors. Cycles per instruction is more about throughput than actual ticks. AVX512 on Zen 4 has half the throughput of AVX (1 vs 1/2, *see below). It still takes 11+ cycles to get through the pipeline. The restriction is that you can only queue half the instructions needed to fill the pipeline (i.e. a limit of ~5). (These are not actual values; I did not look them up.)

Edit: * It could be 2 vs 1, as I believe AVX on Zen 4 can queue 2 instructions per cycle while AVX512 is probably limited to one per cycle. This would also remove the limit on the number of instructions that can queue in the pipeline.
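
The latency-versus-throughput distinction can be sketched with the standard pipelining formula. This is a minimal model that assumes fully independent operations; the latency of 11 cycles is simply the illustrative figure from this post, not a measured value:

```python
# Pipelined execution: latency is paid once, throughput sets the steady-state rate.
# LATENCY follows the post's illustrative "11+ cycles"; it is not a measured value.

def total_cycles(n_ops, latency, issue_interval):
    """Cycles to finish n_ops independent ops issued every issue_interval cycles."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval

LATENCY = 11
for n in (1, 100, 10_000):
    one = total_cycles(n, LATENCY, 1)   # one op issued per cycle
    two = total_cycles(n, LATENCY, 2)   # one op issued every second cycle
    print(f"{n:>6} ops: {one:>6} vs {two:>6} cycles (ratio {two / one:.2f})")
```

For a single operation the issue rate makes no difference, since both cases take the full pipeline latency. The 2x gap only appears once many operations are in flight.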

I don't know if this is still the case with Intel processors, but they incur a penalty for mixing different levels of SIMD instructions (e.g. mixing AVX512 and AVX). This has to do with the preservation of the upper bits in the register file. AMD has stated that they do not incur this penalty, or if they do it is diminished.

Intel's flat 512-bit registers also improve permutation (data ordering and mixing) operations, as the data resides in one space and can cross lanes (128-bit boundaries) with minimal penalties.

So as for the speedup for AMD, the ceiling for improvement is certainly not as high, but the floor is not as low. It really depends on a lot of factors.

Probably not correct to use generalizations like minimal.

naukkis · Jun 2, 2023

Schmide said:
Intel's flat 512-bit registers also improve permutation (data ordering and mixing) operations, as the data resides in one space and can cross lanes (128-bit boundaries) with minimal penalties.

Zen 4's FPU also has flat 512-bit registers and gets the same benefits as Intel. Only the load/store and ALU execution pipelines are 256 bits wide and need to loop to operate on a whole register, but as the registers are a full 512 bits, crossing lanes isn't a problem, and Zen 4 actually has fewer penalties from lane crossing than Intel.

Schmide · Jun 2, 2023

naukkis said:
Zen 4's FPU also has flat 512-bit registers and gets the same benefits as Intel. Only the load/store and ALU execution pipelines are 256 bits wide and need to loop to operate on a whole register, but as the registers are a full 512 bits, crossing lanes isn't a problem, and Zen 4 actually has fewer penalties from lane crossing than Intel.

Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)

Markfw · Jun 2, 2023

Schmide said:
Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)

Without arguing the technical details, Zen 4 WITHOUT AVX-512 is as good as Intel's best with AVX-512, and when you add AVX-512 to Zen 4, it just adds 20% more. I would say it's a good design, since it does not use a lot more power. I know this for a fact, since I own a 9654 and a 9554 and use them in the DC world, along with my 7950Xs.

The picture I added does not seem to show. It's at the bottom of this page.

AVX-512 Performance Comparison: AMD Genoa vs. Intel Sapphire Rapids & Ice Lake Review - Phoronix

www.phoronix.com

naukkis · Jun 2, 2023

Schmide said:
Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)

Load/store happens at register granularity: 512 bits at a time for 512-bit registers. Of course AVX has scatter/gather to load/store SIMD registers from multiple sources, but if those are used the load/store unit will be port-limited to less bandwidth, so the optimization guidance will be to avoid them.
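
One way to see why a gather is more expensive than a plain vector load is to count the cache lines each one touches. The sketch below uses hypothetical addresses and a hypothetical 256-byte stride; it counts cache lines only, as a rough proxy for load-port pressure, and does not model how Zen 4 or Intel actually implement gathers:

```python
# Count the 64-byte cache lines touched by a contiguous 512-bit load
# versus a 16-element gather of 32-bit values. Addresses are hypothetical.

LINE = 64  # bytes per cache line (the common x86 size)

def lines_touched(byte_addresses, elem_size=4):
    """Set of cache-line indices covered by elements at the given addresses."""
    lines = set()
    for addr in byte_addresses:
        lines.add(addr // LINE)
        lines.add((addr + elem_size - 1) // LINE)  # element may straddle a line
    return lines

base = 0x1000                                   # 64-byte aligned, hypothetical
contiguous = [base + 4 * i for i in range(16)]  # one aligned 64-byte load
strided = [base + 256 * i for i in range(16)]   # gather with a 256-byte stride

print(len(lines_touched(contiguous)), "line(s) for the contiguous load")
print(len(lines_touched(strided)), "line(s) for the gather")
```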

Schmide · Jun 2, 2023

naukkis said:
Load/store happens at register granularity: 512 bits at a time for 512-bit registers. Of course AVX has scatter/gather to load/store SIMD registers from multiple sources, but if those are used the load/store unit will be port-limited to less bandwidth, so the optimization guidance will be to avoid them.

Registers are basically virtual (remapped) until they enter the pipeline. Intel has 512-bit ports that can, for example,

__m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b)
vpermi2d zmm, zmm, zmm

shuffle between lanes. This is what I call a flat register operation.

For operations within lanes it really doesn't matter whether it's encoded as 2 AVX instructions or 1 AVX512 instruction. The throughput is the same, as is the order of operations. The only thing you're saving is the size of the instruction(s).
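
For readers unfamiliar with the intrinsic above, here is a scalar model of its documented behaviour: bit 4 of each index picks the source vector and bits 3:0 pick the element. The helper `crosses_lane` and the sample index vectors are made up for this sketch; real code would use the intrinsic itself on AVX-512 hardware.

```python
# Scalar model of _mm512_permutex2var_epi32 (the vpermi2d / vpermt2d operation).
# Vectors are lists of sixteen 32-bit values; a lane is one 128-bit group of four.

def permutex2var_epi32(a, idx, b):
    """dst[i] = (b if bit 4 of idx[i] is set else a)[idx[i] & 0xF]."""
    assert len(a) == len(idx) == len(b) == 16
    dst = []
    for sel in idx:
        src = b if sel & 0x10 else a   # bit 4 picks the source vector
        dst.append(src[sel & 0x0F])    # bits 3:0 pick the element
    return dst

def crosses_lane(idx):
    """True if any output element is taken from a different 128-bit lane than the one it lands in."""
    return any(((sel & 0x0F) // 4) != (i // 4) for i, sel in enumerate(idx))

a = [100 + i for i in range(16)]
b = [200 + i for i in range(16)]

reverse_a = list(range(15, -1, -1))                      # full reversal of a
interleave = [x for i in range(8) for x in (i, 16 + i)]  # a0, b0, a1, b1, ...

print(permutex2var_epi32(a, reverse_a, b))
print(permutex2var_epi32(a, interleave, b))
print(crosses_lane(reverse_a), crosses_lane(list(range(16))))
```

Both sample index vectors pull elements across 128-bit lane boundaries, which is exactly the kind of shuffle that cannot simply be split into two independent 256-bit halves.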
