Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 56

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
I see some rumors floating around of a 96 Core Zen 4 based CPU in SH5 socket. I admit such a thing did not cross my mind before but it seems like something that makes too much sense to not materialize. This should take technical computing one step above Genoa-X/Turin-X
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3--> V-Cache --> Infinity Cache --> HBM --> DDR5
Execufix caught wind of it as usual in the past.
 

moinmoin

Diamond Member
Jun 1, 2017
4,954
7,669
136
I see some rumors floating around of a 96 Core Zen 4 based CPU in SH5 socket. I admit such a thing did not cross my mind before but it seems like something that makes too much sense to not materialize. This should take technical computing one step above Genoa-X/Turin-X
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3--> V-Cache --> Infinity Cache --> HBM --> DDR5
Execufix caught wind of it as usual in the past.
Yeah, had been mentioned in the RDNA4 + CDNA3 thread.

The memory layout will be really interesting with this thing.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I see some rumors floating around of a 96 Core Zen 4 based CPU in SH5 socket. I admit such a thing did not cross my mind before but it seems like something that makes too much sense to not materialize. This should take technical computing one step above Genoa-X/Turin-X
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3--> V-Cache --> Infinity Cache --> HBM --> DDR5
Execufix caught wind of it as usual in the past.

It has been speculated about for a while. An SH5 article from Sept. 2021 on wccftech mentions the possibility of a Zen 4 CPU in SH5. The bandwidth would be ridiculous; I am not sure a Zen 4 chiplet can really make full use of it, so it might be overkill. Zen 5, if it has significant FP upgrades, may make better use of it. Since it will likely have a significantly higher max power, it may be able to run even a 96-core part at relatively high clocks, so it should be able to consume a lot of bandwidth, especially with AVX512 in use. If they exist, they are going to be very expensive and power hungry, but I am not sure what would come close to competing with them.

I have been wondering for a long time about the “line” or “page” size for HBM as cache. CPU caches commonly use 64-byte cache lines. That is really tiny for a GPU though, so I have been assuming that the HBM cache probably operates on 4K pages. Has anyone seen specifics anywhere?
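To put rough numbers on the line-size question, here is a minimal sketch of how many entries (tags) a cache has to track at 64-byte versus 4K granularity. The 128 GiB capacity is purely an assumption for illustration and does not come from any leak or spec; `cache_entries` is a hypothetical helper.

```python
def cache_entries(capacity_bytes: int, line_bytes: int) -> int:
    """Number of lines (and therefore tags) a cache of this size must track."""
    return capacity_bytes // line_bytes


# Hypothetical capacity: 128 GiB of HBM used as a cache (assumed, not sourced).
HBM_CAPACITY = 128 * 2**30

for line_bytes in (64, 4096):
    entries = cache_entries(HBM_CAPACITY, line_bytes)
    print(f"{line_bytes:>5}-byte granularity: {entries:,} tags")
```

Whatever the real capacity is, going from 64-byte lines to 4K pages cuts the number of tags by a factor of 64, which is one reason a coarser granularity is plausible for a cache this large.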
 

Joe NYC

Golden Member
Jun 26, 2021
1,965
2,320
106
It has been speculated about for a while. An SH5 article from Sept. 2021 on wccftech mentions the possibility of a Zen 4 CPU in SH5. The bandwidth would be ridiculous; I am not sure a Zen 4 chiplet can really make full use of it, so it might be overkill. Zen 5, if it has significant FP upgrades, may make better use of it. Since it will likely have a significantly higher max power, it may be able to run even a 96-core part at relatively high clocks, so it should be able to consume a lot of bandwidth, especially with AVX512 in use. If they exist, they are going to be very expensive and power hungry, but I am not sure what would come close to competing with them.

I have been wondering for a long time about the “line” or “page” size for HBM as cache. CPU caches commonly use 64-byte cache lines. That is really tiny for a GPU though, so I have been assuming that the HBM cache probably operates on 4K pages. Has anyone seen specifics anywhere?

AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs. But that was with limited bandwidth into the chiplet. Who knows what happens if the bandwidth goes up by orders of magnitude.

Looking forward to the Zen 5 Turin Dense chiplet, which will likely have 16 cores on N3 and a minimized L3, with some of the old L3 bandwidth demand moving to the system-level cache on the base die. This would reserve the limited and expensive N3 silicon for only the densest logic, while pretty much all of the low-density SRAM, IO and analog would sit on the N6 base die.
 

soresu

Platinum Member
Dec 19, 2014
2,664
1,863
136
AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs
The fact that Zen4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.

Hopefully, if that changes to 1 cycle per instruction in Zen 5, it will still be efficient enough not to be a great burden on the TDP, as with Zen 2 gaining full-fat AVX2 long after Intel, on a more efficient process.
 

Joe NYC

Golden Member
Jun 26, 2021
1,965
2,320
106
The fact that Zen4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.

Hopefully, if that changes to 1 cycle per instruction in Zen 5, it will still be efficient enough not to be a great burden on the TDP, as with Zen 2 gaining full-fat AVX2 long after Intel, on a more efficient process.
One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute them every cycle, because all the other instructions around the AVX instructions then run at full speed.
 

soresu

Platinum Member
Dec 19, 2014
2,664
1,863
136
One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute them every cycle, because all the other instructions around the AVX instructions then run at full speed.
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔
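The arithmetic behind that comparison can be written out as a quick sketch. It only models peak bits per cycle for a single execution pipe, using the instruction counts from the post; `bits_per_cycle` is a hypothetical helper, and nothing here accounts for front-end, register or load/store effects.

```python
def bits_per_cycle(width_bits: int, instructions: int, cycles: int) -> float:
    """Peak vector bits processed per cycle for one execution pipe."""
    return width_bits * instructions / cycles


# One 512-bit instruction every 2 cycles (the case described in the post).
avx512_rate = bits_per_cycle(512, 1, 2)
# Two 256-bit instructions over the same 2 cycles.
avx2_rate = bits_per_cycle(256, 2, 2)

print(avx512_rate, avx2_rate)
```

Under these assumptions the raw execution rate is identical, so any AVX512 gain on such a design would have to come from elsewhere, for example fewer instructions to fetch and decode or the newer instructions themselves.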
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,567
14,520
136
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔
In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔
Nah, it's two cycles per op. It's cracked into two, 256b pieces.
 
Jul 27, 2020
16,338
10,349
106
DC as in distributed computing. Desktop PCs most commonly used in DC only have it with Zen 4, and I am not sure how old an Intel CPU has to be to have it.
Rocket Lake (i9-11900K, i7-11700K, i5-11400 etc.) is the only desktop Intel CPU family to have AVX-512 so far, and it has just a single unit. Intel gives its workstation/server CPUs two AVX-512 units.

Ice Lake (i5-1035G1, i7-1065G7) and Tiger Lake (i5-1135G7, i7-1165G7) are the only Intel laptop CPU families to feature AVX-512.

Seems they were getting ready to promote AVX-512 for mainstream use when Zen 3 threw a wrench into their plans and they were forced to rely on E-cores just to be able to stay in the competition.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔

These are out-of-order (OoO), deeply pipelined processors. Cycles per instruction is more about throughput than actual ticks. AVX512 on Zen 4 has half the throughput of AVX (1 vs 1/2, *see below). It still takes 11+ cycles to get through the pipeline. The restriction is that you can only queue half the instructions needed to fill the pipeline (i.e. a limit of ~5). (These are not actual values; I did not look them up.)

Edit: * It could be 2 vs 1, as I believe AVX on Zen 4 can queue 2 instructions per cycle while probably limiting AVX512 to one per cycle. This would also remove the limit on the number of instructions that can queue in the pipeline.

I don't know if this is still the case with Intel processors, but they incur a penalty for mixing different levels of SIMD instructions (e.g. mixing AVX512 and AVX). This has to do with the preservation of the upper bits in the register file. AMD has stated that they do not incur this penalty, or that if they do it is diminished.

Intel's flat 512bit registers also improve permutation (data ordering and mixing) operations as the data resides in one space and can cross lanes (128bit boundaries) with minimum penalties.

So as for the speedup for AMD, the ceiling for improvement is certainly not as high, but the floor is not as low. It really depends on a lot of factors.

Probably not correct to use generalizations like minimal.
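The latency-versus-throughput point can be sketched with a toy model of an ideal pipeline running independent operations. The 11-cycle latency is the same placeholder used in the post, not a measured Zen 4 figure, and `cycles_to_finish` and `issue_interval` are hypothetical names for illustration.

```python
def cycles_to_finish(n_ops: int, latency: int, issue_interval: int) -> int:
    """Cycles until the last of n independent ops leaves an ideal pipeline.

    latency: cycles for one op to travel through the pipeline.
    issue_interval: cycles between starting consecutive ops (1 = every cycle).
    """
    if n_ops <= 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


LATENCY = 11   # placeholder value, as in the post
N_OPS = 1000

every_cycle = cycles_to_finish(N_OPS, LATENCY, 1)
every_other_cycle = cycles_to_finish(N_OPS, LATENCY, 2)

print(every_cycle, every_other_cycle)
```

For long runs of independent work the pipeline latency nearly vanishes from the total and the issue interval dominates, which is why "cycles per instruction" here is a statement about throughput rather than about how long a single instruction takes.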
 

naukkis

Senior member
Jun 5, 2002
706
578
136
Intel's flat 512bit registers also improve permutation (data ordering and mixing) operations as the data resides in one space and can cross lanes (128bit boundaries) with minimum penalties.
Zen 4's FPU also has flat 512-bit registers and gets the same benefits as Intel. Only the load/store and ALU execution pipelines are 256 bits wide and need to loop to operate on a whole register, but since the registers are a full 512 bits, crossing lanes isn't a problem, and Zen 4 actually has fewer lane-crossing penalties than Intel.
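As a toy illustration of "looping" a narrower pipeline over a full-width register, the sketch below models a 512-bit register as 16 32-bit elements and performs an element-wise add in two passes of 8 elements. This is only a software model under that assumption, not a description of the actual Zen 4 hardware; all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, like a packed 32-bit add


def add_epi32_full_width(a, b):
    """Element-wise 32-bit add across all 16 elements in one pass."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_epi32_double_pumped(a, b, pipe_elems=8):
    """Same add, but fed through a pipe that handles only 8 elements per pass."""
    out = []
    for start in range(0, len(a), pipe_elems):
        chunk_a = a[start:start + pipe_elems]
        chunk_b = b[start:start + pipe_elems]
        out.extend((x + y) & MASK32 for x, y in zip(chunk_a, chunk_b))
    return out


a = list(range(16))
b = [MASK32] * 16
result = add_epi32_double_pumped(a, b)
print(result == add_epi32_full_width(a, b))
```

For an operation like this, where each output element depends only on the matching input elements, the two-pass version gives the same result as the full-width one; only the number of passes differs.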
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Zen 4's FPU also has flat 512-bit registers and gets the same benefits as Intel. Only the load/store and ALU execution pipelines are 256 bits wide and need to loop to operate on a whole register, but since the registers are a full 512 bits, crossing lanes isn't a problem, and Zen 4 actually has fewer lane-crossing penalties than Intel.

Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,567
14,520
136
Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)
Without arguing the technical details, Zen 4 WITHOUT AVX-512 is as good as Intel's best with AVX-512, and when you add AVX-512 to Zen 4, it just adds 20% more. I would say it's a good design, since it does not use a lot more power. I know this for a fact, since I own a 9654 and a 9554 and use them in the DC world, along with my 7950Xs.


The picture I added does not seem to show. It's at the bottom of this page.

 

naukkis

Senior member
Jun 5, 2002
706
578
136
Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)

Load/store happens at register width: 512 bits at a time for 512-bit registers. Of course, AVX has scatter/gather to load/store SIMD registers from multiple sources, but when those are used the load/store unit is port-limited to less bandwidth, so the optimization guidance is to avoid them.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Load/store happens at register width: 512 bits at a time for 512-bit registers. Of course, AVX has scatter/gather to load/store SIMD registers from multiple sources, but when those are used the load/store unit is port-limited to less bandwidth, so the optimization guidance is to avoid them.

Registers are basically virtual (remapped) until they enter the pipeline. Intel has 512bit ports that can for example

__m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b)
vpermi2d zmm, zmm, zmm

shuffle between lanes. This is what I call a flat register operation.

For operations within lanes, it really does not matter whether it's encoded as 2 AVX instructions or 1 AVX512 instruction. The throughput is the same, as is the order of operations. The only thing you're saving is the size of the instruction(s).
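To make the lane-crossing case concrete, here is a scalar Python model of what `_mm512_permutex2var_epi32` computes: each of the 16 output elements picks any element from `a` or `b`, with the low 4 bits of the index choosing the element and bit 4 choosing the source. It is a sketch of the instruction's documented semantics only; it says nothing about ports or latency on any CPU, and `source_lane` is a hypothetical helper added for illustration.

```python
def permutex2var_epi32(a, idx, b):
    """Scalar model of _mm512_permutex2var_epi32 on lists of 16 elements."""
    assert len(a) == len(b) == len(idx) == 16
    out = []
    for sel in idx:
        offset = sel & 0xF                   # low 4 bits: element within the source
        source = b if (sel >> 4) & 1 else a  # bit 4: take from b instead of a
        out.append(source[offset])
    return out


def source_lane(sel):
    """128-bit lane (0..3) that a selector reads from; 4 elements per lane."""
    return (sel & 0xF) // 4


a = list(range(16))
b = list(range(100, 116))

# Full reversal of a: output element 0 comes from lane 3, so it crosses lanes.
reverse_idx = [15 - i for i in range(16)]
reversed_a = permutex2var_epi32(a, reverse_idx, b)

# Interleave the low halves of a and b: a0, b0, a1, b1, ...
interleave_idx = [i // 2 + 16 * (i % 2) for i in range(16)]
interleaved = permutex2var_epi32(a, interleave_idx, b)

print(reversed_a)
print(interleaved)
```

An operation like the reversal cannot be split into two independent 256-bit halves, since every output element may depend on any input element; that is the difference from the within-lane case discussed above.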