Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs
The fact that Zen4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.

Hopefully, if that changes to 1 cycle per instruction in Zen5, it will still be efficient enough not to be a great burden on the TDP, much as Zen2 gained full-fat AVX2 long after Intel but on a more efficient process.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute them every cycle, because all the other instructions around the AVX instructions then run at full speed.
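
A back-of-the-envelope version of that argument, as a sketch only. It assumes one instruction issues per cycle with no overlap, that a 512-bit op occupies two cycles on the full-speed design, and that the alternative literally halves the clock for the whole instruction stream (the "half speed" figure is taken from the post as is). The function names and instruction counts are made up for illustration.

```c
/* Cost is measured in full-speed clock ticks under a deliberately crude
   model: one instruction issues per cycle and nothing overlaps. */

/* Full clock, but each 512-bit op occupies two cycles. */
unsigned long ticks_double_pumped(unsigned long n_other, unsigned long n_avx512)
{
    return n_other + 2UL * n_avx512;
}

/* Clock halved for the whole stream: every instruction, vector or not,
   costs two full-speed ticks. */
unsigned long ticks_half_clock(unsigned long n_other, unsigned long n_avx512)
{
    return 2UL * (n_other + n_avx512);
}
```

With 900 ordinary instructions and 100 AVX512 ops, the first comes to 1100 ticks and the second to 2000. In this model the two only tie when the stream is nothing but AVX512.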
 

soresu

Diamond Member
Dec 19, 2014
3,230
2,515
136
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,167
15,312
136
In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔
Nah, it's two cycles per op. It's cracked into two, 256b pieces.
 
Jul 27, 2020
20,040
13,738
146
DC as in distributed computing. Of the desktop PCs most commonly used in DC, only Zen 4 has it, and I'm not sure how old an Intel CPU has to be to have it.
Rocket Lake (i9-11900K, i7-11700K, i5-11400 etc.) is the only desktop Intel CPU family to have AVX-512 so far. And it's just a single unit. Intel gives two AVX-512 units to their workstation/server CPUs.

Intel laptops with Ice Lake (i5-1035G1, i7-1065G7) and Tiger Lake (i5-1135G7, i7-1165G7) are the only laptop Intel CPU families to feature AVX-512.

Seems they were getting ready to promote AVX-512 for mainstream use when Zen 3 threw a wrench into their plans and they were forced to rely on E-cores just to be able to stay in the competition.
 

Schmide

Diamond Member
Mar 7, 2002
5,596
730
126
Ahhh so I've had it all wrong then.

Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.

It is in fact running full speed/1 cycle, but only for every 2nd cycle?

Either way more efficient or not it seems like AVX512 will only experience minimal speedup over AVX2 as it is still only executing 1x 512 bit instruction every 2 cycles vs 2x 256 bit instruction every 2 cycles.... 🤔

These are out-of-order, deeply pipelined processors. Cycles per instruction is more about throughput than actual ticks. AVX512 on Zen 4 has half the throughput of AVX (1 vs 1/2, *see below). It still takes 11+ cycles to get through the pipeline. The restriction is that you can only queue half the instructions needed to fill the pipeline (i.e. a limit of ~5). (These are not actual values; I did not look them up.)

Edit: * It could be 2 vs 1, as I believe AVX on Zen 4 can queue 2 instructions per cycle while AVX512 is probably limited to one per cycle. This would also remove the limit on the number of instructions that can queue in the pipeline.
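
To put rough numbers on the "1 vs 1/2" and "2 vs 1" readings: peak data throughput is just issue rate times vector width. A minimal sketch in C, taking those issue rates as assumptions rather than measured values; the function name is made up for illustration.

```c
/* Peak SIMD bits handled in a two-cycle window = ops issued in that
   window * op width. A two-cycle window keeps a half-rate issue
   (one op every other cycle) as a whole number. */
unsigned bits_per_two_cycles(unsigned ops_per_two_cycles, unsigned width_bits)
{
    return ops_per_two_cycles * width_bits;
}
```

Either reading gives matching peaks: `bits_per_two_cycles(2, 256)` and `bits_per_two_cycles(1, 512)` are both 512, and `bits_per_two_cycles(4, 256)` and `bits_per_two_cycles(2, 512)` are both 1024. Half the instruction rate at twice the width moves the same number of bits, so under these assumptions any AVX512 gain would have to come from something other than peak width.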

I don't know if this is still the case with Intel processors, but they incur a penalty for mixing different levels of SIMD instructions (e.g. mixing AVX512 and AVX). This has to do with the preservation of the upper bits in the register file. AMD has stated that they do not incur this penalty or, if they do, that it is diminished.

Intel's flat 512bit registers also improve permutation (data ordering and mixing) operations as the data resides in one space and can cross lanes (128bit boundaries) with minimum penalties.

So as for the speedup for AMD, the ceiling for improvement is certainly not as high, but the floor is not as low. It really depends on a lot of factors.

Probably not correct to use generalizations like minimal.
 

naukkis

Senior member
Jun 5, 2002
903
786
136
Intel's flat 512bit registers also improve permutation (data ordering and mixing) operations as the data resides in one space and can cross lanes (128bit boundaries) with minimum penalties.
Zen4's FPU also has flat 512-bit registers and gets the same benefits as Intel. Only the load/store and ALU execution pipelines are 256-bit and need to loop to operate on a whole register, but since the registers are a full 512 bits, crossing lanes isn't a problem, and Zen4 actually has fewer lane-crossing penalties than Intel.
 

Schmide

Diamond Member
Mar 7, 2002
5,596
730
126
Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,167
15,312
136
Without arguing the technical details, Zen4 WITHOUT AVX-512 is as good as Intel's best with AVX-512, and when you add AVX-512 to Zen 4, it just adds 20% more. I would say it's a good design, since it does not use a lot more power. I know this for a fact, since I own a 9654 and a 9554, and use them in the DC world, along with my 7950Xs.


The picture I added does not seem to show. It's at the bottom of this page.

 

naukkis

Senior member
Jun 5, 2002
903
786
136
Wouldn't permutations happen in the load store?

I'm using permutations loosely. (unpack, swizzle, permute anything moving data)

Load/store happens at register width: 512 bits at a time for 512-bit registers. Of course AVX has scatter/gather to load/store SIMD registers from multiple sources, but when those are used the load/store unit is port-limited to less bandwidth, so the optimization guidance is to avoid them.
 

Schmide

Diamond Member
Mar 7, 2002
5,596
730
126
Registers are basically virtual (remapped) until they enter the pipeline. Intel has 512bit ports that can for example

__m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b)
vpermi2d zmm, zmm, zmm

shuffle between lanes. This is what I call a flat register operation.
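
For anyone who wants to see what that data movement looks like without AVX-512 hardware, here is a scalar sketch in plain C of the selection rule as documented for `_mm512_permutex2var_epi32`. It models only the element selection, not ports, latency, or which operand register the hardware overwrites. The function name `permutex2var_epi32_model` is made up for illustration, and `dst` is assumed not to alias `a` or `b`.

```c
#include <stdint.h>

/* Scalar reference for the data movement of _mm512_permutex2var_epi32:
   a and b together form a 32-entry table of 32-bit elements. For each of
   the 16 outputs, bits 3:0 of the index pick the element and bit 4 picks
   the table (0 = a, 1 = b). Higher index bits are ignored. */
void permutex2var_epi32_model(const uint32_t a[16], const uint32_t idx[16],
                              const uint32_t b[16], uint32_t dst[16])
{
    for (int i = 0; i < 16; i++) {
        uint32_t off = idx[i] & 0xF;              /* element within the table */
        dst[i] = (idx[i] & 0x10) ? b[off] : a[off];
    }
}
```

Any output element can come from any of the 32 input elements, so an index vector of 15, 14, ..., 0 reverses `a` across all four 128-bit lanes in one operation.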

For operations within lanes it really doesn't matter whether it's encoded as 2 AVX or 1 AVX512. The throughput is the same, as is the order of operation. The only thing you're saving is the size of the instruction(s).
 

naukkis

Senior member
Jun 5, 2002
903
786
136
AMD Zen4 also has 512-bit ports. Only execution is 256-bit. As the registers are a full 512 bits, shuffling between lanes isn't a problem even with 256-bit execution units. The Zen4 FPU is totally different from Zen1, which had 128-bit registers and ports and because of that split 256-bit operations into 2x 128-bit instructions. Zen4 doesn't do that: a 512-bit instruction goes through the pipeline as one instruction.

If a 512-bit instruction were split into two 256-bit instructions and a lane-crossing instruction involved parts of both halves, execution would take many more instructions than just two and would need additional register-to-register moves. Zen4, with its 512-bit registers, doesn't suffer from any of those problems.
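
A rough way to see the "many more instructions than just two" point is to emulate a 16-element two-source permute using only 8-element (256-bit) building blocks and count them. This is a scalar C sketch of one possible decomposition, not a description of how any real core cracks instructions. The helpers `perm2x8` and `blend8`, the `uops` counter, and `permute_split_256` are all hypothetical names, and `dst` is assumed not to alias the inputs.

```c
#include <stdint.h>

static unsigned uops; /* counts 256-bit operations used by the emulation */

/* 8-element two-source permute (the 256-bit analogue): bits 2:0 of the
   index pick the element, bit 3 picks the source (0 = lo, 1 = hi). */
static void perm2x8(const uint32_t lo[8], const uint32_t idx[8],
                    const uint32_t hi[8], uint32_t out[8])
{
    for (int j = 0; j < 8; j++)
        out[j] = (idx[j] & 8) ? hi[idx[j] & 7] : lo[idx[j] & 7];
    uops++;
}

/* Per-element select between two 256-bit values, keyed on index bit 4. */
static void blend8(const uint32_t x[8], const uint32_t y[8],
                   const uint32_t idx[8], uint32_t out[8])
{
    for (int j = 0; j < 8; j++)
        out[j] = (idx[j] & 16) ? y[j] : x[j];
    uops++;
}

/* 16-element two-source permute built only from 256-bit pieces.
   Returns the number of 256-bit operations it needed. */
unsigned permute_split_256(const uint32_t a[16], const uint32_t idx[16],
                           const uint32_t b[16], uint32_t dst[16])
{
    uops = 0;
    for (int half = 0; half < 2; half++) {
        const uint32_t *ix = idx + 8 * half;
        uint32_t from_a[8], from_b[8];             /* extra temporaries */
        perm2x8(a, ix, a + 8, from_a);             /* candidate taken from a */
        perm2x8(b, ix, b + 8, from_b);             /* candidate taken from b */
        blend8(from_a, from_b, ix, dst + 8 * half);
    }
    return uops;
}
```

In this decomposition each 256-bit output half has to look at all four 256-bit input halves, so the count comes to six operations plus temporaries instead of two.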
 

Timmah!

Golden Member
Jul 24, 2010
1,513
832
136
Maybe this is a defense of still doing 16c on Zen5 since those cores will no doubt be bigger.



I interpret this as possibly CPUs with both V-cache and E-cores (probably the rumored 'c' cores for AMD). I hope we won't have to deal with a scheduler that has to take 3 types of cores into account at the same time: high cache, high clock, and high energy efficiency. Probably not though, as we're unlikely to see more than 2 CCXs, at least not in the consumer market.

I think for Zen5 I'd ideally still get the regular Zen5 with 16c to avoid the scheduler issues with V-cache (and the iffiness with voltages we're currently seeing). If there's a b.L version with Zen5c small cores, that might be tempting if there are use cases for more cores, since I have much more faith in b.L scheduling working properly after two generations of Intel CPUs with it. However, I've yet to see a case with my current 7950X where I'd have preferred one Zen4c CCX with more cores for my current use cases.
Pretty sure this points toward 8C + 16C (of the "c" flavor) for Ryzen in the future, which would allow AMD to keep just 2 dies that fit under the IHS. It will probably happen with Zen "6", or whatever it will be called. Since the "c" cores supposedly don't clock as high as the regular ones (right?), maybe they will put the V-cache on them (to make up for their smaller L3).
Can't say I am too happy about this. Still a preferable option to the rumored 8+32 solution from Intel (unless that somehow turns out too good). Anyway, I hope it still happens on the AM5 platform. I would like to upgrade my 7950X at some point to a higher-core-count CPU without needing a new board.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
i think they'll maintain 16 large cores. amd isn't going the same way as intel with their c cores. from what i understand they're shrunken-down cores with smt and the same or slightly reduced cache. zen 6 may be on a new platform fwiw. we should know more about it within the next year as we get closer to the zen 5 release and amd loosens their death grip on their canaries.
 

BorisTheBlade82

Senior member
May 1, 2020
680
1,069
136
Pretty sure this points toward 8C + 16C (of the "c" flavor) for Ryzen in the future, which would allow AMD to keep just 2 dies that fit under the IHS. It will probably happen with Zen "6", or whatever it will be called. Since the "c" cores supposedly don't clock as high as the regular ones (right?), maybe they will put the V-cache on them (to make up for their smaller L3).
Can't say I am too happy about this. Still a preferable option to the rumored 8+32 solution from Intel (unless that somehow turns out too good). Anyway, I hope it still happens on the AM5 platform. I would like to upgrade my 7950X at some point to a higher-core-count CPU without needing a new board.
Expecting this as well - maybe even as a surprise for RYZEN 8000 already.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
I wonder if disenchantment knew anything when he made this topic because amd has traditionally skipped generational naming after zen 2.