"AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs."
The fact that Zen 4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.
One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute every cycle, because all the other instructions around the AVX instructions then run at full speed.
Hopefully, if that changes to 1 cycle per instruction in Zen 5, it will still be efficient enough not to be a great burden on the TDP, as with Zen 2 gaining full-fat AVX2 long after Intel, on a more efficient process.
Ahhh, so I've had it all wrong then.
Zen 4 AVX512 is not running at half speed/2-cycle execution as AVX2 did on Zen 1/+.
It is in fact running at full speed/1 cycle, but only every 2nd cycle?
Either way, more efficient or not, it seems like AVX512 will only see a minimal speedup over AVX2, as it is still only executing 1x 512-bit instruction every 2 cycles vs 2x 256-bit instructions every 2 cycles.... 🤔
In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.
To be fair, they did not dump it in the DC world.
Emulators seem to have adopted it quite a bit too, certainly in the PS3/X360 generation, not sure about the rest. One even used Intel's transactional memory extensions prior to finding out they are b0rked.
To be fair, the DC/server world is the market segment that they have provided the fewest updates for since their fab node troubles began.
Nah, it's two cycles per op. It's cracked into two 256b pieces.
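The throughput comparison in this exchange (one 512-bit op every 2 cycles vs. two 256-bit ops every 2 cycles) can be put in numbers with a small back-of-the-envelope model. This is only a sketch: the function names are made up for illustration, it looks at a single execution pipe, and it ignores everything else that affects real performance (front-end bandwidth, extra registers, masking, clock behavior).

```c
#include <assert.h>
#include <stdint.h>

/* Peak vector bits processed per cycle by one execution pipe.
 * op_width_bits : operand width of one instruction (e.g. 256 or 512)
 * cycles_per_op : cycles the pipe is occupied by one such instruction
 * Hypothetical helper for illustration, not a model of any real core. */
uint32_t peak_bits_per_cycle(uint32_t op_width_bits, uint32_t cycles_per_op)
{
    assert(cycles_per_op > 0);
    return op_width_bits / cycles_per_op;
}

/* Number of instructions needed to cover total_bits of data (rounded up). */
uint32_t instructions_needed(uint32_t total_bits, uint32_t op_width_bits)
{
    assert(op_width_bits > 0);
    return (total_bits + op_width_bits - 1) / op_width_bits;
}
```

Under these assumptions, peak_bits_per_cycle(512, 2) and peak_bits_per_cycle(256, 1) both come out at 256, so the raw per-pipe data rate is the same. What changes is the instruction count: instructions_needed(4096, 512) is 8, while instructions_needed(4096, 256) is 16.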
DC as in distributed computing. The desktop PCs most commonly used in DC only have it on Zen 4, and I'm not sure how old an Intel CPU has to be to have it.
Oh, you mean like Folding@Home?
Rocket Lake (i9-11900K, i7-11700K, i5-11400 etc.) is the only desktop Intel CPU family to have AVX-512 so far. And it's just a single unit. Intel gives two AVX-512 units to their workstation/server CPUs.
Well, yes, but PrimeGrid uses it more. I do that just to contribute to the team, but day-to-day it's F@H and WCG.
"Intel's flat 512-bit registers also improve permutation (data ordering and mixing) operations, as the data resides in one space and can cross lanes (128-bit boundaries) with minimum penalties."
Zen 4's FPU also has flat 512-bit registers and has the same benefits as Intel. Only the load/store and ALU execution pipelines are 256-bit and need looping to operate on a whole register, but as the registers are full 512-bit, crossing lanes isn't a problem, and Zen 4 actually has fewer penalties from lane crossing than Intel.
Without arguing the technical details, Zen 4 WITHOUT AVX-512 is as good as Intel's best with AVX-512, and when you add AVX-512 to Zen 4, it just adds 20% more. I would say it's a good design, since it does not use a lot more power. I know for a fact, since I own a 9654 and a 9554, and use them in the DC world, along with my 7950Xs.
"Wouldn't permutations happen in the load store?"
I'm using permutations loosely. (unpack, swizzle, permute anything moving data)
Load/store happens at register granularity: 512 bits at a time for 512-bit registers. Of course AVX has scatter/gather to load/store SIMD registers from multiple addresses, but if those are used the load/store unit will be port-limited to less bandwidth, so the optimization guidance is to avoid them.
Registers are basically virtual (remapped) until they enter the pipeline. Intel has 512bit ports that can for example
__m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b)
vpermi2d zmm, zmm, zmm
shuffle between lanes. This is what I call a flat register operation.
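To make the "shuffle between lanes" point concrete, here is a scalar sketch of what `_mm512_permutex2var_epi32` computes, following the intrinsic's documented semantics: for each of the 16 destination elements, bits 3:0 of the index pick an element and bit 4 picks the source register (`a` or `b`). The function name and array-based interface are assumptions for illustration; this is a reference model, not how any hardware implements it.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* 16 x 32-bit elements = one 512-bit register */

/* Scalar reference model of _mm512_permutex2var_epi32 (hypothetical name).
 * Every destination element may come from ANY position in a or b, so the
 * result freely crosses the 128-bit lane boundaries.
 * dst must not alias a, b or idx. */
void permutex2var_epi32_model(const uint32_t a[LANES],
                              const uint32_t idx[LANES],
                              const uint32_t b[LANES],
                              uint32_t dst[LANES])
{
    assert(dst != a && dst != b && dst != idx);
    for (int i = 0; i < LANES; i++) {
        uint32_t sel = idx[i] & 0xFu;        /* bits 3:0: element index   */
        int from_b   = (idx[i] >> 4) & 1u;   /* bit 4: 0 = take from a,
                                                       1 = take from b    */
        dst[i] = from_b ? b[sel] : a[sel];
    }
}
```

With an index vector of 15, 14, ..., 0 this reverses `a` across the whole register, which is exactly the kind of operation that cannot be done by working on each 128-bit lane in isolation.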
For operations within lanes it really doesn't matter whether it's encoded as 2 AVX instructions or 1 AVX-512 instruction. The throughput is the same, as is the order of operations. The only thing you're saving is the size of the instruction(s).
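The in-lane case can be sketched the same way. The scalar model below treats a packed 32-bit add as either one 512-bit pass or two 256-bit halves and shows that the halves never need to see each other's data. All names are hypothetical, and an element-wise add is just one assumed example of an operation that stays within lanes.

```c
#include <assert.h>
#include <stdint.h>

#define LANES_512 16  /* 32-bit elements in a 512-bit register */
#define LANES_256 8   /* 32-bit elements in a 256-bit register */

/* One 256-bit packed add: 8 independent 32-bit additions (wrapping). */
void add_epi32_256_model(const uint32_t *a, const uint32_t *b, uint32_t *dst)
{
    for (int i = 0; i < LANES_256; i++)
        dst[i] = a[i] + b[i];
}

/* One 512-bit packed add done in a single pass over all 16 elements. */
void add_epi32_512_model(const uint32_t *a, const uint32_t *b, uint32_t *dst)
{
    for (int i = 0; i < LANES_512; i++)
        dst[i] = a[i] + b[i];
}

/* The same 512-bit add split into a low and a high 256-bit half.
 * Neither half reads anything from the other, so the split is invisible
 * in the result. */
void add_epi32_2x256_model(const uint32_t *a, const uint32_t *b, uint32_t *dst)
{
    assert(a != 0 && b != 0 && dst != 0);
    add_epi32_256_model(a, b, dst);                                        /* low half  */
    add_epi32_256_model(a + LANES_256, b + LANES_256, dst + LANES_256);    /* high half */
}
```

Both paths produce identical output for any input, which is why splitting only shows up for operations like the cross-lane permute, where a destination element can depend on data from the other half.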
"Maybe this is a defense of still doing 16c on Zen5 since those cores will no doubt be bigger."
Pretty sure this points toward 8C + 16C (of "c" flavor) for Ryzen in the future. This would allow them to keep only 2 dies that fit under the IHS. Probably happening with Zen "6", or whatever it will be called. Since the "c" cores supposedly don't clock as high as the regular ones (right?), maybe they will put the V-cache on those (to make up for their smaller L3).
I interpret this as possibly CPUs with both V-cache and E-cores (probably the rumored 'c' cores for AMD). I hope we won't have to deal with a scheduler that has to take 3 types of cores into account at the same time: high cache, high clock, and high energy efficiency. Probably not though, as we're not likely to see more than 2 CCXs, at least not in the consumer market.
I think for Zen5 I'd ideally still get the regular Zen5 with 16c to avoid the scheduler issues with V-cache (and the iffiness with voltages we're currently seeing). If there's a version with b.L with Zen5c small cores, that might be tempting if there are use cases for more cores, since I have much more faith in the scheduling for b.L working properly after two generations of Intel CPUs with it. However, I've yet to see a case with my current 7950X where I'd have preferred one Zen4c CCX with more cores for my current use cases.
Expecting this as well, maybe even as a surprise for Ryzen 8000 already.
Can't say I am too happy about this. Still a preferable option to the rumored 8+32 solution from Intel (unless that somehow turns out too good). Anyway, I hope it still happens on the AM5 platform. I would like to upgrade my 7950X at some point to a higher core count CPU without the need to get a new board.
