I save more than the $20 I pay by using it to automate some stuff I do, without getting into specifics, such as rewording content I have written.
Admit it, you use it to improve your posts here on the forum.
Hmmm, there’s a thought.
I see some rumors floating around of a 96-core Zen 4 based CPU in the SH5 socket. I admit such a thing did not cross my mind before, but it seems like something that makes too much sense to not materialize. This should take technical computing one step above Genoa-X/Turin-X.
I don't know if AMD had already conceptualized this tiered memory layout with HBM, but if it materializes it is going to be a monster.
L3 --> V-Cache --> Infinity Cache --> HBM --> DDR5
Execufix caught wind of it as usual in the past.
Yeah, had been mentioned in the RDNA4 + CDNA3 thread.
It has been speculated for a while. An SH5 article from Sept. 2021 on wccftech mentions the possibility of a Zen 4 CPU in SH5. The bandwidth would be ridiculous; I am not sure a Zen 4 chiplet can really make full use of it, so it might be overkill. Zen 5, if it has significant FP upgrades, may make better use of it. Since it will likely have a significantly higher max power, it may be able to run even a 96-core part at relatively high clocks, so that should be able to consume a lot of bandwidth, especially with AVX512 in use. If they exist, they are going to be very expensive and power hungry, but I am not sure what would come close to competing with them.
I have been wondering for a long time about the “line” or “page” size for HBM as cache. CPU caches commonly use 64-byte cache lines. This is really tiny for a GPU though, so I have been assuming that the HBM cache probably works on 4K pages. Has anyone seen specifics anywhere?
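To put rough numbers on the line-vs-page question, here is a minimal sketch that only counts how many blocks a cache controller would have to track at each granularity. The 16 GiB capacity is a made-up figure for illustration, not a known spec of any AMD part, and `tracked_entries` is a hypothetical helper.

```python
GIB = 1024 ** 3

def tracked_entries(capacity_bytes, granule_bytes):
    """Number of blocks a cache of this size holds at the given granularity."""
    return capacity_bytes // granule_bytes

# Hypothetical HBM capacity, chosen only to make the arithmetic concrete.
capacity = 16 * GIB

entries_64b = tracked_entries(capacity, 64)     # CPU-style 64-byte lines
entries_4k = tracked_entries(capacity, 4096)    # 4K pages
ratio = entries_64b // entries_4k

print(entries_64b, entries_4k, ratio)
```

Whatever the real capacity is, 4K granularity means 64 times fewer entries to tag than 64-byte lines, which is one reason a page-sized guess seems plausible.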
AVX512 power consumption has not gone off the charts in Zen 4 as is the case with Intel CPUs.
The fact that Zen 4 executes those instructions over 2 cycles instead of 1 may have something to do with the power consumption there.
Hopefully, if that changes to 1 cycle per instruction in Zen 5, it will still be efficient enough that it will not be a great burden on the TDP, as with Zen 2 gaining full-fat AVX2 long after Intel on a more efficient process.
One of the AMD execs, maybe Papermaster, said that it is better to run at full speed and execute AVX512 instructions every other cycle than to run at half speed and execute them every cycle, because all the other instructions around the AVX instructions then run at full speed.
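A toy model of that argument, assuming a strictly serial core that retires one instruction at a time. The instruction counts and clock speeds are invented for illustration, and `run_time_ns` is a hypothetical helper, not a model of the real pipeline.

```python
def run_time_ns(scalar_ops, vector_ops, clock_ghz, cycles_per_vector_op):
    """Time for a serial instruction stream: scalar ops take one cycle each."""
    cycles = scalar_ops + vector_ops * cycles_per_vector_op
    return cycles / clock_ghz

# Hypothetical mix: 900 ordinary instructions around 100 AVX512 instructions.
full_clock = run_time_ns(900, 100, clock_ghz=4.0, cycles_per_vector_op=2)
half_clock = run_time_ns(900, 100, clock_ghz=2.0, cycles_per_vector_op=1)

print(full_clock, half_clock)
```

In this simplified model the full-clock design never loses: the vector work costs the same wall time either way, while everything around it runs twice as fast.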
Ahhh so I've had it all wrong then.
Zen 4 AVX512 is not running half speed/2 cycle execution as AVX2 did on Zen1/+.
It is in fact running full speed/1 cycle, but only for every 2nd cycle?
Either way, more efficient or not, it seems like AVX512 will only see a minimal speedup over AVX2, as it is still only executing 1x 512-bit instruction every 2 cycles vs 2x 256-bit instructions every 2 cycles.... 🤔
In the DC world, we use it quite often. It makes a big difference. Too bad Intel dumped it.
To be fair, they did not dump it in the DC world.
Emulators seem to have adopted it quite a bit too, certainly for the PS3/X360 generation, not sure about the rest. One even used Intel's transactional memory extensions prior to finding out they are b0rked.
To be fair, the DC/server world is the market segment that they have provided the fewest updates for since their fab node troubles began.
Nah, it's two cycles per op. It's cracked into two 256b pieces.
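The arithmetic behind the "1x 512 vs 2x 256" comparison, as a quick sketch for a single execution pipe. It only restates the numbers from the posts above and says nothing about instruction count, masking, or the extra registers; `bits_per_cycle` is a hypothetical helper.

```python
def bits_per_cycle(width_bits, ops, cycles):
    """Vector data processed per cycle for a given op width and issue rate."""
    return width_bits * ops / cycles

# One 512-bit op every 2 cycles vs two 256-bit ops every 2 cycles.
avx512_rate = bits_per_cycle(512, ops=1, cycles=2)
avx2_rate = bits_per_cycle(256, ops=2, cycles=2)

print(avx512_rate, avx2_rate)
```

Both come out to 256 bits per cycle, so on raw data width alone the two are equal; any AVX512 gain has to come from elsewhere.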
DC as in distributed computing. Desktop PCs, the machines most commonly used in DC, only have it on Zen 4, and I am not sure how old an Intel CPU has to be to have it.
Oh you mean like Folding@Home?
Rocket Lake (i9-11900K, i7-11700K, i5-11400 etc.) is the only desktop Intel CPU family to have AVX-512 so far. And it's just a single unit. Intel gives two AVX-512 units to their workstation/server CPUs.
Well, yes, but PrimeGrid uses it more. I do that just to contribute to the team, but day-to-day it's F@H and WCG.
Intel's flat 512-bit registers also improve permutation (data ordering and mixing) operations, as the data resides in one space and can cross lanes (128-bit boundaries) with minimum penalties.
The Zen 4 FPU also has flat 512-bit registers and has the same benefits as Intel. Only the load/store and ALU execution pipelines are 256-bit and need looping to operate on a whole register, but as the registers are full 512-bit, crossing lanes isn't a problem, and Zen 4 actually has fewer penalties from lane crossing than Intel.
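For anyone unsure what "crossing lanes" means here, a small sketch that models a 512-bit register as sixteen 32-bit elements in four 128-bit lanes. It is plain Python, not intrinsics, and the function names are made up; the two functions are only in the spirit of an in-lane shuffle such as VPSHUFD versus a full-width permute such as VPERMD.

```python
LANE_ELEMS = 4    # four 32-bit elements per 128-bit lane
REG_ELEMS = 16    # sixteen 32-bit elements per 512-bit register

def in_lane_shuffle(reg, pattern):
    """Apply the same 4-entry pattern inside every 128-bit lane."""
    out = []
    for base in range(0, REG_ELEMS, LANE_ELEMS):
        out.extend(reg[base + p] for p in pattern)
    return out

def cross_lane_permute(reg, indices):
    """Pick each output element from anywhere in the whole register."""
    return [reg[i] for i in indices]

reg = list(range(REG_ELEMS))
reversed_in_lanes = in_lane_shuffle(reg, [3, 2, 1, 0])
fully_reversed = cross_lane_permute(reg, list(range(REG_ELEMS - 1, -1, -1)))

print(reversed_in_lanes)
print(fully_reversed)
```

The in-lane version can never move element 0 out of the first lane, while the cross-lane version can put it anywhere, which is the kind of operation a flat 512-bit register file is meant to keep cheap.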
Without arguing the technical details, Zen 4 WITHOUT AVX-512 is as good as Intel's best with AVX-512, and when you add AVX-512 to Zen 4, it just adds 20% more. I would say it's a good design, since it does not use a lot more power. I know this for a fact, since I own a 9654 and a 9554, and use them in the DC world, along with my 7950X's.
Wouldn't permutations happen in the load store?
I'm using permutations loosely (unpack, swizzle, permute, anything moving data).
Load/store happens at register width: 512 bits at a time for 512-bit registers. Of course AVX has scatter/gather to load/store SIMD registers from multiple sources, but if those are used, the load/store unit will be port limited to less bandwidth. Optimization guidance will be to avoid using them.