
Question Zen 6 Speculation Thread

Page 407
Unified is Zen 7 time for client.

Zen 6 ISA is final: it's APX and AVX10.1, alongside FRED

‘novalake’
Intel Nova Lake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, AES, PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC, XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI, MOVDIR64B, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT, PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, AVX-VNNI, UINTR, AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16, SHA512, SM3, SM4, PREFETCHI, APX_F, AVX10.1, AVX10.2 and MOVRS instruction set support.

‘znver6’
AMD Family 1ah core based CPUs with x86-64 instruction set support. (This supersets BMI, BMI2, CLWB, F16C, FMA, FSGSBASE, AVX, AVX2, ADCX, RDSEED, MWAITX, SHA, CLZERO, AES, PCLMUL, CX16, MOVBE, MMX, SSE, SSE2, SSE3, SSE4A, SSSE3, SSE4.1, SSE4.2, ABM, XSAVEC, XSAVES, CLFLUSHOPT, POPCNT, RDPID, WBNOINVD, PKU, VPCLMULQDQ, VAES, AVX512F, AVX512DQ, AVX512IFMA, AVX512CD, AVX512BW, AVX512VL, AVX512BF16, AVX512VBMI, AVX512VBMI2, AVX512VNNI, AVX512BITALG, AVX512VPOPCNTDQ, GFNI, AVXVNNI, MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, PREFETCHI, AVXVNNIINT8, AVXIFMA, AVX512FP16, AVXNECONVERT, AVX512BMM and 64-bit instruction set extensions.)
 
32 GPRs make compiler engineers' lives a bit easier, and that's it.
Well, I am not a compiler engineer, and I would love to have 32 GPRs when writing my SIMD code, to be able to hold more pointers for SIMD loads in registers, loop control variables, etc. With the current 16 it takes some gymnastics to avoid spills.

Not to mention that with 32 you can afford to spend some of them on quality-of-life improvements, like having frame pointers always on instead of always off.
 
Well, I am not a compiler engineer, and I would love to have 32 GPRs when writing my SIMD code, to be able to hold more pointers for SIMD loads in registers, loop control variables, etc. With the current 16 it takes some gymnastics to avoid spills.

Not to mention that with 32 you can afford to spend some of them on quality-of-life improvements, like having frame pointers always on instead of always off.
Nova Lake buyer confirmed
 
Well, I am not a compiler engineer, and I would love to have 32 GPRs when writing my SIMD code, to be able to hold more pointers for SIMD loads in registers, loop control variables, etc. With the current 16 it takes some gymnastics to avoid spills.

Not to mention that with 32 you can afford to spend some of them on quality-of-life improvements, like having frame pointers always on instead of always off.
When did you realise you needed more registers, i.e. in which year?

Serious question.
 
Considering everything is held in physical register files that are already 224+ entries, it really comes down to doubling the register alias table and implementing decode of the APX instructions. Seems pretty probable.

I also think it is doable, but is it probable?

Doable as a microcode update. But it may be a bit risky; AMD will let this one go until Zen 7.
 
When did you realise you needed more registers, i.e. in which year?

Serious question.
After I got Zen 4 and ported my homebrew FFT code to use a radix-8 kernel (which itself was possible because AVX-512 gives you 32 regs). So after Nov 2022. I can't tell you exactly what day, but while tuning the code I spotted that clang was spilling a lot of GPRs to the stack (roughly, it was precomputing strided addresses ahead of time, partially spilling them to the stack, and later reading them back). When I rewrote the loop to force it to compute the addresses as it goes, I got some perf back.

If it had more GPRs available, it would be able to keep them all in registers together with the loop control and other aux stuff.

So yes, in that instance it was possible to work around the problem, but having more GPRs would make my life simpler 😉

In general, from a software point of view, having more of them never hurts. Only when you think about the HW implementation does this become a game of trade-offs 😉
 
1991, when I took my assembly-level programming class in college. Looks like we will get them by 2031 🙂
So you never evolved since then?

Although there may be some routines that can do register magic and catch them all.

Most operations do not need that many and can actually just use and forget (register renaming will take care of it). Moreover, the fact that you can address memory for a single use makes use-and-forget even easier. This is, in my opinion, the main advantage of x86.
 
After I got Zen 4 and ported my homebrew FFT code to use a radix-8 kernel (which itself was possible because AVX-512 gives you 32 regs). So after Nov 2022. I can't tell you exactly what day, but while tuning the code I spotted that clang was spilling a lot of GPRs to the stack (roughly, it was precomputing strided addresses ahead of time, partially spilling them to the stack, and later reading them back). When I rewrote the loop to force it to compute the addresses as it goes, I got some perf back.

If it had more GPRs available, it would be able to keep them all in registers together with the loop control and other aux stuff.

So yes, in that instance it was possible to work around the problem, but having more GPRs would make my life simpler 😉

In general, from a software point of view, having more of them never hurts. Only when you think about the HW implementation does this become a game of trade-offs 😉
Good write-up; looks like a future Zen CPU with APX will make your life easier. Also, it's fine if you don't remember the day lol 😂
 
After I got Zen 4 and ported my homebrew FFT code to use a radix-8 kernel (which itself was possible because AVX-512 gives you 32 regs). So after Nov 2022. I can't tell you exactly what day, but while tuning the code I spotted that clang was spilling a lot of GPRs to the stack (roughly, it was precomputing strided addresses ahead of time, partially spilling them to the stack, and later reading them back). When I rewrote the loop to force it to compute the addresses as it goes, I got some perf back.

If it had more GPRs available, it would be able to keep them all in registers together with the loop control and other aux stuff.

So yes, in that instance it was possible to work around the problem, but having more GPRs would make my life simpler 😉

In general, from a software point of view, having more of them never hurts. Only when you think about the HW implementation does this become a game of trade-offs 😉

So there is another ATer smashing their face into FFT programming. I don't force all my calculations into registers; I use a ping-pong buffer for those radix stages. I may go back and code it with just registers, but my bottlenecks really lie below.

radix16-512 no issues.

From 1024 on, the thing that's kicking my ass is way and spinlock (barrier) management. If you're grabbing data from greater-than-64K strides and have the twiddles and the spinlock each taking a way, you can easily blow your way budget.

On Zen 4 and 5 this isn't an issue, as you have 12-16 ways, but you have to code for the lowest common denominator.
 
I also think it is doable, but is it probable?

Doable as a microcode update. But it may be a bit risky; AMD will let this one go until Zen 7.
I think you are likely correct.... but I kinda wish you weren't 😉.
So you never evolved since then?

Although there may be some routines that can do register magic and catch them all.

Most operations do not need that many and can actually just use and forget (register renaming will take care of it). Moreover, the fact that you can address memory for a single use makes use-and-forget even easier. This is, in my opinion, the main advantage of x86.
Well, kind of.

I haven't written much asm for a very long time. I write C as little as possible and generally stay in C++ most of the time (with a little smattering of Flutter/Dart in there just to really confuse the soul).

Even in the embedded stuff I oversee, this level of code is generally only in the startup code on the micro (register setup of the chip's pins and interfaces), and even that is now mostly done by configuration tools provided by the micro OEM (kids have it so easy these days) 🙂.
 
I think you are likely correct.... but I kinda wish you weren't 😉.

Well, kind of.

I haven't written much asm for a very long time. I write C as little as possible and generally stay in C++ most of the time (with a little smattering of Flutter/Dart in there just to really confuse the soul).

Even in the embedded stuff I oversee, this level of code is generally only in the startup code on the micro (register setup of the chip's pins and interfaces), and even that is now mostly done by configuration tools provided by the micro OEM (kids have it so easy these days) 🙂.
I don't write assembly anymore; intrinsics made that a thing of the past. But I do look at the disassembly of my code. You can basically write assembly-level code with tight C/C++.
 