I found the corresponding patent for what seems like the dual wave32 support in RDNA3.
According to the LLVM commits, RDNA3 can dual-issue wave32 instructions, which looks like what is described in this patent.
View attachment 63171
Each SIMD unit can do 2x FP32 whenever the operand cache can gather all the operands from the VGPR bank.
So whenever operand gather is optimal, each RDNA3 CU can do 2x the FP32 work of an RDNA2 CU per cycle.
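To put rough numbers on that, here is a back-of-the-envelope sketch. The assumptions are mine and not taken from the patent or the commits: 2 SIMD32 units per CU, 32 lanes each, an FMA counted as 2 FLOPs, and a hypothetical `dual_issue_rate` standing in for the fraction of cycles in which the operand cache manages to gather everything for the second instruction.

```python
def fp32_flops_per_clock(simds_per_cu=2, lanes_per_simd=32, dual_issue_rate=0.0):
    """Peak FP32 FLOPs per clock for one CU (illustrative model only).

    dual_issue_rate is a hypothetical knob: the fraction of cycles in which
    a second FP32 instruction can be issued alongside the first.
    """
    fma_flops = 2  # one fused multiply-add counts as 2 FLOPs
    base = simds_per_cu * lanes_per_simd * fma_flops
    return base * (1.0 + dual_issue_rate)


rdna2_like = fp32_flops_per_clock()                      # no dual issue
rdna3_best = fp32_flops_per_clock(dual_issue_rate=1.0)   # operand gather always succeeds
rdna3_half = fp32_flops_per_clock(dual_issue_rate=0.5)   # succeeds half the time

print(rdna2_like, rdna3_best, rdna3_half)
```

The 2x figure only holds at `dual_issue_rate=1.0`; anything that stops the operand cache from gathering both sets of operands pulls the real number back toward the RDNA2-style baseline.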
When needed, RDNA3 can issue a wave64 in 1 cycle.
In the latest batch of commits, which seems to be the largest I have seen in support of any GPU architecture (far more than for RDNA2), scatter/gather support for RDNA3 in LLVM was thoroughly reworked.
I also found the commit indicating that RDNA3 has 1/2 the DP FP rate of RDNA2 (i.e. 1/16 in RDNA2 vs 1/32 in RDNA3), which could support the idea of 2x FP32 throughput per CU: if FP64 throughput per CU is unchanged while FP32 doubles, the ratio halves.
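The ratio argument can be checked with a couple of lines. This is just the arithmetic, assuming (my assumption, not something the commit states) that absolute FP64 throughput per CU stays the same and only FP32 throughput scales; the function name is made up for illustration.

```python
from fractions import Fraction


def fp64_ratio_after_fp32_scaling(old_ratio, fp32_scale):
    """New FP64:FP32 ratio if FP32 throughput is multiplied by fp32_scale
    while absolute FP64 throughput is held constant (assumption)."""
    return old_ratio / fp32_scale


rdna2_ratio = Fraction(1, 16)
rdna3_ratio = fp64_ratio_after_fp32_scaling(rdna2_ratio, 2)
print(rdna3_ratio)  # 1/32
```

So a 1/16 to 1/32 change is exactly what you would expect from doubled FP32 with untouched FP64 hardware, although it would also fit FP64 being cut in half with FP32 unchanged, so on its own it is only circumstantial.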