Question Speculation: RDNA3 + CDNA2 Architectures Thread



DisEnchantment

Golden Member
Mar 3, 2017
[attached image: RDNA3 WGP diagram]

My interpretation of patches so far for an RDNA3 WGP.
The real thing is far more complex, with Accumulators, Operand gather/scatter crossbars etc., present all over the place.
WGP coherent L0 and 2x TPs is hopium on my part.

Some anecdotes to go along, as per my understanding of the matter, naturally
  • Frontend is per WGP as usual and hopefully with an increase in Scalar Cache
  • Cache Hierarchy: SGPR [SIMD level] --> Scalar Cache [WGP Level] as usual. VGPR [SIMD level] --> L0 [CU or hopefully WGP level] as usual. Scalar Cache & L0 --> GL1
  • L0 is addressable by all SIMDs within a WGP [Hopium for the issue below, from the manual and optimization guide]
    • While each L0 is coherent within a work-group, software must ensure coherency between the two L0 caches within a dual compute unit.
  • Also, the number of TPs is just a guess, because there has to be a corresponding increase in them to perform image instructions [e.g. texture load and decompression] in order to feed the increased number of SIMDs per CU
    • I did have a brain fart earlier thinking each TP can handle 32 threads. I forgot there were 4 TPs per CU. That is why the ray intersection perf events were in groups of 8, which feed back to the shader in the SIMD via the 32-wide VGPRs. 4 TPs * 8 threads = 32 box tests/clock, which is the SIMD width, duh (see the sketch after this list).
    • In Vega, 4-cycle ops mean the TP has to wait 4 cycles for a new image instruction; in RDNA, 1-cycle wave32 basically means the SIMD can issue image ops every cycle, so the TPs need a heavy upgrade if there are indeed 4 SIMDs per CU
    • Increasing TPs basically means more RT units per CU/Ray Accelerator.
  • Texture ops go via the TP, which in turn may engage the L0, or return immediately if the data is already present in the TC.
  • Shader vector operations bypass the TP, going directly from the VGPRs to the L0.
  • VOPD is per SIMD; if VOPD were meant across SIMDs, it would not really be dual issue, because each SIMD is 1 issue/cycle anyway, can get a different op from the FE any time it wants, and has its own VGPRs.
    • There is only one opcode for each VOPD instruction, which means it goes to one SIMD unit only.
Relying on this enum value to be true to have 4 SIMDs per CU
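A quick sketch of the box-test arithmetic above; the doubled-TP line is just my 2x TP hopium extrapolated, nothing from the patches:

# Ray-box intersection rate per CU, using the numbers above
TPS_PER_CU = 4           # texture processors per CU (RDNA1/2)
BOX_TESTS_PER_TP = 8     # box tests per TP / Ray Accelerator per clock
SIMD_WIDTH = 32          # wave32

assert TPS_PER_CU * BOX_TESTS_PER_TP == SIMD_WIDTH   # 32 box tests/clock == one wave32
print(2 * TPS_PER_CU * BOX_TESTS_PER_TP)             # 64/clock if the TPs double along with the SIMDs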
 

TESKATLIPOKA

Platinum Member
May 1, 2020
Comparison AMD vs Nvidia current and next gen:

              | CUs       | Shaders       | TMUs          | ROPs              | Clocks (MHz) | TFLOPs
6950 XT (N21) | 80        | 5120          | 320           | 128               | 2310         | 23.7
N31           | 96 (+20%) | 12288 (+140%) | 768 ? (+140%) | 128-256 ? (+100%) | 3000 (+30%)  | 73.7 (+211%)

                 | SMs        | Shaders      | TMUs         | ROPs         | Clocks (MHz) | TFLOPs
3090 Ti (GA102)  | 84         | 10752        | 336          | 112          | 1860         | 40
RTX 4090 (AD102) | 128 (+52%) | 16384 (+52%) | 512 ? (+52%) | 128 ? (+14%) | 2750 (+48%)  | 90.1 (+125%)

For simplicity, I will translate the TFLOPs increase directly into a gaming performance increase.
Looking at the increase in TFLOPs, you would expect N31 to totally crush AD102 in raster performance, because N31's TFLOPs gain over its predecessor is 38% higher than AD102's (3.11/2.25 = 1.38), of course under ideal conditions where nothing else is a bottleneck.
Yet, the reality may not be so optimistic.

Why do I say that?
The main reason for the increase in TFLOPs for N31 is the second vALU32 in SIMD32. I am quite skeptical that 1x RDNA3 CU = 2x RDNA2 CU in gaming performance.
If it's actually 1x RDNA3 CU = 1.5-1.8x RDNA2 CU, then the increase in TFLOPs would be comparable to 55.3-66.4 effective TFLOPs in gaming performance, and that's only +133-180% instead of +211%.
In the worst case, you are at the level of the RTX 4090's increase.

The other reason is that Lovelace should have 128x FP32 + 64x INT32 per SM, compared to 64x FP32 + 64x FP32/INT32 in Ampere.
This should provide a significant increase in gaming performance, let's say 25-30%; then it could be +181-193% over the RTX 3090 Ti, and the RTX 4090 would be on par with or a bit faster than N31.
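A small sketch of the scaling math above; the 1.5-1.8x per-CU range is my assumption, the rest comes from the tables:

# TFLOPs scaling vs. predecessors (numbers from the tables above)
n21, n31 = 23.7, 73.7        # FP32 TFLOPs
ga102, ad102 = 40.0, 90.1

print(round((n31 / n21) / (ad102 / ga102), 2))   # ~1.38: N31's jump is ~38% bigger

# If one RDNA3 CU only behaves like 1.5-1.8x an RDNA2 CU in games
# (instead of the 2x the raw TFLOPs imply):
for per_cu in (1.5, 1.8):
    effective = n31 * per_cu / 2.0
    print(round(effective, 1), f"+{round((effective / n21 - 1) * 100)}%")
# -> roughly 55.3 (+133%) and 66.3 (+180%)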
 

TESKATLIPOKA

Platinum Member
May 1, 2020
My interpretation of patches so far for an RDNA3 WGP. [...] Relying on this enum value to be true to have 4 SIMDs per CU.
4 SIMDs per CU + 2x vALU32 per SIMD is highly unlikely I believe, but I would love it.
I will try to apply this.
2x more SIMD32 per CU could provide 80% higher gaming performance.
2x more vALU32 per SIMD32 could provide 30% higher gaming performance.
If I multiply it out, it could be 1.8*1.3 = 2.34
1x RDNA3 CU = 2.34x RDNA2 CU
128x FP32 + 64x INT32 in Ada (Lovelace) could provide 30% higher gaming performance.

Actual performance calculation for N31:
100 * 1.2(CU) * 2.34(RDNA3 CU improvement) * 1.3(clockspeed) => 365
              | CUs       | Shaders       | Clocks (MHz) | TFLOPs        | Actual performance
6950 XT (N21) | 80        | 5120          | 2310         | 23.7          | 100
N31           | 96 (+20%) | 24576 (+380%) | 3000 (+30%)  | 147.5 (+522%) | 365 (+265%)

Actual performance calculation for AD102:
100 * 1.52(SM) * 1.3(ADA SM improvement) * 1.48(clockspeed) => 293
                 | SMs        | Shaders      | Clocks (MHz) | TFLOPs       | Actual performance
3090 Ti (GA102)  | 84         | 10752        | 1860         | 40           | 100
RTX 4090 (AD102) | 128 (+52%) | 16384 (+52%) | 2750 (+48%)  | 90.1 (+125%) | 293 (+193%)

365/293 = 1.25
N31 would end up 25% faster than RTX 4090.
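In Python terms, the two multiplier chains above are simply (all factors are my assumptions from this post):

# N31: +20% CUs, 2.34x per-CU (1.8x from 2x SIMD32 * 1.3x from the 2nd vALU32), +30% clocks
n31 = 100 * 1.2 * 2.34 * 1.3
# AD102: +52% SMs, 1.3x per-SM (128 FP32 + 64 INT32), +48% clocks
ad102 = 100 * 1.52 * 1.3 * 1.48

print(round(n31), round(ad102))   # ~365 and ~292 (rounded up to 293 above)
print(round(n31 / ad102, 2))      # ~1.25 -> N31 ~25% faster under these assumptions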
 

DisEnchantment

Golden Member
Mar 3, 2017
The main reason for the increase in TFLOPs for N31 is the second vALU32 in SIMD32.
4 SIMDs per CU + 2x vALU32 per SIMD is highly unlikely I believe, but I would love it.
While everything is unconfirmed at the moment, this is what an RDNA3 CU looks like (to me at least)
[attached image: speculative RDNA3 CU diagram]
VOPD
v_dual_mul_f32 v11, 0x24681357, v2 :: v_dual_mul_f32 v10, 0x24681357, v5
// GFX11: encoding: [0xff,0x04,0xc6,0xc8,0xff,0x0a,0x0a,0x0b,0x57,0x13,0x68,0x24]
This is a sample VOPD instruction. From a high-level instruction perspective it looks like two back-to-back v_dual_mul_f32 ops, but they are executed at once.
You will notice only one opcode: 0xff,0x04,0xc6,0xc8,0xff,0x0a,0x0a,0x0b,0x57,0x13,0x68,0x24.
This single opcode will be executed by one SIMD.
The granularity of instruction issue is per SIMD unit. You can see two x32-wide vector ops indicated by the VGPRs v11, v2, v10 and v5, which are all 32 lanes wide.
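A small sketch poking at those encoding bytes; the bytes are copied from the comment above, and reading the trailing DWORD as the shared 32-bit literal is my interpretation:

# Encoding bytes copied from the "// GFX11:" comment above
enc = bytes([0xff, 0x04, 0xc6, 0xc8, 0xff, 0x0a, 0x0a, 0x0b,
             0x57, 0x13, 0x68, 0x24])
print(len(enc))                                   # 12: one 64-bit VOPD encoding + one 32-bit literal

# The trailing DWORD appears to be the shared literal, stored little-endian
print(hex(int.from_bytes(enc[8:12], "little")))   # 0x24681357, used by both halves

# So both v_dual_mul_f32 halves ride in a single encoding, sharing one literal,
# and the whole thing is issued as one instruction to one SIMD.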


NUM_SIMD_PER_CU=4
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/include/soc21_enum.h#L95

While NUM_SIMD_PER_CU=4 is not a guarantee, the VOPD instructions are guaranteed, because compiler changes are really complicated and nobody wants to plant a red herring in the code that generates a random opcode which can crash the GPU.

If only the VOPD changes are true, the gains are not going to be much; VOPD cannot issue two FMACs or two image instructions at once, for example. At best a 1.3x gain in actual perf.
Image instructions are those involving texture loads, RT ops, export ops etc.

If AMD is not jebaiting with the kernel patches, then the theoretical/peak throughput per RDNA3 CU would be 4x that of RDNA2.
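A toy model of why VOPD alone tops out well below 2x; the pairing fraction is a made-up parameter just to illustrate:

# If a fraction `paired` of vector instructions can be co-issued as VOPD pairs
# (the rest go alone: FMAC-vs-FMAC, image ops, operand conflicts, ...),
# cycles shrink by half of that fraction:
def vopd_speedup(paired: float) -> float:
    return 1.0 / (1.0 - paired / 2.0)

for paired in (0.2, 0.46, 0.6):
    print(paired, round(vopd_speedup(paired), 2))
# ~0.46 of instructions paired already corresponds to the ~1.3x "at best" above;
# even pairing 60% only gets ~1.43x, nowhere near 2x.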
 

TESKATLIPOKA

Platinum Member
May 1, 2020
If I think about it, adding a second x32 vALU shouldn't increase a CU's size by much (+10-20%?). Even with 96 CUs this chip shouldn't even be 300mm2, thanks to the separate IF chiplets and the 5nm process, unless the CU (WGP) is significantly bigger.

Not to mention, for 2x better performance:
100*1.3(clocks)*1.3(2* x32 vALU)*1.2(more CU) = 203
You don't really need 2x more IF and >1.5x bandwidth in the case of N31 in my opinion.

On the other hand, N33 has only 128-bit GDDR6 and 64 MB of IF, and that's totally not sufficient for 16 WGPs (32 CUs; 128 SIMD32 (4x SIMD per CU) + a dual x32 vALU per SIMD).
In theory, N33 would have 49 TFLOPs at 3GHz, compared to 23.7 TFLOPs for the RX 6950 XT, which has 2x the bandwidth and 2x the IF.
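A quick sketch of both calculations above; the 16 WGP / 4 SIMD per CU / dual vALU32 / 3 GHz figures are the rumored specs, not confirmed:

# "2x better performance" multiplier chain for N31
print(round(100 * 1.3 * 1.3 * 1.2))   # ~203 (+30% clocks, +30% from the 2nd vALU32, +20% CUs)

# Theoretical N33 FP32 throughput under the rumored config
wgps, cus_per_wgp = 16, 2
simds_per_cu, valus_per_simd, lanes = 4, 2, 32
clock_ghz, flops_per_fma = 3.0, 2
tflops = (wgps * cus_per_wgp * simds_per_cu * valus_per_simd
          * lanes * flops_per_fma * clock_ghz) / 1000
print(round(tflops, 1))               # ~49.2 TFLOPs vs 23.7 for the 6950 XT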

From the leaks, there is no evidence of a possible 4x higher peak throughput. The ALU count is only 2x per CU (WGP), and the TFLOPs correspond to that.
I must say, RDNA3 is still a mystery to me, and it could end up either a win or a flop.

I expect a gigantic leap in performance/W for the next-gen mobile GPUs. Maybe even Phoenix, with its supposed 60W RTX 3060 level of performance, would look weak in comparison.
 

moinmoin

Diamond Member
Jun 1, 2017
That NUM_SIMD_PER_CU=4 doesn't seem to be something that's new or specific to RDNA 3, given how many times it seems to be in the code.
Good catch, there doesn't seem to be a single instance in the repository where NUM_SIMD_PER_CU is not 0x4.
 

DisEnchantment

Golden Member
Mar 3, 2017
Good catch, there doesn't seem to be a single instance in the repository where NUM_SIMD_PER_CU is not 0x4.
It is 2 for RDNA1/2 (2x SIMD32), and 4 for GCN (4x SIMD16).

NUM_SIMD_PER_CU = 0x00000002,
If you have worked with HIP or something you can query this also via command line.
This is used by the kernel to know which CU is currently busy or which CUs were harvested by evaluating the CU mask.
It is important for AMD's compute stack

[attached image]
Seems many folks don't have a clone of the kernel locally to grep :D
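For anyone who doesn't want to clone the tree, a minimal sketch of that grep (the local path is obviously an assumption):

import re
from pathlib import Path

# Point this at your clone of the kernel tree
inc_dir = Path("linux/drivers/gpu/drm/amd/include")

pattern = re.compile(r"NUM_SIMD_PER_CU\s*=\s*(0x[0-9a-fA-F]+|\d+)")
for header in sorted(inc_dir.glob("*_enum.h")):
    for match in pattern.finditer(header.read_text(errors="ignore")):
        print(f"{header.name}: NUM_SIMD_PER_CU = {match.group(1)}")
# soc21_enum.h reports 0x00000004, while the RDNA1/2 enum headers report 0x00000002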

If I think about it, adding a second x32 vALU shouldn't increase a CU's size by much.
From Nemez's work for Navi2x
[attached image]

As expected, a chunk of a CU's die size is consumed by the L0 and LDS; an x32 VALU is tiny, only 20% of the size of a CU, around 0.4mm2, which is minuscule.
The 2x SIMD32 in an RDNA2 CU are 55% of the CU, somewhere around 1.1mm2 of die area.

Anyway, adding a second x32 VALU to a SIMD is really opportunistic, because the VGPRs and SGPRs have excess operand bandwidth in the 1x x32 VALU arrangement and can feed all the operands for a few vector ops in a 2x x32 VALU arrangement. With clever arrangement of ops in a wave, you can basically schedule instructions so that you get the most VOPD instructions out of kernel code.
Ignoring cache, adding VOPD or an additional x32 VALU would increase CU size by around 1.2x; adding 2x SIMD would increase CU die size by around 1.5x.
Touching the caches, however, is going to change the CU footprint by a lot.

For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,372
2,864
136
An x32 VALU is tiny, only 20% of the size of a CU, around 0.4mm2. [...] Ignoring cache, adding VOPD or an additional x32 VALU would increase CU size by around 1.2x. [...] For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
So adding 2x more x32 vALUs to a RDNA2 CU should increase the CU size by ~0.8 mm^2.
40% increase in CU die size resulting in let’s say 30% more performance is not bad.

1 CU = 2 mm^2
1 CU(+ second x32 vALU per SIMD) = 2.8 mm^2
For <=30% more performance, you need to increase the chip size by:
N21 -> +12.3%.
N22 -> +9.5%
N23 -> +10.8%
N24 -> +12%
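A quick sketch of those percentages; the N22 and N24 total die sizes (335mm2 and 107mm2) are my assumed figures, the rest comes from the posts above:

# ~0.4mm2 per extra x32 vALU, one per SIMD32, two SIMD32s per CU
extra_per_cu = 2 * 0.4            # ~0.8 mm^2

dies = {                          # (CU count, total die size in mm^2)
    "N21": (80, 520),
    "N22": (40, 335),             # assumed
    "N23": (32, 237),
    "N24": (16, 107),             # assumed
}
for name, (cus, area) in dies.items():
    print(name, f"+{cus * extra_per_cu / area * 100:.1f}%")
# -> roughly +12.3%, +9.6%, +10.8%, +12.0%, matching the list above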

BTW, what is MM?
MC - memory controller
IC - Infinity Cache
MM - ?
 

DisEnchantment

Golden Member
Mar 3, 2017
MultiMedia: DCN + VCN + audio block/Azalia.

So adding 2x more x32 vALUs to a RDNA2 CU should increase the CU size by ~0.8 mm^2.
I would say much, much smaller; remember, the SP32 in the diagram above has 128KB of VGPRs :).
VOPD did not change the VGPR count. Still 256 VGPR per bank as per LLVM.

4 banks * 256 VGPRs * 4 bytes (32 bits) * 32 lanes = 128KB. VGPRs are vector registers and they match the SIMD width.
These 4 banks provide 4 operands per cycle. The SGPRs provide constants, and the instruction encoding can provide immediates. So you get B, C, D, E operands from the VGPR banks, plus 2x K from the scalar GPRs and an immediate from the opcode.
Which means at best you get 2x throughput; at worst the extra VALUs sit dormant and add at most 1.2x die area, because not all operations need 3 operands, e.g. A=B*C*D.
Lots of ops are simply A=B*C, A=B+C, A=B+k*C, etc.
So you can see why VOPD is a great opportunity to increase throughput.
Keep this picture in mind when thinking about SIMD and how wavefronts and opcodes are dispatched.
[attached image]
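A minimal sketch of the register-file arithmetic above; the operand count for the v_dual_mul_f32 pair is my own bookkeeping of the earlier example:

# VGPR file size per SIMD32, as described above
banks, vgprs_per_bank, bytes_per_vgpr, lanes = 4, 256, 4, 32
print(banks * vgprs_per_bank * bytes_per_vgpr * lanes // 1024, "KB")   # 128 KB

# Per cycle: 4 VGPR reads (one per bank) + SGPR constants + an immediate.
# The v_dual_mul_f32 pair earlier needs only 2 VGPR source reads (v2 and v5);
# the other source of each half is the shared literal, so one SIMD's banks
# can comfortably feed both halves in a single cycle.
print(2 <= banks)   # True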
 

Saylick

Diamond Member
Sep 10, 2012
The 2x SIMD32 in an RDNA2 CU are 55% of the CU, somewhere around 1.1mm2 of die area. [...] For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
I was wondering, since AMD's optimized TSMC N5 scales logic at roughly 2x that of N7, they could literally pack in twice the SIMD32 without an increase in CU size. Then, assuming the remaining 45% of the CU is SRAM, which scales at only 1.35x, you get an overall reduction in CU size of 12% or so.

Maths: 55% * 2x SIMD32 / 2x logic scaling + 45% / 1.35x SRAM scaling = 88%

If an RDNA2 CU took up ~2 mm2, then an RDNA3 CU should take up around 1.8 mm2 on N5. The actual CU size might be closer to 2 mm2 again with the addition of beefier RT units.

N21 was 520mm2 with 128 MB of Infinity Cache, which I'm guessing took up ~128 mm2 of die space (similar density to Zen L3 cache). Using your number of 254 mm2 for the CUs, that implies the non-CU portion of N21 was roughly 140mm2. Assuming that stuff scales at 1.35x, on N5 it would be closer to 100 mm2. With 96 RDNA3 CUs taking up 192 mm2, adding back in the L2, CP, GE, ACE, RB, and L1 (~100 mm2) you're within the 300-400 mm2 range. It's totally possible, I think, that the GCD could be ~350mm2 which is a pretty economical use of N5 for doubling N21 performance. That's in comparison to Nvidia using >600mm2 of 4N.

Let me know if my math sucks and I am wrong somewhere in this logic train.
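Here's the same math as a quick sketch (all the scaling factors are the assumptions above):

logic_frac, sram_frac = 0.55, 0.45       # 2x SIMD32 share vs. the rest of an RDNA2 CU
logic_scale, sram_scale = 2.0, 1.35      # assumed N7 -> N5 density gains

# Double the SIMD32s, then shrink everything to N5:
rel_cu = logic_frac * 2 / logic_scale + sram_frac / sram_scale
print(round(rel_cu, 2))                  # ~0.88 -> a doubled-SIMD CU still shrinks ~12%
print(round(2.0 * rel_cu, 2))            # ~1.77 mm^2 per RDNA3 CU on N5

# GCD ballpark: 96 CUs at ~2 mm^2 (allowing for beefier RT) + ~100 mm^2 of shrunk non-CU logic
print(96 * 2.0 + 100)                    # ~292 mm^2, leaving headroom inside a ~350 mm^2 GCD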
 

TESKATLIPOKA

Platinum Member
May 1, 2020
....
N21 was 520mm2 with 128 MB of Infinity Cache, which I'm guessing took up ~128 mm2 of die space (similar density to Zen L3 cache). Using your number of 254 mm2 for the CUs, that implies the non-CU portion of N21 was roughly 140mm2. Assuming that stuff scales at 1.35x, on N5 it would be closer to 100 mm2. With 96 RDNA3 CUs taking up 192 mm2, adding back in the L2, CP, GE, ACE, RB, and L1 (~100 mm2) you're within the 300-400 mm2 range. It's totally possible, I think, that the GCD could be ~350mm2 which is a pretty economical use of N5 for doubling N21 performance. That's in comparison to Nvidia using >600mm2 of 4N.

Let me know if my math sucks and I am wrong somewhere in this logic train.
Your calculation is wrong, because 80 CUs are not 254 mm^2.
For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
80 CUs are 160mm^2 in size and that leaves 94mm^2 for the rest.
520-254=266mm^2 is for the MC, IC, MM and IO (PHY, PCIe etc.), and of that the 128MB Infinity Cache should be only ~80mm^2.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
Tweet
AMD Navi 31 ... Please note: RDNA3 GCD is without memory interface and without Infinity Cache.
GCD 350mm²+
If the memory interface and Infinity Cache really are separate in N31, then the CU (WGP) size must have increased by a lot, considering that this 350mm^2 is a 5nm chip.

edit: If we removed the memory interface and Infinity Cache from N21, then it would most likely be <=400mm^2. The N31 GCD shouldn't be much smaller.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
It would feel so wrong for the 7900 XT's biggest chiplet to be 350mm2 while the mid-range 7600 XT comes in at 400mm2.

I wonder what the total die area will be for the 7900 XT/N31 chiplet package? 1x 350mm2 GCD and 6x 50mm2 MCDs = 650mm2 total die space? Then maybe a 100mm2 I/O die, so 750mm2 total?
 

Saylick

Diamond Member
Sep 10, 2012
It would feel so wrong for the 7900 XT's biggest chiplet to be 350mm2 while the mid-range 7600 XT comes in at 400mm2.

I wonder what the total die area will be for the 7900 XT/N31 chiplet package? 1x 350mm2 GCD and 6x 50mm2 MCDs = 650mm2 total die space? Then maybe a 100mm2 I/O die, so 750mm2 total?
"Babe, I know it's only 350mm2 but trust me, I've got the better node"

I think the total die area for N31 is likely in the low 600mm2 range. Math isn't far off from yours, but ~380mm2 GCD + 6x ~40mm2 MCDs ≈ 620mm2.
 

Aapje

Golden Member
Mar 21, 2022
Even if it was, let's not forget that N31 needs 6 extra chiplets, and those aren't free either, plus the packaging cost.

The chiplet design does allow for a huge halo product and still get very good yields and low defects. Of course, the question is whether they want to have such an expensive product. Will customers be willing to pay that much for an AMD card, even if it is the fastest?
 

DisEnchantment

Golden Member
Mar 3, 2017
I don't think N33 will be anywhere close to 400mm2. Even if N33 were 350mm2, that would already imply a massive jump in architectural Xtor gains across all blocks.
[attached image]
Comparison with N21 should be easier if we are to believe it is being fabbed on same node.

XGMI is deleted in Navi3x; on N21 it was used only for the Radeon Pro W6800X Duo. Navi3x also removes some legacy features like the legacy geometry pipeline, EQAA support and a few more.
With a die size of ~355mm2, you need 2x the CU size, 1.3x the FF/ROP/Ras/RB etc., 2x the L2, 1.5x the CP and 1.2x the MM to hit that die area. Of course the IC and bus width get clipped to half of N21, as rumored.
A more conservative Xtor growth from architectural improvements would put N33 with the rumored specs only around 300mm2 - 310mm2.

Somehow I am not fully convinced this >300mm2 N33 chip will beat N21 as rumored. It is almost 40% smaller if we go conservative, or at best 35% smaller. Perhaps with a 192-bit bus, 96MB of IC and a lot more clock, but lots of resources are still halved due to half the number of SEs.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
If it was N5, I'd agree with you.
Why would N33 be ~350mm^2 only if it was on 5nm? Even N21 should be smaller than that on 5nm.

N33 is not that much different from N23, and that chip is only 237mm^2 on 7nm.
N33 is built on 6nm and should have the same number of WGPs (CUs), SEs and MCs (PHY); I am not sure about the ROPs. The WGP (CU) should be bigger.
Do you think 113mm^2 (350-237) is not enough for an additional 32MB of cache and the architectural changes in the CU (WGP), plus a 10-15% denser process?

N21 vs N22 comparison
Difference in size is 185mm^2.
In that space you have:
+2x SE
+2x WGP(CU)
+32MB IF
+64bit GDDR6
+2x ROPs
+XGMI
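A tiny sketch of both area gaps; the N22 total of 335mm2 is implied by the 185mm2 difference and N21's 520mm2, the rest is from the posts above:

# N33 (rumored ~350mm2 on 6nm) vs N23 (237mm2 on 7nm)
print(350 - 237)   # 113 mm^2 left for +32MB IF and the fatter WGPs

# N21 (520mm2) vs N22 (335mm2, implied)
print(520 - 335)   # 185 mm^2 buys +2x SE, +2x WGP(CU), +32MB IF, +64-bit GDDR6, +2x ROPs, +XGMI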
 