Question Speculation: RDNA3 + CDNA2 Architectures Thread



DisEnchantment

Golden Member
Mar 3, 2017
[attached image: RDNA3 WGP diagram]

My interpretation of patches so far for an RDNA3 WGP.
The real thing is far more complex, with Accumulators, Operand gather/scatter crossbars etc., present all over the place.
WGP coherent L0 and 2x TPs is hopium on my part.

Some anecdotes to go along, as per my understanding of the matter, naturally
  • Frontend is per WGP as usual and hopefully with an increase in Scalar Cache
  • Cache Hierarchy: SGPR [SIMD level] --> Scalar Cache [WGP Level] as usual. VGPR [SIMD level] --> L0 [CU or hopefully WGP level] as usual. Scalar Cache & L0 --> GL1
  • L0 is addressable by all SIMDs within a WGP [Hopium for the issue below, from the manual and optimization guide]
    • While each L0 is coherent within a work-group, software must ensure coherency between the two L0 caches within a dual compute unit.
  • Also, the number of TPs is just a guess, because there has to be a corresponding increase in them to perform image instructions [e.g. texture load and decompression] in order to feed the increased number of SIMDs per CU
    • I did have a brain fart earlier thinking each TP can handle 32 threads. I forgot there were 4 TPs per CU. That is why the ray intersection perf events were in groups of 8, which feed back to the shader in the SIMD via the 32-wide VGPRs. 4 TPs * 8 threads = 32 box tests/clock, which is the SIMD width, duh (see the sketch after this list).
    • In Vega, 4-cycle ops mean the TP has to wait 4 cycles for a new image instruction; in RDNA, 1-cycle wave32 basically means the SIMD can issue image ops every cycle, so the TPs need a heavy upgrade if there are indeed 4 SIMDs per CU
    • Increasing TPs basically means more RT units per CU/Ray Accelerator.
  • Texture ops go via the TP, which in turn may engage the L0, or return immediately if the data is already present in the TC.
  • Shader vector operations bypass the TP, going directly from the VGPRs to the L0.
  • VOPD is per SIMD; if VOPD were meant across SIMDs, it would not really be dual issue, because each SIMD is 1 issue/cycle anyway, can get a different op from the FE any time it wants, and has its own VGPRs.
    • There is only one opcode for each VOPD instruction, which means it goes to one SIMD unit only.
Relying on this enum value to be true to have 4 SIMDs per CU
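A quick sketch of the box-test arithmetic above; the doubled-TP line is just my 2x TP hopium extrapolated, nothing from the patches:

# Ray-box intersection rate per CU, using the numbers above
TPS_PER_CU = 4           # texture processors per CU (RDNA1/2)
BOX_TESTS_PER_TP = 8     # box tests per TP / Ray Accelerator per clock
SIMD_WIDTH = 32          # wave32

assert TPS_PER_CU * BOX_TESTS_PER_TP == SIMD_WIDTH   # 32 box tests/clock == one wave32
print(2 * TPS_PER_CU * BOX_TESTS_PER_TP)             # 64/clock if the TPs double along with the SIMDs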
 

TESKATLIPOKA

Platinum Member
May 1, 2020
Comparison AMD vs Nvidia current and next gen:

              | CUs       | Shaders       | TMUs          | ROPs              | Clocks (MHz) | TFLOPs
6950 XT (N21) | 80        | 5120          | 320           | 128               | 2310         | 23.7
N31           | 96 (+20%) | 12288 (+140%) | 768 ? (+140%) | 128-256 ? (+100%) | 3000 (+30%)  | 73.7 (+211%)

                 | SMs        | Shaders      | TMUs         | ROPs         | Clocks (MHz) | TFLOPs
3090 Ti (GA102)  | 84         | 10752        | 336          | 112          | 1860         | 40
RTX 4090 (AD102) | 128 (+52%) | 16384 (+52%) | 512 ? (+52%) | 128 ? (+14%) | 2750 (+48%)  | 90.1 (+125%)

For simplicity, I will translate the TFLOPs increase directly into a gaming performance increase.
Looking at the increase in TFLOPs, you would expect N31 to totally crush AD102 in raster performance, because N31's TFLOPs gain over its predecessor is 38% higher than AD102's (3.11/2.25 = 1.38), of course under ideal conditions where nothing else is a bottleneck.
Yet, the reality may not be so optimistic.

Why do I say that?
The main reason for the increase in TFLOPs for N31 is the second vALU32 in SIMD32. I am quite skeptical that 1x RDNA3 CU = 2x RDNA2 CU in gaming performance.
If it's actually 1x RDNA3 CU = 1.5-1.8x RDNA2 CU, then the increase in TFLOPs would be comparable to 55.3-66.4 effective TFLOPs in gaming performance, and that's only +133-180% instead of +211%.
In the worst case, you are at the level of the RTX 4090's increase.

The other reason is that Lovelace should have 128x FP32 + 64x INT32 per SM, compared to 64x FP32 + 64x FP32/INT32 in Ampere.
This should provide a significant increase in gaming performance, let's say 25-30%; then it could be +181-193% over the RTX 3090 Ti, and the RTX 4090 would be on par with or a bit faster than N31.
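A small sketch of the scaling math above; the 1.5-1.8x per-CU range is my assumption, the rest comes from the tables:

# TFLOPs scaling vs. predecessors (numbers from the tables above)
n21, n31 = 23.7, 73.7        # FP32 TFLOPs
ga102, ad102 = 40.0, 90.1

print(round((n31 / n21) / (ad102 / ga102), 2))   # ~1.38: N31's jump is ~38% bigger

# If one RDNA3 CU only behaves like 1.5-1.8x an RDNA2 CU in games
# (instead of the 2x the raw TFLOPs imply):
for per_cu in (1.5, 1.8):
    effective = n31 * per_cu / 2.0
    print(round(effective, 1), f"+{round((effective / n21 - 1) * 100)}%")
# -> roughly 55.3 (+133%) and 66.3 (+180%)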
 

TESKATLIPOKA

Platinum Member
May 1, 2020
My interpretation of patches so far for an RDNA3 WGP. [...] Relying on this enum value to be true to have 4 SIMDs per CU.
4 SIMDs per CU + 2x vALU32 per SIMD is highly unlikely I believe, but I would love it.
I will try to apply this.
2x more SIMD32 per CU could provide 80% higher gaming performance.
2x more vALU32 per SIMD32 could provide 30% higher gaming performance.
If I multiply it out, it could be 1.8*1.3 = 2.34
1x RDNA3 CU = 2.34x RDNA2 CU
128x FP32 + 64x INT32 in Ada (Lovelace) could provide 30% higher gaming performance.

Actual performance calculation for N31:
100 * 1.2(CU) * 2.34(RDNA3 CU improvement) * 1.3(clockspeed) => 365
              | CUs       | Shaders       | Clocks (MHz) | TFLOPs        | Actual performance
6950 XT (N21) | 80        | 5120          | 2310         | 23.7          | 100
N31           | 96 (+20%) | 24576 (+380%) | 3000 (+30%)  | 147.5 (+522%) | 365 (+265%)

Actual performance calculation for AD102:
100 * 1.52(SM) * 1.3(ADA SM improvement) * 1.48(clockspeed) => 293
                 | SMs        | Shaders      | Clocks (MHz) | TFLOPs       | Actual performance
3090 Ti (GA102)  | 84         | 10752        | 1860         | 40           | 100
RTX 4090 (AD102) | 128 (+52%) | 16384 (+52%) | 2750 (+48%)  | 90.1 (+125%) | 293 (+193%)

365/293 = 1.25
N31 would end up 25% faster than RTX 4090.
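In Python terms, the two multiplier chains above are simply (all factors are my assumptions from this post):

# N31: +20% CUs, 2.34x per-CU (1.8x from 2x SIMD32 * 1.3x from the 2nd vALU32), +30% clocks
n31 = 100 * 1.2 * 2.34 * 1.3
# AD102: +52% SMs, 1.3x per-SM (128 FP32 + 64 INT32), +48% clocks
ad102 = 100 * 1.52 * 1.3 * 1.48

print(round(n31), round(ad102))   # ~365 and ~292 (rounded up to 293 above)
print(round(n31 / ad102, 2))      # ~1.25 -> N31 ~25% faster under these assumptions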
 

DisEnchantment

Golden Member
Mar 3, 2017
The main reason for the increase in TFLOPs for N31 is the second vALU32 in SIMD32.
4 SIMDs per CU + 2x vALU32 per SIMD is highly unlikely I believe, but I would love it.
While everything is unconfirmed at the moment, this is what an RDNA3 CU looks like (to me at least)
[attached image: speculative RDNA3 CU diagram]
VOPD
v_dual_mul_f32 v11, 0x24681357, v2 :: v_dual_mul_f32 v10, 0x24681357, v5
// GFX11: encoding: [0xff,0x04,0xc6,0xc8,0xff,0x0a,0x0a,0x0b,0x57,0x13,0x68,0x24]
This is a sample VOPD instruction. From a high-level instruction perspective it looks like two back-to-back v_dual_mul_f32 ops, but they are executed at once.
You will notice only one opcode: 0xff,0x04,0xc6,0xc8,0xff,0x0a,0x0a,0x0b,0x57,0x13,0x68,0x24.
This single opcode will be executed by one SIMD.
The granularity of instruction issue is per SIMD unit. You can see two x32-wide vector ops indicated by the VGPRs v11, v2, v10 and v5, which are all 32 lanes wide.
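A small sketch poking at those encoding bytes; the bytes are copied from the comment above, and reading the trailing DWORD as the shared 32-bit literal is my interpretation:

# Encoding bytes copied from the "// GFX11:" comment above
enc = bytes([0xff, 0x04, 0xc6, 0xc8, 0xff, 0x0a, 0x0a, 0x0b,
             0x57, 0x13, 0x68, 0x24])
print(len(enc))                                   # 12: one 64-bit VOPD encoding + one 32-bit literal

# The trailing DWORD appears to be the shared literal, stored little-endian
print(hex(int.from_bytes(enc[8:12], "little")))   # 0x24681357, used by both halves

# So both v_dual_mul_f32 halves ride in a single encoding, sharing one literal,
# and the whole thing is issued as one instruction to one SIMD.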


NUM_SIMD_PER_CU=4
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/include/soc21_enum.h#L95

While NUM_SIMD_PER_CU=4 is not a guarantee, the VOPD instructions are guaranteed, because compiler changes are really complicated and nobody wants to plant a red herring in the code that generates a random opcode which can crash the GPU.

If only the VOPD changes are true, the gains are not going to be much; VOPD cannot issue two FMACs or two image instructions at once, for example. At best a 1.3x gain in actual perf.
Image instructions are those involving texture loads, RT ops, export ops etc.

If AMD is not jebaiting with the kernel patches, then the theoretical/peak throughput per RDNA3 CU would be 4x that of RDNA2.
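A toy model of why VOPD alone tops out well below 2x; the pairing fraction is a made-up parameter just to illustrate:

# If a fraction `paired` of vector instructions can be co-issued as VOPD pairs
# (the rest go alone: FMAC-vs-FMAC, image ops, operand conflicts, ...),
# cycles shrink by half of that fraction:
def vopd_speedup(paired: float) -> float:
    return 1.0 / (1.0 - paired / 2.0)

for paired in (0.2, 0.46, 0.6):
    print(paired, round(vopd_speedup(paired), 2))
# ~0.46 of instructions paired already corresponds to the ~1.3x "at best" above;
# even pairing 60% only gets ~1.43x, nowhere near 2x.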
 

TESKATLIPOKA

Platinum Member
May 1, 2020
If I think about it, adding a second x32 vALU shouldn't increase a CU's size by much (+10-20%?). Even with 96 CUs this chip shouldn't even be 300mm2, thanks to the separate IF chiplets and the 5nm process, unless the CU (WGP) is significantly bigger.

Not to mention, for 2x better performance:
100*1.3(clocks)*1.3(2* x32 vALU)*1.2(more CU) = 203
You don't really need 2x more IF and >1.5x bandwidth in the case of N31 in my opinion.

On the other hand, N33 has only 128-bit GDDR6 and 64 MB of IF, and that's totally not sufficient for 16 WGPs (32 CUs; 128 SIMD32 (4x SIMD per CU) + a dual x32 vALU per SIMD).
In theory, N33 would have 49 TFLOPs at 3GHz, compared to 23.7 TFLOPs for the RX 6950 XT, which has 2x the bandwidth and 2x the IF.
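A quick sketch of both calculations above; the 16 WGP / 4 SIMD per CU / dual vALU32 / 3 GHz figures are the rumored specs, not confirmed:

# "2x better performance" multiplier chain for N31
print(round(100 * 1.3 * 1.3 * 1.2))   # ~203 (+30% clocks, +30% from the 2nd vALU32, +20% CUs)

# Theoretical N33 FP32 throughput under the rumored config
wgps, cus_per_wgp = 16, 2
simds_per_cu, valus_per_simd, lanes = 4, 2, 32
clock_ghz, flops_per_fma = 3.0, 2
tflops = (wgps * cus_per_wgp * simds_per_cu * valus_per_simd
          * lanes * flops_per_fma * clock_ghz) / 1000
print(round(tflops, 1))               # ~49.2 TFLOPs vs 23.7 for the 6950 XT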

From the leaks, there is no evidence of a possible 4x higher peak throughput. The ALU count is only 2x per CU (WGP), and the TFLOPs correspond to that.
I must say, RDNA3 is still a mystery to me, and it could end up either a win or a flop.

I expect a gigantic leap in performance/W for the next-gen mobile GPUs. Maybe even Phoenix, with its supposed 60W RTX 3060 level of performance, would look weak in comparison.
 

moinmoin

Diamond Member
Jun 1, 2017
That NUM_SIMD_PER_CU=4 doesn't seem to be something that's new or specific to RDNA 3, given how many times it seems to be in the code.
Good catch, there doesn't seem to be a single instance in the repository where NUM_SIMD_PER_CU is not 0x4.
 

DisEnchantment

Golden Member
Mar 3, 2017
Good catch, there doesn't seem to be a single instance in the repository where NUM_SIMD_PER_CU is not 0x4.
It is 2 for RDNA1/2 (2x SIMD32), and 4 for GCN (4x SIMD16).

NUM_SIMD_PER_CU = 0x00000002,
If you have worked with HIP or something you can query this also via command line.
This is used by the kernel to know which CU is currently busy or which CUs were harvested by evaluating the CU mask.
It is important for AMD's compute stack

[attached image]
Seems many folks don't have a clone of the kernel locally to grep :D
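For anyone who doesn't want to clone the tree, a minimal sketch of that grep (the local path is obviously an assumption):

import re
from pathlib import Path

# Point this at your clone of the kernel tree
inc_dir = Path("linux/drivers/gpu/drm/amd/include")

pattern = re.compile(r"NUM_SIMD_PER_CU\s*=\s*(0x[0-9a-fA-F]+|\d+)")
for header in sorted(inc_dir.glob("*_enum.h")):
    for match in pattern.finditer(header.read_text(errors="ignore")):
        print(f"{header.name}: NUM_SIMD_PER_CU = {match.group(1)}")
# soc21_enum.h reports 0x00000004, while the RDNA1/2 enum headers report 0x00000002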

If I think about it, adding a second x32 vALU shouldn't increase a CU's size by much.
From Nemez's work for Navi2x
[attached image]

As expected, a chunk of a CU's die size is consumed by the L0 and LDS; an x32 VALU is tiny, only 20% of the size of a CU, around 0.4mm2, which is minuscule.
The 2x SIMD32 in an RDNA2 CU are 55% of the CU, somewhere around 1.1mm2 of die area.

Anyway, adding a second x32 VALU to a SIMD is really opportunistic, because the VGPRs and SGPRs have excess operand bandwidth in the 1x x32 VALU arrangement and can feed all the operands for a few vector ops in a 2x x32 VALU arrangement. With clever arrangement of ops in a wave, you can basically schedule instructions so that you get the most VOPD instructions out of kernel code.
Ignoring cache, adding VOPD or an additional x32 VALU would increase CU size by around 1.2x; adding 2x SIMD would increase CU die size by around 1.5x.
Touching the caches, however, is going to change the CU footprint by a lot.

For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,372
2,864
136
An x32 VALU is tiny, only 20% of the size of a CU, around 0.4mm2. [...] Ignoring cache, adding VOPD or an additional x32 VALU would increase CU size by around 1.2x. [...] For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
So adding 2x more x32 vALUs to a RDNA2 CU should increase the CU size by ~0.8 mm^2.
40% increase in CU die size resulting in let’s say 30% more performance is not bad.

1 CU = 2 mm^2
1 CU(+ second x32 vALU per SIMD) = 2.8 mm^2
For <=30% more performance, you need to increase the chip size by:
N21 -> +12.3%.
N22 -> +9.5%
N23 -> +10.8%
N24 -> +12%
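A quick sketch of those percentages; the N22 and N24 total die sizes (335mm2 and 107mm2) are my assumed figures, the rest comes from the posts above:

# ~0.4mm2 per extra x32 vALU, one per SIMD32, two SIMD32s per CU
extra_per_cu = 2 * 0.4            # ~0.8 mm^2

dies = {                          # (CU count, total die size in mm^2)
    "N21": (80, 520),
    "N22": (40, 335),             # assumed
    "N23": (32, 237),
    "N24": (16, 107),             # assumed
}
for name, (cus, area) in dies.items():
    print(name, f"+{cus * extra_per_cu / area * 100:.1f}%")
# -> roughly +12.3%, +9.6%, +10.8%, +12.0%, matching the list above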

BTW, what is MM?
MC - memory controller
IC - Infinity Cache
MM - ?
 

DisEnchantment

Golden Member
Mar 3, 2017
MultiMedia: DCN + VCN + audio block/Azalia.

So adding 2x more x32 vALUs to a RDNA2 CU should increase the CU size by ~0.8 mm^2.
I would say much, much smaller; remember, the SP32 in the diagram above has 128KB of VGPRs :).
VOPD did not change the VGPR count. Still 256 VGPR per bank as per LLVM.

4 banks * 256 VGPRs * 4 bytes (32 bits) * 32 lanes = 128KB. VGPRs are vector registers and they match the SIMD width.
These 4 banks provide 4 operands per cycle. The SGPRs provide constants, and the instruction encoding can provide immediates. So you get B, C, D, E operands from the VGPR banks, plus 2x K from the scalar GPRs and an immediate from the opcode.
Which means at best you get 2x throughput; at worst the extra VALUs sit dormant and add at most 1.2x die area, because not all operations need 3 operands, e.g. A=B*C*D.
Lots of ops are simply A=B*C, A=B+C, A=B+k*C, etc.
So you can see why VOPD is a great opportunity to increase throughput.
Keep this picture in mind when thinking about SIMD and how wavefronts and opcodes are dispatched.
[attached image]
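A minimal sketch of the register-file arithmetic above; the operand count for the v_dual_mul_f32 pair is my own bookkeeping of the earlier example:

# VGPR file size per SIMD32, as described above
banks, vgprs_per_bank, bytes_per_vgpr, lanes = 4, 256, 4, 32
print(banks * vgprs_per_bank * bytes_per_vgpr * lanes // 1024, "KB")   # 128 KB

# Per cycle: 4 VGPR reads (one per bank) + SGPR constants + an immediate.
# The v_dual_mul_f32 pair earlier needs only 2 VGPR source reads (v2 and v5);
# the other source of each half is the shared literal, so one SIMD's banks
# can comfortably feed both halves in a single cycle.
print(2 <= banks)   # True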
 

Saylick

Diamond Member
Sep 10, 2012
The 2x SIMD32 in an RDNA2 CU are 55% of the CU, somewhere around 1.1mm2 of die area. [...] For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
I was wondering, since AMD's optimized TSMC N5 scales logic at roughly 2x that of N7, they could literally pack in twice the SIMD32 without an increase in CU size. Then, assuming the remaining 45% of the CU is SRAM, which scales at only 1.35x, you get an overall reduction in CU size of 12% or so.

Maths: 55% * 2x SIMD32 / 2x logic scaling + 45% / 1.35x SRAM scaling = 88%

If an RDNA2 CU took up ~2 mm2, then an RDNA3 CU should take up around 1.8 mm2 on N5. The actual CU size might be closer to 2 mm2 again with the addition of beefier RT units.

N21 was 520mm2 with 128 MB of Infinity Cache, which I'm guessing took up ~128 mm2 of die space (similar density to Zen L3 cache). Using your number of 254 mm2 for the CUs, that implies the non-CU portion of N21 was roughly 140mm2. Assuming that stuff scales at 1.35x, on N5 it would be closer to 100 mm2. With 96 RDNA3 CUs taking up 192 mm2, adding back in the L2, CP, GE, ACE, RB, and L1 (~100 mm2) you're within the 300-400 mm2 range. It's totally possible, I think, that the GCD could be ~350mm2 which is a pretty economical use of N5 for doubling N21 performance. That's in comparison to Nvidia using >600mm2 of 4N.

Let me know if my math sucks and I am wrong somewhere in this logic train.
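Here's the same math as a quick sketch (all the scaling factors are the assumptions above):

logic_frac, sram_frac = 0.55, 0.45       # 2x SIMD32 share vs. the rest of an RDNA2 CU
logic_scale, sram_scale = 2.0, 1.35      # assumed N7 -> N5 density gains

# Double the SIMD32s, then shrink everything to N5:
rel_cu = logic_frac * 2 / logic_scale + sram_frac / sram_scale
print(round(rel_cu, 2))                  # ~0.88 -> a doubled-SIMD CU still shrinks ~12%
print(round(2.0 * rel_cu, 2))            # ~1.77 mm^2 per RDNA3 CU on N5

# GCD ballpark: 96 CUs at ~2 mm^2 (allowing for beefier RT) + ~100 mm^2 of shrunk non-CU logic
print(96 * 2.0 + 100)                    # ~292 mm^2, leaving headroom inside a ~350 mm^2 GCD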
 

TESKATLIPOKA

Platinum Member
May 1, 2020
....
N21 was 520mm2 with 128 MB of Infinity Cache, which I'm guessing took up ~128 mm2 of die space (similar density to Zen L3 cache). Using your number of 254 mm2 for the CUs, that implies the non-CU portion of N21 was roughly 140mm2. Assuming that stuff scales at 1.35x, on N5 it would be closer to 100 mm2. With 96 RDNA3 CUs taking up 192 mm2, adding back in the L2, CP, GE, ACE, RB, and L1 (~100 mm2) you're within the 300-400 mm2 range. It's totally possible, I think, that the GCD could be ~350mm2 which is a pretty economical use of N5 for doubling N21 performance. That's in comparison to Nvidia using >600mm2 of 4N.

Let me know if my math sucks and I am wrong somewhere in this logic train.
Your calculation is wrong, because 80 CUs are not 254 mm^2.
For comparison, ignoring MC/IC/MM, N21 is only 254mm2 for 80 CUs/4 SEs + L2 + CP + GE + ACE + RB + L1 etc.
80 CUs are 160mm^2 in size and that leaves 94mm^2 for the rest.
520-254=266mm^2 is for the MC, IC, MM and IO (PHY, PCIe etc.), and of that the 128MB Infinity Cache should be only ~80mm^2.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
Tweet
AMD Navi 31 ... Please note: RDNA3 GCD is without memory interface and without Infinity Cache.
GCD 350mm²+
If the memory interface and Infinity Cache really are separate in N31, then the CU (WGP) size must have increased by a lot, considering that this 350mm^2 is a 5nm chip.

edit: If we removed the memory interface and Infinity Cache from N21, then it would most likely be <=400mm^2. The N31 GCD shouldn't be much smaller.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
It would feel so wrong for the 7900 XT's biggest chiplet to be 350mm2 while the mid-range 7600 XT comes in at 400mm2.

I wonder what the total die area will be for the 7900 XT/N31 chiplet package? 1x 350mm2 GCD and 6x 50mm2 MCDs = 650mm2 total die space? Then maybe a 100mm2 I/O die, so 750mm2 total?
 

Saylick

Diamond Member
Sep 10, 2012
It would feel so wrong for the 7900 XT's biggest chiplet to be 350mm2 while the mid-range 7600 XT comes in at 400mm2.

I wonder what the total die area will be for the 7900 XT/N31 chiplet package? 1x 350mm2 GCD and 6x 50mm2 MCDs = 650mm2 total die space? Then maybe a 100mm2 I/O die, so 750mm2 total?
"Babe, I know it's only 350mm2 but trust me, I've got the better node"

I think the total die area for N31 is likely in the low 600mm2 range. Math isn't far off from yours, but ~380mm2 GCD + 6x ~40mm2 MCDs ≈ 620mm2.
 

Aapje

Golden Member
Mar 21, 2022
Even if it was, let's not forget that N31 needs 6 extra chiplets, and those aren't free either, plus the packaging cost.

The chiplet design does allow for a huge halo product and still get very good yields and low defects. Of course, the question is whether they want to have such an expensive product. Will customers be willing to pay that much for an AMD card, even if it is the fastest?
 

DisEnchantment

Golden Member
Mar 3, 2017
I don't think N33 will be anywhere close to 400mm2. Even if N33 were 350mm2, that would already imply a massive jump in architectural Xtor gains across all blocks.
[attached image]
Comparison with N21 should be easier if we are to believe it is being fabbed on same node.

XGMI is deleted in Navi3x; on N21 it was used only for the Radeon Pro W6800X Duo. Navi3x also removes some legacy features like the legacy geometry pipeline, EQAA support and a few more.
With a die size of ~355mm2, you need 2x the CU size, 1.3x the FF/ROP/Ras/RB etc., 2x the L2, 1.5x the CP and 1.2x the MM to hit that die area. Of course the IC and bus width get clipped to half of N21, as rumored.
A more conservative Xtor growth from architectural improvements would put N33 with the rumored specs only around 300mm2 - 310mm2.

Somehow I am not fully convinced this >300mm2 N33 chip will beat N21 as rumored. It is almost 40% smaller if we go conservative, or at best 35% smaller. Perhaps with a 192-bit bus, 96MB of IC and a lot more clock, but lots of resources are still halved due to half the number of SEs.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
If it was N5, I'd agree with you.
Why would N33 be ~350mm^2 only if it was on 5nm? Even N21 should be smaller than that on 5nm.

N33 is not that much different from N23, and that chip is only 237mm^2 on 7nm.
N33 is built on 6nm and should have the same number of WGPs (CUs), SEs and MCs (PHY); I am not sure about the ROPs. The WGP (CU) should be bigger.
Do you think 113mm^2 (350-237) is not enough for an additional 32MB of cache and the architectural changes in the CU (WGP), plus a 10-15% denser process?

N21 vs N22 comparison
Difference in size is 185mm^2.
In that space you have:
+2x SE
+2x WGP(CU)
+32MB IF
+64bit GDDR6
+2x ROPs
+XGMI
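A tiny sketch of both area gaps; the N22 total of 335mm2 is implied by the 185mm2 difference and N21's 520mm2, the rest is from the posts above:

# N33 (rumored ~350mm2 on 6nm) vs N23 (237mm2 on 7nm)
print(350 - 237)   # 113 mm^2 left for +32MB IF and the fatter WGPs

# N21 (520mm2) vs N22 (335mm2, implied)
print(520 - 335)   # 185 mm^2 buys +2x SE, +2x WGP(CU), +32MB IF, +64-bit GDDR6, +2x ROPs, +XGMI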
 