Discussion RDNA 5 / UDNA (CDNA Next) speculation


Magras00

Member
Aug 9, 2025
Interesting stuff. If I sum up the SRAM caches of 2x CDNA4 CUs (L1, LDS, instruction cache) I land at 512 kB (if I'm counting right). Because instructions can probably be shared between the two CUs, 448 kB would be the number.

RDNA4 already incorporates dynamic / out-of-order register allocation (as M3 does). M3 then goes further and unifies its local cache, which we might now see on CDNA5 and RDNA5. But it seems that the register files do not get merged with the LDS and L0?

Edit:
Maybe it is a 512 kB SRAM macro, with 448 kB as the L1/LDS replacement and 64 kB as a dedicated instruction cache?
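One way to arrive at those 512 kB / 448 kB figures is a quick back-of-the-envelope sum, assuming the CDNA 4 per-CU numbers quoted later in the thread (160 kB LDS + 32 kB L1 vector + a 64 kB instruction cache physically shared by a CU pair, and ignoring the 16 kB scalar cache):

```python
# Back-of-the-envelope SRAM sums for a CDNA 4 CU pair (figures from the thread).
LDS_KB = 160        # local data share per CU
L1_VEC_KB = 32      # L1 vector data cache per CU
ICACHE_KB = 64      # instruction cache, shared between two CUs

per_cu = LDS_KB + L1_VEC_KB + ICACHE_KB             # naive per-CU figure
pair_naive = 2 * per_cu                             # I$ counted once per CU
pair_shared = 2 * (LDS_KB + L1_VEC_KB) + ICACHE_KB  # one shared I$ for the pair

print(per_cu, pair_naive, pair_shared)  # 256 512 448
```

The 448 kB figure with a single shared instruction cache happens to match the CDNA 5 LDS/L0 number discussed below, which is what makes the "one 512 kB macro, 448 kB usable + 64 kB I$" reading tempting.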


AMD doesn't provide a lot of info. Had to go back to the CDNA 3 whitepaper + Chips and Cheese posts to find it all.

CDNA 4 CU:
  • LDS = 160 kB
  • L1 vector data cache = 32 kB
  • L1 scalar cache = 16 kB
  • Instruction cache (shared between two CUs) = 64 kB / 2 = 32 kB
  • Total: 240 kB
CDNA 4: 480 kB (LDS/L1/L0) per CU pair vs CDNA 5: 448 kB (LDS/L0), so -32 kB vs CDNA 4. Do note that the CDNA 4 figure includes the L1 scalar and vector caches, while Kepler's only included LDS+L0.

RDNA 4 CU:
  • LDS = 64 kB
  • L1 vector data cache = 32 kB
  • L1 scalar cache = 16 kB
  • Instruction cache (shared between two CUs) = 64 kB / 2 = 32 kB
  • Total: 144 kB
RDNA 4: 288 kB per CU pair vs RDNA 5: ?
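As a sanity check on the per-CU totals above (with the shared instruction cache counted as half of 64 kB per CU), the arithmetic works out like this:

```python
# Per-CU SRAM totals for CDNA 4 and RDNA 4, using the figures listed above (kB).
cdna4 = {"lds": 160, "l1_vec": 32, "l1_scalar": 16, "icache_half": 64 // 2}
rdna4 = {"lds": 64,  "l1_vec": 32, "l1_scalar": 16, "icache_half": 64 // 2}

totals = {name: sum(cu.values()) for name, cu in (("CDNA 4", cdna4), ("RDNA 4", rdna4))}
for name, per_cu in totals.items():
    print(f"{name}: {per_cu} kB per CU, {2 * per_cu} kB per CU pair")
# CDNA 4: 240 kB per CU, 480 kB per CU pair
# RDNA 4: 144 kB per CU, 288 kB per CU pair
```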

The NVIDIA Blackwell consumer SM has 128 kB/SM plus an L0 i-cache (size unknown). Compare that to AMD's 112 kB of combined LDS/L1 plus a 32 kB instruction cache, and AMD is suddenly very close to where NVIDIA is right now.
In addition, AMD already has a 50% larger VGPR (384 kB/CU vs 256 kB/SM) than NVIDIA's VRF, and AMD's figure likely includes scalar registers as well.
Do we know how big the shared L0+LDS (maybe also L1?) is for RDNA5? There's no way it's the same size as on CDNA5.

I was also wondering about that very large 512 kB VGPR on CDNA and whether CDNA5 makes any changes here. AMD mentioned something about a 256 kB VGPR + 256 kB AGPR mode, and I suspect this is to match NVIDIA's tensor memory. Tensor memory is the same size as their VRF (256 kB + 256 kB per SM) and it's also located next to the VRF in the SM block diagram, so it has to be a tensor-specific VRF, right?
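The register-file comparison being drawn here, as a quick sketch (all sizes as quoted in the thread; the tensor-memory figure is for the datacenter Blackwell SM):

```python
# Register-file figures as quoted in the thread (all in kB).
rdna4_vgpr_per_cu = 384      # RDNA 4 VGPR per CU, likely includes scalar registers
nv_vrf_per_sm = 256          # NVIDIA Blackwell register file per SM
print(rdna4_vgpr_per_cu / nv_vrf_per_sm)  # 1.5 -> the "50% larger" claim

cdna5_split_mode = 256 + 256  # rumored CDNA5 VGPR + AGPR split mode
nv_dc_per_sm = 256 + 256      # datacenter Blackwell VRF + tensor memory per SM
print(cdna5_split_mode, nv_dc_per_sm)  # 512 512 -> capacity parity
```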

I've already tried to get @Kepler_L2 to confirm or deny this, but he didn't reply, so I suspect it's only L0/LDS, but I have no idea. Perhaps the register file is unified, so no more fixed SGPRs and VGPRs, but no info has been disclosed on that either.

Kepler said it was shared L0/LDS and didn't mention anything about the L1 scalar and vector caches, so it's probably not L0/LDS/L1.
 

Saylick

Diamond Member
Sep 10, 2012
From C&C: https://old.chipsandcheese.com/2025/06/28/blackwell-nvidias-massive-gpu/
[attached image: memory-subsystem diagram from the linked Chips and Cheese Blackwell article]

I’d like to eventually create my own version of this figure for RDNA 5…
 

Joe NYC

Diamond Member
Jun 26, 2021
It would make a lot of sense if the different GPU dies could also stand alone as discrete low-end graphics cards... it's possible that they somehow might. UDNA = Unified DNA, but weren't we just told Medusa Point was going to use RDNA 3.5 and not UDNA? Now it's all UDNA?

It seems that all the AT3 and AT4 GPU chiplets need is a tiny IO die chiplet to be full standalone dGPUs.

The roadmap shown by MLID mostly concentrates on very late 2026 / early 2027 RDNA5 products. But there may be some transitional products in 2026 that still use RDNA 3.5; we will see.

There could, in theory, be a Strix Halo 1.5, with the same Halo IOD but with a Zen 6 CCD. It could be the easiest product to release, if the link between the Strix Halo-specific Zen 5 CCD and the Zen 6 CCD is compatible.
 

Kepler_L2

Senior member
Sep 6, 2020
To the best of my knowledge:
[attached image: cache.png]