Discussion RDNA 5 / UDNA (CDNA Next) speculation


Magras00

Member
Aug 9, 2025
Interesting stuff. If I sum up the SRAM caches of 2x CDNA4 CUs (L1, LDS, instruction cache) I land at 512kB (if I'm counting right). Since the instruction cache can probably be shared, 448kB would be the number.

RDNA4 already incorporates dynamic / out-of-order register allocation (as M3 does). M3 then goes further and unifies its local cache, which we might now see on CDNA5 and RDNA5. But it seems that the register files do not get merged with the LDS and L0?

Edit:
Maybe it is a 512kB SRAM macro: 448kB = L1/LDS replacement and 64kB = dedicated instruction cache?


AMD doesn't provide a lot of info. Had to go back to the CDNA 3 whitepaper + Chips and Cheese posts to find it all.

CDNA 4 CU:
  • LDS = 160kB
  • L1 vector data cache = 32kB
  • L1 scalar = 16kB
  • Instruction cache (shared with one other CU) = 64kB/2 = 32kB
  • Total: 240KB
CDNA 4: 480kB (LDS/L1/L0) vs CDNA 5: 448kB (LDS/L0), so -32kB vs CDNA 4. Do note that the CDNA 4 figure includes the L1 scalar and vector caches while Kepler's only included LDS+L0.

RDNA 4 CU
  • LDS = 64kB
  • L1 vector data cache = 32kB
  • L1 scalar = 16kB
  • Instruction cache (shared with one other CU) = 64kB/2 = 32kB
  • Total: 144KB
Wrong info, sorry for any confusion. There's no L1 in the WGP, only in the Shader Array. The scalar and instruction caches are shared between the two CUs; the L0 vector caches are per-CU.

RDNA 4 = LDS (128KB) + L0 (2 x 32kB) = 192KB
vs
CDNA5 = LDS/L0 = 448kB


RDNA 4: 288KB (2 CUs) vs RDNA 5: ?
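The sums above are easy to sanity-check. A quick sketch using only the figures listed in this post (CDNA 3 whitepaper / Chips and Cheese numbers, so treat them with the same caution):

```python
# Cache sizes in kB, as listed in the post.
cdna4_cu = {
    "LDS": 160,
    "L1 vector data": 32,
    "L1 scalar": 16,
    "I-cache (64kB shared by 2 CUs)": 64 // 2,
}
rdna4_wgp = {
    "LDS (shared by 2 CUs)": 128,
    "L0 vector (2 x 32kB)": 2 * 32,
}

cdna4_per_cu = sum(cdna4_cu.values())   # 240 kB per CU
cdna4_two_cus = 2 * cdna4_per_cu        # 480 kB, vs the 448 kB quoted for CDNA 5
rdna4_lds_l0 = sum(rdna4_wgp.values())  # 192 kB LDS+L0 per WGP (Kepler's corrected figure)

print(cdna4_per_cu, cdna4_two_cus, rdna4_lds_l0)  # 240 480 192
```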

NVIDIA's consumer Blackwell SM is 128KB/SM plus an L0 i-cache (size unknown). Compare that to 112KB of shared L1 + 32KB instruction cache and AMD is suddenly very close to where NVIDIA is right now. Again wrong. AMD's CU/WGP SRAM system is very complicated compared to NVIDIA's. AMD has six structures (instruction cache, scalar and vector data caches, LDS, VGPR and SGPR) whereas NVIDIA only has three (L1, L0-i, VRF). We simply can't make an apples-to-apples comparison.

In addition, AMD already has a 50% larger VGPR (384kB/CU vs 256kB/SM) than NVIDIA's VRF, which likely includes scalar registers as well.
Do we know how big the shared L0+LDS (maybe also L1?) is for RDNA5? There's no way it's the same size as CDNA5.

I was also wondering about that very large 512kB VGPR for CDNA and whether CDNA5 makes any changes here. AMD mentioned something about a 256kB VGPR + 256kB AGPR mode, and I suspect this is to match NVIDIA's Tensor Memory. Tensor Memory is the same size as their VRF (256KB + 256KB per SM) and it's also located next to the VRF in the SM block diagram, so it has to be a Tensor-specific VRF, right? Only logical explanation. NVIDIA has Tensor Memory and so does AMD.
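The size coincidence being argued here is just arithmetic (figures as quoted in the post, not confirmed specs):

```python
# Per-CU / per-SM register storage in kB, as quoted in the thread.
amd_split_mode = 256 + 256     # CDNA 256kB VGPR + 256kB AGPR mode
nv_vrf_plus_tmem = 256 + 256   # Blackwell DC: 256kB VRF + 256kB Tensor Memory per SM
print(amd_split_mode, nv_vrf_plus_tmem)  # 512 512
```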

I've already tried to get @Kepler_L2 to confirm or deny this but got no reply, so I suspect it's only L0/LDS, but I have no idea. Perhaps the register file is unified, so no more fixed SGPRs and VGPRs, but no info has been disclosed on that either. We simply don't know yet; too early to say for certain. GFX13 (RDNA 5) and GFX12.5 (CDNA 5) are wildly different on so many levels, and the release cadence doesn't align either.

Kepler said it was shared L0/LDS and didn't mention anything about the L1 scalar and vector caches, so it's probably not L0/LDS/L1. Ignore.

Edit: Some info here was very misleading, so I had to retract some of it and add extra info in italics. Read @Kepler_L2's reply if you want to know more. His description of the RDNA4 WGP caches is accurate. Here's the 9070XT WGP from TPU's 9070XT review: https://www.techpowerup.com/review/sapphire-radeon-rx-9070-xt-nitro/2.html

[Attachment: 9070XT WGP diagram]
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
[Quoting Magras00's CDNA 4 / RDNA 4 cache breakdown, in full above]
From C&C: https://old.chipsandcheese.com/2025/06/28/blackwell-nvidias-massive-gpu/
[Attachment: Chips and Cheese Blackwell SM diagram]

I’d like to eventually create my own version of this figure for RDNA 5…
 

Joe NYC

Diamond Member
Jun 26, 2021
It would make a lot of sense if the different GPU dies can also stand alone as discrete low-end graphics cards... it's possible that they somehow might. UDNA = Unified DNA, but weren't we just told Medusa Point was going to use RDNA 3.5 and not UDNA? Now it's all UDNA?

It seems that all the AT3 and AT4 GPU chiplets need is a tiny IO die chiplet to be full standalone dGPUs.

The roadmap shown by MLID mostly concentrates on very late 2026 / early 2027 RDNA5 products. But there may be some transitional products in 2026 that still use RDNA 3.5; we will see.

There could, in theory, be a Strix Halo 1.5 with the same Halo IOD but a Zen 6 CCD. It could be the easiest product to release, if the link on the Strix Halo-specific Zen 5 CCD is compatible with the Zen 6 CCD.
 

Kepler_L2

Senior member
Sep 6, 2020
[Quoting Magras00's CDNA 4 / RDNA 4 cache breakdown, in full above]
To the best of my knowledge:
[Attachment: cache.png]
 

Kepler_L2

Senior member
Sep 6, 2020
Are there any leaked arch docs/drivers which indicate which way AMD is going with RDNA 5 on wgp vs cu ?
Not directly but you can read between the lines
[Attachments: two screenshots of driver source code]
There's a "supportsWGP" flag where you would expect an "isGFX1250Plus" or "has gfx1250 instructions" condition rather than a direct check of the GPU generation (gfx* instruction flags carry over from one generation to the next, i.e. gfx1250 has the gfx9/10/11/12 instructions).
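To illustrate the distinction being drawn here, a minimal sketch of the two gating styles (all names and data below are hypothetical, not actual AMD/LLVM driver code): a direct generation check versus a capability flag that individual targets opt into.

```python
# Style 1: gate on the GPU generation directly. Every new generation
# implicitly inherits the behavior once its version number is high enough.
def wgp_via_gen_check(gfx_version: tuple) -> bool:
    return gfx_version >= (12, 5, 0)  # an "isGFX1250Plus"-style condition

# Style 2: gate on a per-target capability flag. Instruction-set flags carry
# over generation to generation, but a feature like WGP mode must be opted
# into explicitly -- so its presence (or absence) on a target is informative.
FEATURES = {
    "gfx1200": {"gfx12-insts"},
    "gfx1250": {"gfx12-insts", "gfx1250-insts", "supportsWGP"},
}

def wgp_via_feature_flag(target: str) -> bool:
    return "supportsWGP" in FEATURES.get(target, set())
```

The point is that a dedicated `supportsWGP` flag leaves room for a future target that sets the gfx1250 instruction flags but *not* the WGP flag, which is harder to express with a bare generation check.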
 

Magras00

Member
Aug 9, 2025
Saw my post on the RDNA4 cache hierarchy getting a lot of attention. Well, it's wrong, so please look at Kepler's reply. I botched the WGP/CU caches; TPU's 9070XT review came to the rescue.

That is for a 128-bit bus. 256-bit with 2GB chips supports 16GB as standard or 32GB with clamshell.



251mm² for N10. 237mm² for N23. It is not that much of a saving.
Then there's the anomaly, Navi 33. I know it's N6, but that can't explain +39.6% MTr/mm².

I think the M in 'GMD' stands for 'memory'.
I.e. all the DDR shoreline is there.
Yep. See below
GMD = Graphics Memory Die

Does this look right? I'm sure I have some mistakes, so let me know where I should make adjustments. I can also add rows for # of execution units as well, etc.
View attachment 129328
RDNA1-4 has no WGP/CU-level caches higher than L0. The L1 is located in the Shader Arrays (two per SE). CDNA kept some GCN baggage.
Maybe just rename it to "Instruction cache"?

I know it's a placeholder, but there's no way RDNA5 uses the same LDS/L0 size as CDNA5. Look at NVIDIA Blackwell: the VRF (if you include Tensor Memory) and the L1 doubled for the DC parts. But I wonder what AMD will do with the other WGP data stores. GFX13 > GFX12.5, but it really depends on CDNA5. If it's a continuation of CDNA4, I wouldn't be surprised if the implementation on the RDNA5 side is completely different.

In the UDNA interview from almost a year ago, Jack Huynh didn't state that the goal of UDNA was to have the exact same implementation on DC and Client. But he did say it's about having one design team and (now I'm injecting interpretation) unifying the foundational design (cache hierarchy and GPU core design) like NVIDIA has done since forever, so optimizations apply to ALL markets and SKUs.

@Kepler_L2 is CDNA5 a clean slate µarch like RDNA5?

Then each of those UMC blocks in @Kepler_L2's diagrams is 2×16-bit LPDDR5 and not just 16-bit? Otherwise it'll be bandwidth starved.

Sure, but it's LPDDR5X/LPDDR6, not LPDDR5, in 16/32-bit or 24/48-bit mode. I'd wager AT4 and AT3 are one design, like Navi 44 and 48 (see AMD's RDNA4 Hot Chips presentation), so if AT3 has LPDDR6 support then so does AT4.
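For anyone wanting to check the bandwidth-starvation worry, the per-channel math is simple. A sketch (the 8533 MT/s LPDDR5X rate here is an illustrative assumption, not a confirmed AT3/AT4 spec):

```python
# Peak bandwidth in GB/s for a bus of `bus_bits` width at `mts` mega-transfers/s.
def bandwidth_gbs(bus_bits: int, mts: int) -> float:
    return bus_bits / 8 * mts / 1000

single_16 = bandwidth_gbs(16, 8533)      # one 16-bit channel: ~17.1 GB/s
double_16 = bandwidth_gbs(2 * 16, 8533)  # a 2x16-bit UMC block: ~34.1 GB/s
print(round(single_16, 1), round(double_16, 1))
```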
 

Magras00

Member
Aug 9, 2025
In case anyone doubts Kepler's LDS/L0 claim for CDNA5: AMD has been working on this for years. Here's the unified flexible cache patent from late 2022: https://www.patents-review.com/a/20240220409-unified-flexible-cache.html

"As described herein, a unified flexible cache can be a large cache structure that can replace various smaller cache structures, which can simplify design and fabrication and improve yield during manufacturing. In addition, the unified flex cache can be used for various types of caches, such as various levels of processor and/or accelerator caches, and other cache structures for managing a cache hierarchy, such as a probe filter. Because the flex cache can be partitioned into various sized partitions, the cache types are not restricted to a particular size (e.g., limited by the physical structure). Thus, the flex cache can be reconfigured to provide more efficient cache utilization based on system needs."

It applies to high-level shared caches between CPU, GPU, NPU etc., but with some changes it could be adapted to CU-level caches. The implementation in the patent goes well beyond LDS/L0 and even beyond Apple's M3-and-later implementation. It sounds like it would be possible to dynamically change cache hierarchy size ratios based on workload needs, but perhaps I'm misunderstanding something.
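As a toy sketch of the patent's core idea: one physical SRAM pool carved into resizable, role-tagged partitions, reconfigurable per workload. All names, the 32kB granule, and the partition layouts are invented for illustration; the patent itself is far more general.

```python
# Toy model of a "unified flexible cache": a fixed SRAM macro whose capacity
# can be repartitioned between roles (LDS, L0, I-cache, ...) at configure time.
class FlexCache:
    def __init__(self, total_kb: int, granule_kb: int = 32):
        self.total_kb = total_kb
        self.granule_kb = granule_kb
        self.partitions: dict[str, int] = {}

    def configure(self, **sizes_kb: int) -> None:
        # Partitions must be granule-aligned and fit in the physical array.
        if any(size % self.granule_kb for size in sizes_kb.values()):
            raise ValueError("partition sizes must be granule-aligned")
        if sum(sizes_kb.values()) > self.total_kb:
            raise ValueError("partitions exceed physical SRAM")
        self.partitions = dict(sizes_kb)

# The 512kB macro split speculated earlier in the thread:
flex = FlexCache(512)
flex.configure(lds_l0=448, icache=64)
# ...and a hypothetical reconfiguration for a different workload:
flex.configure(lds_l0=384, icache=64, scratch=64)
```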

So maybe there's a slim chance RDNA5 goes all the way like M3, but if not, AMD can't ignore this moving forward. With SRAM area no longer scaling, it's too precious to be wasted on fixed stores.