RGT: AMD likely to focus on professionals rather than gamers. Unlikely to produce an "Nvidia killer" gaming card
> AMD likely to focus on professionals rather than gamers. Unlikely to produce an "Nvidia killer" gaming card

a) nope
Interesting stuff. If I sum up the SRAM caches of 2x CDNA4 CUs (L1, LDS, instruction cache) I land at 512kB (if I am counting right). Because instructions can probably be shared -> 448kB would be the number.
RDNA4 already incorporates dynamic / out-of-order register allocation (as M3 does). M3 then goes further and unifies its local cache, which we now might see on CDNA5 and RDNA5. But it seems that the register files do not get merged with LDS and L0?
Edit:
Maybe it is a 512kB SRAM macro, 448kB = L1/LDS replacement and 64kB = dedicated instruction cache?
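The macro hypothesis is easy to sanity-check with a quick sum. A minimal sketch, using the speculative figures from the post above (the 448kB/64kB split is the poster's guess, not a confirmed number):

```python
# Hypothesized CDNA5 per-macro SRAM split, in kB (speculative, from the thread).
shared_lds_l0_kb = 448   # unified LDS/L0 replacement (rumored)
icache_kb = 64           # dedicated instruction cache (rumored)

macro_kb = shared_lds_l0_kb + icache_kb
print(macro_kb)  # 512 -> consistent with a single 512kB SRAM macro
```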
a) nope
b) final config size is relative to what NV ships by then
> I remain skeptical of AMD's ability to sell $1k+ gaming cards, regardless of competitiveness.

If they win, they win. Simple as that.
> Chasing AI hype makes more sense.

Bubble is dead.
From C&C: https://old.chipsandcheese.com/2025/06/28/blackwell-nvidias-massive-gpu/

AMD doesn't provide a lot of info. Had to go back to the CDNA 3 whitepaper + Chips and Cheese posts to find it all.
CDNA 4 CU:
- LDS = 160kB
- L1 vector data cache = 32kB
- L1 scalar cache = 16kB
- Instruction cache (shared between two CUs) = 64kB/2 = 32kB
- Total: 240kB

CDNA 4 480kB (LDS/L1/L0) vs CDNA 5 448kB (LDS/L0), so -32kB vs CDNA 4. Do note that the CDNA 4 figure includes the L1 scalar and vector caches, while Kepler's CDNA 5 number only included LDS+L0.
RDNA 4 CU:
- LDS = 64kB
- L1 vector data cache = 32kB
- L1 scalar cache = 16kB
- Instruction cache (shared between two CUs) = 64kB/2 = 32kB
- Total: 144kB

RDNA 4 288kB vs RDNA 5 ?
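Summing the per-CU figures listed above is straightforward; a quick sketch (the instruction cache is halved because it is shared between two CUs):

```python
# Per-CU SRAM breakdown, in kB, as listed in the post.
cdna4 = {"LDS": 160, "L1_vector": 32, "L1_scalar": 16, "icache_half": 64 // 2}
rdna4 = {"LDS": 64,  "L1_vector": 32, "L1_scalar": 16, "icache_half": 64 // 2}

print(sum(cdna4.values()))  # 240
print(sum(rdna4.values()))  # 144
```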
NVIDIA's Blackwell consumer SM is 128KB/SM + L0 i-cache (size unknown). Compare that to AMD's 112KB shared L1 + 32KB instruction cache and AMD is suddenly very close to where NVIDIA is right now.
In addition, AMD already has a 50% larger VGPR file (384kB/CU vs 256kB/SM) than NVIDIA's VRF, which likely includes scalar registers as well.
Do we know how big the shared L0+LDS (maybe also L1?) is for RDNA5? There's no way it's the same size as CDNA5.
Was also wondering about that very large 512kB VGPR for CDNA and whether CDNA5 makes any changes here. AMD mentioned something about a 256kB VGPR + 256kB AGPR mode, and I suspect this is to match NVIDIA's Tensor Memory. Tensor Memory is the same size as their VRF (256KB + 256KB per SM) and it's also located next to the VRF in the SM block diagram, so it has to be a Tensor-specific VRF, right?
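The size math behind that guess, as a minimal sketch (the 256kB+256kB VGPR/AGPR split is the rumored mode from the thread; whether AGPRs actually mirror NVIDIA's Tensor Memory is speculation):

```python
# CDNA register-file split hypothesis, in kB per CU/SM (rumored figures).
cdna_total_vgpr = 512                # CDNA's large unified VGPR file
split_mode = (256, 256)              # rumored VGPR + AGPR split mode
nvidia_vrf, nvidia_tmem = 256, 256   # NVIDIA SM: register file + Tensor Memory

assert sum(split_mode) == cdna_total_vgpr      # the split preserves the total
print(split_mode == (nvidia_vrf, nvidia_tmem))  # True -> the sizes line up
```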
I've already tried to get @Kepler_L2 to confirm or deny this but got no reply, so I suspect it's only L0/LDS, but I have no idea. Perhaps the register file is unified so there are no more fixed SGPRs and VGPRs, but no info was disclosed on that either.
Kepler said it was shared L0/LDS and didn't mention anything about the L1 scalar and vector caches, so it's probably not L0/LDS/L1.
> The Chiphell guy pretty much concurs with everything said so far: 4 dies, 2027 launch, very specifically no AT1.

Great, AMD going the Nvidia approach and making an unobtainium halo, and then the next SKU down will be a fraction of the size.
> Great, AMD going the Nvidia approach and making an unobtainium halo, and then the next SKU down will be a fraction of the size.

Oh, that's baby stuff.
> Oh, that's baby stuff.

Well, if it competes with a $2000 Nvidia card, aren't they going to price it around that range?
They can build something far, far more expensive.
It would make a lot of sense if the different GPU dies can also stand alone as discrete low-end graphics cards... it's possible that they somehow might. UDNA = Unified DNA, but weren't we just told Medusa Point was going to use RDNA 3.5 and not UDNA? Now it's all UDNA?
> There could, in theory, be Strix Halo 1.5, with the same Halo IOD but with Zen 6 CCD

No.
> "No, it's not compatible" or "No, AMD is not likely going to release such a product"?

Means the same thing results-wise.
> "No, it's not compatible" or "No, AMD is not likely going to release such a product"?

No means no.
> AMD doesn't provide a lot of info. Had to go back to the CDNA 3 whitepaper + Chips and Cheese posts to find it all. [...]

To the best of my knowledge:
View attachment 129324
> To the best of my knowledge:
> View attachment 129324

Inb4 wccftech yoinks this and makes a 4-paragraph article on it.
> To the best of my knowledge:
> View attachment 129324

Does this look right? I'm sure I have some mistakes, so let me know where I should make adjustments. I can also add rows for # of execution units as well, etc.
> I think the M in 'GMD' stands for 'memory'.
> I.e. all the DDR shoreline is there.

Then each of those UMC blocks in @Kepler_L2's diagrams is 2×16-bit LPDDR5 and not just 16-bit? Otherwise it'll be bandwidth-starved.
> In his latest video about leaking the PS6 handheld, MLID claims the RDNA5 CUs are not RDNA4 WGP-equivalent.

Are there any leaked arch docs/drivers which indicate which way AMD is going with RDNA 5 on WGP vs CU?
> Are there any leaked arch docs/drivers which indicate which way AMD is going with RDNA 5 on WGP vs CU?

Not directly, but you can read between the lines.
> That is for a 128 bit bus. 256bit with 2GB chips supports 16GB as standard or 32GB with clamshell.

Then there's the anomaly Navi33. I know it's N6, but that can't explain +39.6% MTr/mm². 251mm² for N10 and 237mm² for N23: that is not much of a saving.
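The capacity claim quoted above follows from chip count. A minimal sketch, assuming standard 32-bit GDDR devices at 2GB each (clamshell mode doubles the chips per channel):

```python
# VRAM capacity for a given memory bus width, assuming 32-bit chips of 2GB each.
def vram_gb(bus_bits, chip_gb=2, chip_bits=32, clamshell=False):
    chips = bus_bits // chip_bits          # one chip per 32-bit channel
    return chips * chip_gb * (2 if clamshell else 1)

print(vram_gb(256))                   # 16 GB standard
print(vram_gb(256, clamshell=True))   # 32 GB clamshell
print(vram_gb(128))                   # 8 GB standard
```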
> I think the M in 'GMD' stands for 'memory'.
> I.e. all the DDR shoreline is there.

Yep, see below.
GMD = Graphics Memory Die
> Does this look right? I'm sure I have some mistakes, so let me know where I should make adjustments. I can also add rows for # of execution units as well, etc.
> View attachment 129328

RDNA1-4 has no WGP/CU-level caches higher than L0. L1 is located in the Shader Arrays (two per SE). CDNA kept some GCN baggage.
> Then each of those UMC blocks in @Kepler_L2's diagrams is 2×16-bit LPDDR5 and not just 16-bit? Otherwise it'll be bandwidth-starved.
> @Kepler_L2 is CDNA5 a clean-slate µarch like RDNA5?

Maybe CDNA 6? MI500!?