Discussion RDNA 5 / UDNA (CDNA Next) speculation


Josh128

Golden Member
Oct 14, 2022
1,499
2,249
106

First I hear about this, sounds promising for the MI450 / Helios rack platform, as that's what these are going to be. 50,000 × ~1 kW = another ~50 MW of AMD AI infrastructure before rack overhead. The AI saga continues.
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Detailed reporting and analysis moved to declutter thread. Available here:

Workgroup clusters + globally accessible LDS
Work graphs can morph recursively by splitting and merging nodes. These nodes also share data with each other, so maybe one could assume this new inter-CU fabric is beneficial here as well?
It should avoid excessive load on the L2 and give much lower fetch latencies. After all, it's a dataflow architecture with nodes freely communicating with each other; doing that directly instead of going through the L2 would be a massive deal.
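As a toy illustration of why direct node-to-node transfers would matter (all cycle counts below are invented placeholders, not AMD figures):

```python
# Toy latency model for work-graph node-to-node data handoff.
# All cycle counts are hypothetical placeholders, not AMD figures.

L2_ROUND_TRIP = 150   # assumed: producer writes to L2, consumer reads back
FABRIC_HOP    = 30    # assumed: direct LDS-to-LDS hop over an inter-CU fabric

def handoff_cycles(num_edges: int, via_l2: bool) -> int:
    """Total cycles spent moving payloads across num_edges graph edges."""
    per_edge = L2_ROUND_TRIP if via_l2 else FABRIC_HOP
    return num_edges * per_edge

edges = 10_000  # producer->consumer edges in one work-graph dispatch
print(handoff_cycles(edges, via_l2=True))   # through L2: 1500000 cycles
print(handoff_cycles(edges, via_l2=False))  # direct fabric: 300000 cycles
```

Even with made-up numbers, the point is structural: the cost scales with edge count, so cutting the per-edge round trip compounds quickly in a dense dataflow graph.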

Global LDS patent
Think I found a patent related to the interconnect in Neural Arrays.

I don't think the L1 is coming back if they take it this far in RDNA 5. Completely redundant. @adroc_thurston you were right about L1 in RDNA 5. I think RDNA 4 removing the L1 was a first step.

What about going one step further (as you mentioned "crackpot solution"):
Shared L1 caches across one SE: https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf
- L1 in this context would be the new unified L0/LDS of gfx1250+

DNN execution got accelerated by 2...3x
I think that's the most interesting HW paper I've ever read. Is it by NVIDIA? No, it's written by AMD researchers in 2020. The description in the patent aligns closely with your shared-L1 paper.

I don't think it has to be SE-private; it can be global like in the paper and the patent I linked. Not sure if this is practical, and it directly conflicts with Huynh, so it might just be per shader array. This shared-L1 scheme makes the L2 less important, and then it probably only needs to hold prefetched data. That last part is inaccurate. More likely, as @basix said, "...SE to SE transfers, memory side caching and probably global command processor accesses (ACEs as well?) will use L2$." I don't think ACEs (the ADC replaces these), but work items, "work stealing" inter-SE load balancing, and also memory-side caching and prefetching, sure.
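As a back-of-envelope sketch of why sharing an L1 helps (toy model; the share fraction and cache sizes are invented for illustration, not taken from the paper):

```python
# Toy model: unique cacheable footprint of N private L0s vs one shared L1.
# s = fraction of each CU's cached data that is shared (hence replicated)
# across all CUs. Sizes and s are illustrative assumptions only.

def unique_capacity_private(n_cu: int, kb_per_cu: float, s: float) -> float:
    # Shared lines are duplicated in every private cache, so they only
    # contribute once to the unique footprint.
    return s * kb_per_cu + n_cu * (1 - s) * kb_per_cu

def unique_capacity_shared(n_cu: int, kb_per_cu: float) -> float:
    # A shared L1 of the same aggregate size stores each line exactly once.
    return n_cu * kb_per_cu

n, c, s = 16, 32, 0.5          # 16 CUs, 32 kB each, half the working set shared
print(unique_capacity_private(n, c, s))  # 272.0 kB of unique data
print(unique_capacity_shared(n, c))      # 512.0 kB of unique data
```

The more the working set overlaps between CUs, the more a shared cache behaves like a much bigger one, which is exactly why the L2 would matter less.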
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Forget about weird coprocessor models.
Yeah + @basix's shared L1 makes this redundant, but only if RDNA 5 has it.

Did somebody else already post this new AMD publication?
Nope.

Short summary for those who didn't read the paper.
- Neural Light Sampling uses a Neural Visibility Cache to guide ReSTIR light sampling. Superior to standard ReSTIR, but Neural DI is more interesting.
- Neural Direct Illumination: the MLP adds only 22% overhead vs ReSTIR. It ran on RDNA 3's joke of a WMMA shader-core implementation, so maybe less overhead on RDNA 4 and 5, or even a shift (ML faster than RT). Very close to the ReSTIR ground truth, and no casting of shadow rays after the initial training.
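For intuition: a neural visibility cache is just a small MLP regressing a visibility function, so that after training, cheap network queries stand in for shadow rays. A minimal numpy sketch (the toy 2D scene, network sizes, and training setup are my assumptions, not the paper's):

```python
# Minimal "neural visibility cache" sketch: a tiny MLP learns a binary
# visibility function so queries can replace shadow rays after training.
# Scene, architecture, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def visibility(p):
    # Toy ground truth: a square occluder shadows points in [0.4, 0.6]^2.
    return (~((p[:, 0] > 0.4) & (p[:, 0] < 0.6) &
              (p[:, 1] > 0.4) & (p[:, 1] < 0.6))).astype(np.float64)

# 2 -> 32 -> 1 MLP with tanh hidden layer and sigmoid output, plain SGD.
W1 = rng.normal(0, 1.0, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 1.0, (32, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

X = rng.random((4096, 2))
y = visibility(X)[:, None]
lr = 0.5
losses = []
for _ in range(500):
    h, p = forward(X)
    losses.append(float(np.mean((p - y) ** 2)))
    # Backprop of MSE loss through sigmoid and tanh.
    dp = 2 * (p - y) / len(X)
    dz2 = dp * p * (1 - p)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

# Loss should fall as the cache learns the scene; afterwards forward(query)
# approximates visibility without tracing a shadow ray.
print(losses[0] > losses[-1])
```

This is orders of magnitude smaller than a production setup, but it shows the mechanism: train once against traced ground truth, then amortize via inference.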

Ray tracing is dead. Long live neural rendering!

GPUs are pretty damn good at running GEMM and are really bad at doing RT.
Looks like RDNA 5 might address those weaknesses, shifting the bottleneck from cache/memory to logic.

No, it's simply that some people don't really see all the noise and blur and praise TAA and its derivatives despite the fact that they destroy image clarity, small details and make your head spin with the ghosting in motion.
Ever tried DLSS 4? When it launched, even r/f**kTAA praised it as a TAA deblur filter. For non-TAA noise problems (Lumen, Nanite...), maybe some variant of an MLP is a future solution.
 

basix

Senior member
Oct 4, 2024
276
549
96
I don't think it has to be SE-private; it can be global like in the paper and the patent I linked. Not sure if this is practical, and it directly conflicts with Huynh, so it might just be per shader array. This shared-L1 scheme makes the L2 less important, and then it probably only needs to hold prefetched data.
Private per SE makes sense, because of all the high bandwidth connections you need to have ("wiring nightmare" as Huynh described it).
And at some point further distances mean higher latencies and higher energy draw, which undermines efficiency.
And then the complexity of managing all those memory accesses.

You could theoretically share across a full chip. But as soon as you are thinking about disaggregated GPU chiplets - having that on e.g. MI300/MI400 with stacked XCDs - keeping it local to just one SE makes more sense to me.
But as I said, 12 * 448kB is already >5 MByte of fast local cache. This is a huge increase compared to the 128kB + 32kB + 512kB of e.g. RDNA3's LDS, L0 and L1. And the L1$ gets shared across 16 CUs.

The L2$ is still important. SE to SE transfers, memory-side caching and probably global command processor accesses (ACEs as well?) will use the L2$.
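The capacity math works out as follows (toy arithmetic using the figures quoted in this post; these are rumored numbers, not confirmed specs):

```python
# Toy arithmetic with the figures from this post (rumored, not confirmed):
# 12 SEs x 448 kB unified L0/LDS vs RDNA3's 128 kB LDS + 32 kB L0 + 512 kB L1.
new_total_kb = 12 * 448
rdna3_kb = 128 + 32 + 512
print(new_total_kb, "kB")          # 5376 kB
print(new_total_kb / 1024, "MB")   # 5.25 MB, i.e. >5 MByte of fast local cache
print(new_total_kb / rdna3_kb)     # 8.0x the quoted RDNA3 figure
```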
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Private per SE makes sense, because of all the high bandwidth connections you need to have ("wiring nightmare" as Huynh described it).
And at some point further distances mean higher latencies and higher energy draw, which undermines efficiency.
And then the complexity of managing all those memory accesses.

You could theoretically share across a full chip. But as soon as you are thinking about disaggregated GPU chiplets - having that on e.g. MI300/MI400 with stacked XCDs - keeping it local to just one SE makes more sense to me.
But as I said, 12 * 448kB is already >5 MByte of fast local cache. This is a huge increase compared to the 128kB + 32kB + 512kB of e.g. RDNA3's LDS, L0 and L1. And the L1$ gets shared across 16 CUs.

The L2$ is still important. SE to SE transfers, memory-side caching and probably global command processor accesses (ACEs as well?) will use the L2$.
From the 2020 paper: "Overall, the total power under DynEB is similar to baseline, with <1% reduction averaged across all evaluated applications"
It doesn't matter. Also, there isn't higher wiring density, because the reach of the mesh has no influence on wiring density; it just increases latency, as described in the uniform cache system patent. In the simulations the GPU also seemed to have no issues with the increased latency. IPC just kept increasing with more CUs.
Yes, but the crossbar NoC seemed to do a pretty good job and didn't require a big area investment.

Sure, that makes sense. Maybe laying the groundwork for a disaggregated GPU, perhaps in GFX14.

448kB is only CDNA 5; I think Kepler said RDNA 5 has 320kB. But as I showed previously, things could get weird and Apple M3-esque. Apple has this, so why shouldn't AMD when they're doing a complete clean slate anyway? No matter what, much bigger tiles.

Yes, I forgot about those additional things. SE-to-SE is only for "work stealing", but that's short-lived, so probably not a big footprint. Memory-side caching = prefetching data? Yeah, the DMA gets the data, then the CP distributes work items to the SEs. HWS replaced by a simple "work stealing" load-balancing circuit. Pretty sure ACEs are replaced by ADCs in each shader engine. The geometry pipeline might also be SE-local.

Delegating as much work to individual SEs as possible. Repeating myself, but this is clearly tailored to chiplets.
 

basix

Senior member
Oct 4, 2024
276
549
96
Delegating as much work to individual SEs as possible. Repeating myself, but this is clearly tailored to chiplets.
If AMD really did pursue the path of chiplet GPUs in the past, having a few remnants of that in RDNA5 makes sense.
And to be honest, they already have chiplet GPUs (the MI300 series), which should benefit directly from big, shared caches.
And as you pointed out, there is even some super-linear scaling with more cache in the case of bigger GPUs.
MI450 maybe allows sharing data between the XCDs in order to reap those benefits.

Giving SEs more autonomy generally helps in two ways:
  • More localized cache accesses and therefore better bandwidth efficiency. Crossing chiplet boundaries will cost something.
  • Better scaling to more SEs. If you can distribute work in a lightweight manner and the SE does the heavy lifting, more SEs should scale better due to localized cache accesses and scheduling - as long as there is enough work to do. This applies to both graphics and ML / HPC accelerators.
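A minimal sketch of what such lightweight distribution with "work stealing" could look like (queue model, stealing policy, and names are illustrative assumptions, not AMD's design):

```python
# Toy "work stealing" load balancer across shader-engine queues: each SE
# drains its own queue and steals from the fullest peer when it runs dry.
# Purely illustrative; the policy is an assumption, not AMD's design.
from collections import deque

def run(queues):
    """Process all work items; return how many items each SE completed."""
    done = [0] * len(queues)
    while any(queues):
        for se, q in enumerate(queues):
            if q:
                q.popleft()                      # execute one local work item
                done[se] += 1
            else:
                victim = max(range(len(queues)), key=lambda i: len(queues[i]))
                if queues[victim]:               # steal from the fullest peer
                    q.append(queues[victim].pop())
    return done

# Badly imbalanced initial distribution across 4 SEs: 40 / 0 / 8 / 0 items.
counts = run([deque(range(40)), deque(), deque(range(8)), deque()])
print(counts, sum(counts))  # all 48 items complete, spread across the SEs
```

The scheduler only hands out coarse batches up front; balancing happens locally between SEs, which is the "lightweight distribution" property described above.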
 

marees

Platinum Member
Apr 28, 2024
2,002
2,629
96
If AMD really did pursue the path of chiplet GPUs in the past, having a few remnants of that in RDNA5 makes sense.
And to be honest, they already have chiplet GPUs (the MI300 series), which should benefit directly from big, shared caches.
And as you pointed out, there is even some super-linear scaling with more cache in the case of bigger GPUs.
MI450 maybe allows sharing data between the XCDs in order to reap those benefits.

Giving SEs more autonomy generally helps in two ways:
  • More localized cache accesses and therefore better bandwidth efficiency. Crossing chiplet boundaries will cost something.
  • Better scaling to more SEs. If you can distribute work in a lightweight manner and the SE does the heavy lifting, more SEs should scale better due to localized cache accesses and scheduling - as long as there is enough work to do. This applies to both graphics and ML / HPC accelerators.
Why do you talk as if the chiplet GPU is a dead idea?

What if AMD can resuscitate it for RDNA 6 or 7?
 

marees

Platinum Member
Apr 28, 2024
2,002
2,629
96
because it is (for gaming GPUs, definitely)
What is the technological issue here? Seems like AMD solved the software problem 🤔

The only hardware problem I can see is transferring data at scale across chiplets.
 

basix

Senior member
Oct 4, 2024
276
549
96
Why do you talk as if the chiplet GPU is a dead idea?

What if AMD can resuscitate it for RDNA 6 or 7?
Dead for RDNA5. Lives on with CDNA5. Might get resurrected for RDNA6+.

RDNA5 might bring most of the ingredients for a chiplet gaming GPU. But obviously it still seems to be monolithic (neglecting the MID for now).
That does not mean the work was useless or not serious enough - a monolithic GPU should also profit from all the changes.

I still like the idea of chiplet based gaming GPUs. And it would be cool if AMD can pull that off in the future.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,684
2,572
136
They're win-more devices for people with more money than sanity.
I'd argue that what makes the biggest difference for whether chiplet GPUs make sense is silicon defect density. There have been serious concerns that as we scale to smaller lithography, defect density grows until large chips become unmanufacturable. If you expect that, splitting your design into many smaller chiplets that you can test separately can make manufacturing a lot more economical.

But TSMC keeps knocking it out of the park with defect density, meaning that there is little demand for splitting up the GPU.
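The defect-density argument above can be sketched with the classic Poisson yield model (the D0 value and die area here are illustrative, not TSMC data):

```python
# Poisson yield sketch: one big die vs the same total area split into
# chiplets that can be tested and binned separately.
# D0 and die area are hypothetical, not TSMC figures.
import math

def die_yield(area_mm2: float, d0_per_mm2: float) -> float:
    """Classic Poisson model: fraction of defect-free dies, Y = exp(-A * D0)."""
    return math.exp(-area_mm2 * d0_per_mm2)

total_area = 600.0   # hypothetical big GPU, in mm^2
d0 = 0.002           # hypothetical defects per mm^2
for n_chiplets in (1, 4, 8):
    y = die_yield(total_area / n_chiplets, d0)
    print(f"{n_chiplets} chiplet(s): {y:.1%} of dies good")
```

The yield gap between one 600 mm² die and eight 75 mm² chiplets is large at high D0 and shrinks toward zero as D0 falls, which is why low TSMC defect density weakens the economic case for splitting the GPU.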