Discussion RDNA 5 / UDNA (CDNA Next) speculation


Josh128

Golden Member
Oct 14, 2022
1,499
2,249
106

First I hear about this, sounds promising for the MI450 / Helios rack platform, as that's what these are going to be. 50,000 × ~1 kW = another ~50 MW of AMD AI infrastructure before rack overhead. The AI saga continues.
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Detailed reporting and analysis moved to declutter thread. Available here:

Workgroup clusters + globally accessible LDS
Work graphs can morph recursively by splitting and merging nodes. These nodes also share data with each other, so maybe one could assume this new inter-CU fabric is beneficial here as well?
It should avoid excessive load on the L2 and give much lower fetch latencies. After all, it's a dataflow architecture with nodes freely communicating with each other; doing that directly instead of going through the L2 would be a massive deal.
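As a toy illustration of why direct node-to-node transfers would matter (all cycle counts below are invented placeholders, not AMD figures):

```python
# Toy latency model for work-graph node-to-node data handoff.
# All cycle counts are hypothetical placeholders, not AMD figures.

L2_ROUND_TRIP = 150   # assumed: producer writes to L2, consumer reads back
FABRIC_HOP    = 30    # assumed: direct LDS-to-LDS hop over an inter-CU fabric

def handoff_cycles(num_edges: int, via_l2: bool) -> int:
    """Total cycles spent moving payloads across num_edges graph edges."""
    per_edge = L2_ROUND_TRIP if via_l2 else FABRIC_HOP
    return num_edges * per_edge

edges = 10_000  # producer->consumer edges in one work-graph dispatch
print(handoff_cycles(edges, via_l2=True))   # through L2: 1500000 cycles
print(handoff_cycles(edges, via_l2=False))  # direct fabric: 300000 cycles
```

Even with made-up numbers, the point is structural: the cost scales with edge count, so cutting the per-edge round trip compounds quickly in a dense dataflow graph.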

Global LDS patent
Think I found a patent related to the interconnect in Neural Arrays.

I don't think the L1 is coming back if they take it this far in RDNA 5. Completely redundant. @adroc_thurston you were right about L1 in RDNA 5. I think RDNA 4 removing the L1 was a first step.

What about going one step further (as you mentioned "crackpot solution"):
Shared L1 caches across one SE: https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf
- L1 in this context would be the new unified L0/LDS of gfx1250+

DNN execution got accelerated by 2...3x
I think that's the most interesting HW paper I've ever read. Is it by NVIDIA? No, it's written by AMD researchers in 2020. The description in the patent aligns closely with your shared-L1 paper.

I don't think it has to be SE-private; it can be global like in the paper and the patent I linked. Not sure if this is practical, and it directly conflicts with Huynh, so it might just be per shader array. This shared-L1 scheme makes the L2 less important, and then it probably only needs to hold prefetched data. That last part is inaccurate. More likely, as @basix said, "...SE to SE transfers, memory side caching and probably global command processor accesses (ACEs as well?) will use L2$." I don't think ACEs (the ADC replaces these), but work items, "work stealing" inter-SE load balancing, and also memory-side caching and prefetching, sure.
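As a back-of-envelope sketch of why sharing an L1 helps (toy model; the share fraction and cache sizes are invented for illustration, not taken from the paper):

```python
# Toy model: unique cacheable footprint of N private L0s vs one shared L1.
# s = fraction of each CU's cached data that is shared (hence replicated)
# across all CUs. Sizes and s are illustrative assumptions only.

def unique_capacity_private(n_cu: int, kb_per_cu: float, s: float) -> float:
    # Shared lines are duplicated in every private cache, so they only
    # contribute once to the unique footprint.
    return s * kb_per_cu + n_cu * (1 - s) * kb_per_cu

def unique_capacity_shared(n_cu: int, kb_per_cu: float) -> float:
    # A shared L1 of the same aggregate size stores each line exactly once.
    return n_cu * kb_per_cu

n, c, s = 16, 32, 0.5          # 16 CUs, 32 kB each, half the working set shared
print(unique_capacity_private(n, c, s))  # 272.0 kB of unique data
print(unique_capacity_shared(n, c))      # 512.0 kB of unique data
```

The more the working set overlaps between CUs, the more a shared cache behaves like a much bigger one, which is exactly why the L2 would matter less.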
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Forget about weird coprocessor models.
Yeah + @basix's shared L1 makes this redundant, but only if RDNA 5 has it.

Did somebody else already post this new AMD publication?
Nope.

Short summary for those who didn't read the paper.
- Neural Light Sampling uses a Neural Visibility Cache to guide ReSTIR light sampling. Superior to standard ReSTIR, but Neural DI is more interesting.
- Neural Direct Illumination: the MLP adds only 22% overhead vs ReSTIR. It ran on RDNA 3's joke of a WMMA shader-core implementation, so maybe less overhead on RDNA 4 and 5, or even a shift (ML faster than RT). Very close to the ReSTIR ground truth, and no casting of shadow rays after the initial training.
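For intuition: a neural visibility cache is just a small MLP regressing a visibility function, so that after training, cheap network queries stand in for shadow rays. A minimal numpy sketch (the toy 2D scene, network sizes, and training setup are my assumptions, not the paper's):

```python
# Minimal "neural visibility cache" sketch: a tiny MLP learns a binary
# visibility function so queries can replace shadow rays after training.
# Scene, architecture, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def visibility(p):
    # Toy ground truth: a square occluder shadows points in [0.4, 0.6]^2.
    return (~((p[:, 0] > 0.4) & (p[:, 0] < 0.6) &
              (p[:, 1] > 0.4) & (p[:, 1] < 0.6))).astype(np.float64)

# 2 -> 32 -> 1 MLP with tanh hidden layer and sigmoid output, plain SGD.
W1 = rng.normal(0, 1.0, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 1.0, (32, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

X = rng.random((4096, 2))
y = visibility(X)[:, None]
lr = 0.5
losses = []
for _ in range(500):
    h, p = forward(X)
    losses.append(float(np.mean((p - y) ** 2)))
    # Backprop of MSE loss through sigmoid and tanh.
    dp = 2 * (p - y) / len(X)
    dz2 = dp * p * (1 - p)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

# Loss should fall as the cache learns the scene; afterwards forward(query)
# approximates visibility without tracing a shadow ray.
print(losses[0] > losses[-1])
```

This is orders of magnitude smaller than a production setup, but it shows the mechanism: train once against traced ground truth, then amortize via inference.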

Ray tracing is dead. Long live neural rendering!

GPUs are pretty damn good at running GEMM and are really bad at doing RT.
Looks like RDNA 5 might address those weaknesses, shifting the bottleneck from cache/memory to logic.

No, it's simply that some people don't really see all the noise and blur and praise TAA and its derivatives despite the fact that they destroy image clarity, small details and make your head spin with the ghosting in motion.
Ever tried DLSS 4? When it launched, even r/f**kTAA praised it as a TAA deblur filter. For non-TAA noise problems (Lumen, Nanite...), maybe some variant of an MLP is a future solution.
 

basix

Senior member
Oct 4, 2024
276
549
96
I don't think it has to be SE-private; it can be global like in the paper and the patent I linked. Not sure if this is practical, and it directly conflicts with Huynh, so it might just be per shader array. This shared-L1 scheme makes the L2 less important, and then it probably only needs to hold prefetched data.
Private per SE makes sense, because of all the high bandwidth connections you need to have ("wiring nightmare" as Huynh described it).
And at some point further distances mean higher latencies and higher energy draw, which undermines efficiency.
And then the complexity of managing all those memory accesses.

You could theoretically share across a full chip. But as soon as you are thinking about disaggregated GPU chiplets - having that on e.g. MI300/MI400 with stacked XCDs - keeping it local to just one SE makes more sense to me.
But as I said, 12 * 448kB is already >5 MByte of fast local cache. This is a huge increase compared to the 128kB + 32kB + 512kB of e.g. RDNA3's LDS, L0 and L1. And the L1$ gets shared across 16 CUs.

The L2$ is still important. SE to SE transfers, memory-side caching and probably global command processor accesses (ACEs as well?) will use the L2$.
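The capacity math works out as follows (toy arithmetic using the figures quoted in this post; these are rumored numbers, not confirmed specs):

```python
# Toy arithmetic with the figures from this post (rumored, not confirmed):
# 12 SEs x 448 kB unified L0/LDS vs RDNA3's 128 kB LDS + 32 kB L0 + 512 kB L1.
new_total_kb = 12 * 448
rdna3_kb = 128 + 32 + 512
print(new_total_kb, "kB")          # 5376 kB
print(new_total_kb / 1024, "MB")   # 5.25 MB, i.e. >5 MByte of fast local cache
print(new_total_kb / rdna3_kb)     # 8.0x the quoted RDNA3 figure
```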
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Private per SE makes sense, because of all the high bandwidth connections you need to have ("wiring nightmare" as Huynh described it).
And at some point further distances mean higher latencies and higher energy draw, which undermines efficiency.
And then the complexity of managing all those memory accesses.

You could theoretically share across a full chip. But as soon as you are thinking about disaggregated GPU chiplets - having that on e.g. MI300/MI400 with stacked XCDs - keeping it local to just one SE makes more sense to me.
But as I said, 12 * 448kB is already >5 MByte of fast local cache. This is a huge increase compared to the 128kB + 32kB + 512kB of e.g. RDNA3's LDS, L0 and L1. And the L1$ gets shared across 16 CUs.

The L2$ is still important. SE to SE transfers, memory-side caching and probably global command processor accesses (ACEs as well?) will use the L2$.
From the 2020 paper: "Overall, the total power under DynEB is similar to baseline, with <1% reduction averaged across all evaluated applications"
It doesn't matter. Also, there isn't higher wiring density, because the reach of the mesh has no influence on wiring density; it just increases latency, as described in the uniform cache system patent. In the simulations the GPU also seemed to have no issues with the increased latency. IPC just kept increasing with more CUs.
Yes, but the crossbar NoC seemed to do a pretty good job and didn't require a big area investment.

Sure, that makes sense. Maybe laying the groundwork for a disaggregated GPU, perhaps in GFX14.

448kB is only CDNA 5; I think Kepler said RDNA 5 has 320kB. But as I showed previously, things could get weird and Apple M3-esque. Apple has this, so why shouldn't AMD when they're doing a complete clean slate anyway? No matter what, much bigger tiles.

Yes, I forgot about those additional things. SE-to-SE is only for "work stealing", but that's short-lived, so probably not a big footprint. Memory-side caching = prefetching data? Yeah, the DMA gets the data, then the CP distributes work items to the SEs. HWS replaced by a simple "work stealing" load-balancing circuit. Pretty sure ACEs are replaced by ADCs in each shader engine. The geometry pipeline might also be SE-local.

Delegating as much work to individual SEs as possible. Repeating myself, but this is clearly tailored to chiplets.
 

basix

Senior member
Oct 4, 2024
276
549
96
Delegating as much work to individual SEs as possible. Repeating myself, but this is clearly tailored to chiplets.
If AMD really did pursue the path of chiplet GPUs in the past, having a few remnants of that in RDNA5 makes sense.
And to be honest, they already have chiplet GPUs (the MI300 series), which should benefit directly from big, shared caches.
And as you pointed out, there is even some super-linear scaling with more cache in the case of bigger GPUs.
MI450 maybe allows sharing data between the XCDs in order to reap those benefits.

Giving SEs more autonomy generally helps in two ways:
  • More localized cache accesses and therefore better bandwidth efficiency. Crossing chiplet boundaries will cost something.
  • Better scaling to more SEs. If you can distribute work in a lightweight manner and the SE does the heavy lifting, more SEs should scale better due to localized cache accesses and scheduling - as long as there is enough work to do. This applies to both graphics and ML / HPC accelerators.
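A minimal sketch of what such lightweight distribution with "work stealing" could look like (queue model, stealing policy, and names are illustrative assumptions, not AMD's design):

```python
# Toy "work stealing" load balancer across shader-engine queues: each SE
# drains its own queue and steals from the fullest peer when it runs dry.
# Purely illustrative; the policy is an assumption, not AMD's design.
from collections import deque

def run(queues):
    """Process all work items; return how many items each SE completed."""
    done = [0] * len(queues)
    while any(queues):
        for se, q in enumerate(queues):
            if q:
                q.popleft()                      # execute one local work item
                done[se] += 1
            else:
                victim = max(range(len(queues)), key=lambda i: len(queues[i]))
                if queues[victim]:               # steal from the fullest peer
                    q.append(queues[victim].pop())
    return done

# Badly imbalanced initial distribution across 4 SEs: 40 / 0 / 8 / 0 items.
counts = run([deque(range(40)), deque(), deque(range(8)), deque()])
print(counts, sum(counts))  # all 48 items complete, spread across the SEs
```

The scheduler only hands out coarse batches up front; balancing happens locally between SEs, which is the "lightweight distribution" property described above.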
 

marees

Platinum Member
Apr 28, 2024
2,002
2,629
96
If AMD really did pursue the path of chiplet GPUs in the past, having a few remnants of that in RDNA5 makes sense.
And to be honest, they already have chiplet GPUs (the MI300 series), which should benefit directly from big, shared caches.
And as you pointed out, there is even some super-linear scaling with more cache in the case of bigger GPUs.
MI450 maybe allows sharing data between the XCDs in order to reap those benefits.

Giving SEs more autonomy generally helps in two ways:
  • More localized cache accesses and therefore better bandwidth efficiency. Crossing chiplet boundaries will cost something.
  • Better scaling to more SEs. If you can distribute work in a lightweight manner and the SE does the heavy lifting, more SEs should scale better due to localized cache accesses and scheduling - as long as there is enough work to do. This applies to both graphics and ML / HPC accelerators.
Why do you talk as if the chiplet GPU is a dead idea?

What if AMD can resuscitate it for RDNA 6 or 7?
 

marees

Platinum Member
Apr 28, 2024
2,002
2,629
96
because it is (for gaming GPUs, definitely)
What is the technological issue here? Seems like AMD solved the software problem 🤔

The only hardware problem I can see is transferring data at scale across chiplets.
 

basix

Senior member
Oct 4, 2024
276
549
96
Why do you talk as if the chiplet GPU is a dead idea?

What if AMD can resuscitate it for RDNA 6 or 7?
Dead for RDNA5. Lives on with CDNA5. Might get resurrected for RDNA6+.

RDNA5 might bring most of the ingredients for a chiplet gaming GPU. But obviously it still seems to be monolithic (neglecting the MID for now).
That does not mean the work was useless or not serious enough - a monolithic GPU should also profit from all the changes.

I still like the idea of chiplet based gaming GPUs. And it would be cool if AMD can pull that off in the future.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,684
2,572
136
They're win-more devices for people with more money than sanity.
I'd argue that what makes the biggest difference for whether chiplet GPUs make sense is silicon defect density. There have been serious concerns that as we scale to smaller lithography, defect density grows until large chips become unmanufacturable. If you expect that, splitting your design into many smaller chiplets that you can test separately can make manufacturing a lot more economical.

But TSMC keeps knocking it out of the park with defect density, meaning that there is little demand for splitting up the GPU.
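The defect-density argument above can be sketched with the classic Poisson yield model (the D0 value and die area here are illustrative, not TSMC data):

```python
# Poisson yield sketch: one big die vs the same total area split into
# chiplets that can be tested and binned separately.
# D0 and die area are hypothetical, not TSMC figures.
import math

def die_yield(area_mm2: float, d0_per_mm2: float) -> float:
    """Classic Poisson model: fraction of defect-free dies, Y = exp(-A * D0)."""
    return math.exp(-area_mm2 * d0_per_mm2)

total_area = 600.0   # hypothetical big GPU, in mm^2
d0 = 0.002           # hypothetical defects per mm^2
for n_chiplets in (1, 4, 8):
    y = die_yield(total_area / n_chiplets, d0)
    print(f"{n_chiplets} chiplet(s): {y:.1%} of dies good")
```

The yield gap between one 600 mm² die and eight 75 mm² chiplets is large at high D0 and shrinks toward zero as D0 falls, which is why low TSMC defect density weakens the economic case for splitting the GPU.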