Discussion RDNA 5 / UDNA (CDNA Next) speculation


MrMPFR

Member
Aug 9, 2025
139
278
96
And as you pointed out, there is even some super-linear scaling with more cache in the case of bigger GPUs.
MI450 may allow sharing data between the XCDs in order to reap those benefits.
Do we know if that extends to the MALL too? In the crackpot paper that was only the case for the shared global L1.

Sure, if it makes sense for the workload I can't see why not. The IPC of shared-L1-friendly workloads only kept increasing with more cores, and the extra latency isn't important.
Maybe it could permit enhanced "work stealing", where the CP moves work from SE 1 to SE 2 while bypassing the L2 entirely?

For all workloads it should maintain the benefits of SE autonomy while also being able to exhaust local caches across the entire GPU.

Maybe the cable management nightmare claim is related to the scheduling and dispatch part (work group clusters) instead of the globally accessible LDS?

Inter-XCD data fabric in AID sounds neat.

Giving SEs more autonomy generally helps in two ways:
Your points pretty much mirror the likely GFX13 scheduling patents 1:1.

Tangent about NNs and FSR
Another reason to think AMD uses the crackpot LDS scheme is that NNs have very high replication ratios (duplicated data). For a single-CU tile evaluator that's not an issue with WGP takeover, but with many CUs the effective shared capacity (MB) gets bogged down by duplicated data.
A shared global L1 lets the CNN tile size grow linearly with the number of CUs in a Neural Array; without it, good luck trying to run that on the HW. It's essential for enabling high-performance large-tile CNNs, since the shared-L1 speedup responds directly to the replication ratio.

ViTs are GEMM-heavy (lower replication ratio), so in this respect they will benefit less from a shared L1. But with the crossbar NoC, P-GEMM actually saw the highest speedup (~2.5X). It'll be very interesting to see what effect Neural Arrays have here. CNN = tunnel vision, ViT = panorama vision. Data access beyond one SE matters for this stage, so it would be a huge missed opportunity if CUs have to submit data requests through the L2.
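To put rough numbers on the replication-ratio point, here's a toy capacity model (every figure below is my own illustrative assumption, not an RDNA 5 spec): with private L1s every duplicated line occupies space in several CUs at once, while a shared L1 keeps one copy for the whole cluster.

```python
# Toy model: effective unique-data capacity of private vs shared L1 caches.
# All numbers (CU count, L1 size, replication ratios) are illustrative
# assumptions, not RDNA 5 specs.

def effective_capacity_kib(cus, l1_kib_per_cu, replication_ratio, shared):
    """Unique data that fits across the L1s of `cus` compute units.

    replication_ratio: average number of CUs holding a copy of the same line
    under a private-cache scheme (1.0 = no duplication).
    shared: if True, one copy serves every CU in the cluster.
    """
    total = cus * l1_kib_per_cu
    return total if shared else total / replication_ratio

for rep in (1.0, 2.0, 4.0, 8.0):  # CNN-style tile evaluators sit at the high end
    priv = effective_capacity_kib(12, 32, rep, shared=False)
    shar = effective_capacity_kib(12, 32, rep, shared=True)
    print(f"replication {rep:>4}: private {priv:6.0f} KiB vs shared {shar:6.0f} KiB")
```

In this toy model a replication ratio of 4 already wastes three quarters of the aggregate private capacity, which is why the tile-evaluator CNN case is the obvious winner from sharing.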
 
  • Like
Reactions: basix

MrMPFR

Member
Aug 9, 2025
139
278
96
Thanks for the interesting AMD neural rendering paper link and info snippet @soresu.

I'm trying to guesstimate RDNA 4 perf for these new neural rendering techniques vs ReSTIR, i.e. GATE vs hash grid for the MLPs. I know RDNA 3 (7900 XT in the NVC paper) vs RDNA 4 (9070 XT paper) isn't apples to apples, but if anything the ML perf gain on RDNA 4 likely exceeds the RT gain, so that should only tilt things further in neural rendering's favour vs RDNA 3.

A snippet from that GATE paper dealing with more efficient ways to encode neural data in 3D scenes...
9070XT - GATE speedup:
NAO: 2.5-2.93X
NRC: 1.7-2.02X

Did somebody else already post this new AMD publication?
Roughly equating techniques by speed:
NLS = NRC
NAO = Neural DI
Ignoring NLS for now since it's a mixed pipeline like NRC.

Hash-Grid
ReSTIR = 3.61 ms
NLS = 3.95 ms
Neural DI = 4.41 ms

GATE estimate = 4.41 ms / (2.5-2.93) = 1.51-1.76 ms
GATE
ReSTIR = 3.61 ms
NLS = ?
Neural DI = 1.51-1.76 ms

From ~22% overhead to a 51-58% saving, or from significantly slower to over twice as fast.
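For anyone who wants to sanity-check the arithmetic, here it is as a few lines of Python. The frame times and the 2.5-2.93X GATE range are the figures quoted above; carrying the NAO speedup over to Neural DI is my own assumption.

```python
# Back-of-envelope for the guesstimate above. The frame times and the
# 2.5-2.93X GATE speedup range are the figures quoted in this post; applying
# the NAO speedup to Neural DI is an assumption.

restir_ms    = 3.61
neural_di_ms = 4.41          # hash-grid encoding
gate_speedup = (2.5, 2.93)   # measured range for NAO, assumed to transfer

print(f"hash-grid overhead vs ReSTIR: {(neural_di_ms / restir_ms - 1) * 100:.0f}%")
for s in gate_speedup:
    gated_ms = neural_di_ms / s
    saving = (1 - gated_ms / restir_ms) * 100
    print(f"with {s}x GATE speedup: {gated_ms:.2f} ms ({saving:.0f}% faster than ReSTIR)")
```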

Even if this is off by a mile, GATE is still a gamechanger for neural rendering, and I don't think RDNA 5 will disrupt this neural vs PT situation. Especially not if Neural Arrays and a shared global L1 can deliver a large increase on top of the other ML gains.

When neural rendering looks better than ReSTIR AND runs much faster, why even bother implementing PT? At this rate bleeding-edge PT in new releases will be dead before the PS6 even launches. This paper can even do caustics, but it's not quite there.

Either I'm wrong or last week's Project Amethyst video isn't even the tip of the iceberg of what's to come on the SW and HW side.
 
  • Like
Reactions: Tlh97 and Elfear

maddie

Diamond Member
Jul 18, 2010
5,178
5,576
136
I'd argue that what makes the biggest difference for whether chiplet GPUs make sense or not is silicon defect density. There have been serious concerns that as we scale to smaller litho, defect density grows until large chips become unmanufacturable. If you expect that, splitting your design into many smaller chiplets that you can test separately makes manufacturing a lot more economical.

But TSMC keeps knocking it out of the park with defect density, meaning that there is little demand for splitting up the GPU.
Or, to scale past the exposure field (reticle) size limit, which high-NA EUV reduces even further.
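For reference, the defect-density side of this is usually argued with a simple Poisson yield model, Y = exp(-D*A). A minimal sketch with made-up numbers (the defect density and die areas below are illustrative, not foundry data):

```python
from math import exp

# Poisson yield model, Y = exp(-D * A). Defect density and die areas are
# illustrative assumptions, not foundry data.

def yield_rate(defects_per_cm2, area_mm2):
    return exp(-defects_per_cm2 * area_mm2 / 100.0)  # mm^2 -> cm^2

D = 0.1  # defects/cm^2, a commonly quoted ballpark for a mature node
for area in (150, 300, 600):
    mono = yield_rate(D, area)
    # Same total silicon split into 4 chiplets, each tested separately:
    chiplet = yield_rate(D, area / 4)
    print(f"{area} mm^2: monolithic {mono:.1%} yield, "
          f"4x {area/4:.0f} mm^2 chiplets {chiplet:.1%} each")
```

The per-die yield gap widens quickly with area, which is the whole case for chiplets; as long as D stays low, the monolithic penalty stays tolerable.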
 

basix

Senior member
Oct 4, 2024
276
549
96
Tangent about NNs and FSR
Another reason to think AMD uses the crackpot LDS scheme is that NNs have very high replication ratios (duplicated data). For a single-CU tile evaluator that's not an issue with WGP takeover, but with many CUs the effective shared capacity (MB) gets bogged down by duplicated data.
A shared global L1 lets the CNN tile size grow linearly with the number of CUs in a Neural Array; without it, good luck trying to run that on the HW. It's essential for enabling high-performance large-tile CNNs, since the shared-L1 speedup responds directly to the replication ratio.
That is exactly the reason why I also expect that "Neural Arrays" is some sort of cache / data sharing across a shader engine. Huynh described it like that as well. Cerny underpinned it with tile / screen portion sizes.
And as DNN execution gets more and more prevalent (gaming and especially HPC) it makes 100% sense to adapt your GPU architecture in that direction.
That trend is not new. DLSS2 was released in 2020. Various other forms of DNN-based gaming acceleration were actively researched by many. V100 and A100 were good sellers even before ChatGPT in 2023.

Some applications might not like the added latency, but most should be fine and DNN will benefit exceptionally well. Something like DynEB would be a technique to mitigate the downsides while still maintaining the majority of the upsides.

Even if this is off by a mile, GATE is still a gamechanger for neural rendering, and I don't think RDNA 5 will disrupt this neural vs PT situation. Especially not if Neural Arrays and a shared global L1 can deliver a large increase on top of the other ML gains.

When neural rendering looks better than ReSTIR AND runs much faster, why even bother implementing PT? At this rate bleeding-edge PT in new releases will be dead before the PS6 even launches. This paper can even do caustics, but it's not quite there.
Under the hood, Neural DI is still path tracing. But you can augment PT with DNNs. Most practical near/mid-term neural rendering techniques deal with augmenting existing base algorithms rather than fully DNN-generated content.

The nice thing about a shared/global L1$ is that RT/PT should benefit from it as well. You could potentially fit the whole BVH into your SE and not need to fetch data from L2/L3.
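As a rough scale check on that idea (the CU count, per-CU L1 size and node sizes below are assumptions, not RDNA 5 figures), a pooled SE L1 holds on the order of thousands to tens of thousands of BVH nodes, i.e. the hot upper levels or a treelet working set rather than an entire scene BVH:

```python
# Rough footprint check for the "BVH in the SE's shared L1" idea.
# CU count, per-CU L1 size and node sizes are assumptions for illustration;
# real BVH layouts (and DGF compression) vary a lot.

cus_per_se    = 12
l1_kib_per_cu = 32
node_bytes    = {"fat BVH node": 64, "compressed / DGF-style node": 16}

shared_l1_bytes = cus_per_se * l1_kib_per_cu * 1024
for name, size in node_bytes.items():
    print(f"{name}: ~{shared_l1_bytes // size:,} nodes fit in "
          f"{shared_l1_bytes // 1024} KiB of pooled L1")
```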
 
  • Like
Reactions: MrMPFR

adroc_thurston

Diamond Member
Jul 2, 2023
7,782
10,486
106
That is exactly the reason why I also expect that "Neural Arrays" is some sort of cache / data sharing across a shader engine. Huynh described it like that as well. Cerny underpinned it with tile / screen portion sizes.
it's just dsmem. boring.
 
  • Like
Reactions: Kepler_L2

soresu

Diamond Member
Dec 19, 2014
4,179
3,649
136
This paper can even do caustics, but it's not quite there.
There is a ReSTIR variant paper that can do caustics at playable frame rates too...


 

MrMPFR

Member
Aug 9, 2025
139
278
96
And as DNN execution gets more and more prevalent (gaming and especially HPC) it makes 100% sense to adapt your GPU architecture in that direction.
Yeah. The trends + the 5+ year old simulation results = it would be dumb not to implement it.

DNN will benefit exceptionally well
Indeed, but it's not just DNNs; really anything dominated by GEMM benefits, including TFs.

Correcting myself: the MLP speedup does not depend on model size, but on the replication ratio and the math used. So SN = MLP perf is wrong, and I have no idea where MLPs land between transformers and DNNs in speedup.

Under the hood, Neural DI is still path tracing. But you can augment PT with DNNs. Most practical near/mid-term neural rendering techniques deal with augmenting existing base algorithms rather than fully DNN-generated content.
Yes, you're right. The input for training is indeed PT, but I think it's greatly diminished, perhaps more so than NRC. No details unfortunately. I think that's the case with NAO as well, so the GATE guesstimate is probably still within reason.

I don't think DNNs are used at all. CNN + ViT or pure ViT for global image processing (SR, FG, RR), and MLPs for the PT illumination pipeline.

MLP + PT-lite input, it seems.

The nice thing about a shared/global L1$ is that RT/PT should benefit from it as well. You could potentially fit the whole BVH into your SE and not need to fetch data from L2/L3.
C-Ray didn't like the shared L1, but maybe a hybrid scheme would work: a local private L0 partition for RT core data (DGF nodes, BVH treelets etc.), plus the rest of the BVH shared across the SE.
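Purely to illustrate the hybrid idea (nothing like this is confirmed anywhere; the slab size, partition size and CU count are made up), the split could be expressed as pinning a private slice per CU for RT working data and pooling the remainder across the SE:

```python
# Hypothetical hybrid partition of a per-CU L0/LDS slab: a private slice
# pinned for RT data (DGF nodes, BVH treelets), the rest pooled across the SE.
# All sizes are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class SlabPartition:
    slab_kib: int = 160        # assumed unified L0/LDS slab per CU
    private_rt_kib: int = 32   # assumed pinned per-CU slice for RT traversal data

    def pooled_kib(self, cus_in_se: int = 12) -> int:
        # Capacity left over and made visible to every CU in the shader engine.
        return (self.slab_kib - self.private_rt_kib) * cus_in_se

p = SlabPartition()
print(f"{p.private_rt_kib} KiB private per CU, {p.pooled_kib()} KiB shared across the SE")
```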

it's just dsmem. boring.
Kepler liked that, so presumably no ML-centric shared L1 cache redesign in RDNA 5. No 2-4X DNN and 2X+ GEMM speedup. What a shame.

There is a ReSTIR variant
That's pretty cool. I just mentioned it to illustrate that no illumination effect is off the table. Almost everything could be replaced by MLPs, and AMD is pioneering here. From complacent to gobbling up Intel and NVIDIA researchers and pushing the envelope for neural rendering. Who would have thought.
 

basix

Senior member
Oct 4, 2024
276
549
96
Cadence showed LPDDR6/LPDDR5X dual-use PHY & Controller, design ready for TSMC N3P:
Design-in ready IP, including HBM4 and LPDDR6/5x on TSMC N3P, enables next-generation AI infrastructure
I assume AMD uses something like that for the RDNA5 and MI400 series: an LPDDR6/LPDDR5X memory interface for RDNA5 and potentially a custom N3P HBM4 base die.
 
  • Like
Reactions: Tlh97 and MrMPFR

techjunkie123

Member
May 1, 2024
157
340
96
Do all chip design companies outsource their memory interfaces? Just curious (from someone outside the industry) why AMD doesn't do these in house.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,683
2,571
136
Do all chip design companies outsource their memory interfaces? Just curious (from someone outside the industry) why AMD doesn't do these in house.
They are unlikely to get any competitive advantage without spending much more on R&D than it costs to license the IP. Why would they do it in house?
 
  • Wow
Reactions: techjunkie123

MrMPFR

Member
Aug 9, 2025
139
278
96
You do understand that ML kernels are usually shmem/TMEM-bound?
Indeed, that's the case here in their follow-up paper (Fig. 1): https://adwaitjog.github.io/docs/pdf/decoupledl1-hpca21.pdf

For P-GEMM there was virtually zero impact from a 16X larger L1 with the private scheme. Only the DNNs seem like a real outlier.

Also, in the old paper a larger L1 = less benefit from the global shared L1. The caches used are not at all comparable to RDNA, or RDNA 5 for that matter. There are really too many unknowns to draw firm conclusions, other than that DNNs like a shared L1.

L1 in this context would be the new unified L0/LDS of gfx1250+
Not sure it's quite the same. LLVM still talks about LDS + L0 vector memory. Also, in the paper only the L1 data, texture and instruction caches are shared, not the scratchpad. PSSR also used VGPRs, not L0+LDS.

Far too many unknowns and I'm done guessing. It's too confusing.
 
  • Like
Reactions: Elfear

Mopetar

Diamond Member
Jan 31, 2011
8,510
7,766
136
*attack the post and not the poster*
*wheeze*
*attack the post and not the poster*
*wheeze*

🤣

So . . . few . . . words?

But TSMC keeps knocking it out of the park with defect density, meaning that there is little demand for splitting up the GPU.

Defect density doesn't matter as much when there's so much redundancy baked in. Defects are effectively random, but are they a problem for the PHY parts of the chip in the same way that they disable logic or cache?

If the memory controllers aren't as susceptible to defects as other regions of the chip then there's not much that can bring down a huge chip these days. It's like watching a pack of wolves trying to take down a buffalo. It's just too massive despite the numerous wolves.

With the market as hot as it is, why make a new, more complicated design when all you need is two or three good dies to pay for the entire wafer? The rest is all gravy.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,683
2,571
136
Defect density doesn't matter as much when there's so much redundancy baked in. Defects are effectively random, but are they a problem for the PHY parts of the chip in the same way that they disable logic or cache?
Not every kind of defect can be harvested. If a transistor doesn't work, okay, just don't use the structure it's in and sell the product as a lower tier one. But if there is a direct short from power to ground, when you apply power to the chip it's going to melt. The ratio of defects related to wiring compared to transistors used to be very small, but has grown rapidly in the last few shrinks.

Would dsmem also help graphics workloads?
It does not help with traditional graphics workloads. It can be nifty for the kind of workloads that do graphics as compute, which a lot of people are experimenting with.
 
  • Like
Reactions: MrMPFR

basix

Senior member
Oct 4, 2024
276
549
96
Indeed, that's the case here in their follow-up paper (Fig. 1): https://adwaitjog.github.io/docs/pdf/decoupledl1-hpca21.pdf
I finally had the time to read the paper. Thanks for sharing :)

It seems to me that the solutions of the 2020 and 2021 papers can be combined. The clustered and coupled L1$ (2021) should help the "DynEB" (2020) implementation: the dynamic sharing should improve because the caches are coupled relatively closely. That can already be seen in chapter 5.5 of the 2020 paper ("Crossbar-based Shared L1 Cache Design"), where they reuse the already existing work-distribution crossbar to get additional speed-ups.

The 2021 paper mainly aims to reduce power and area of the Network-on-Chip (together with better cache and bandwidth utilization) whereas the 2020 paper mainly aimed for better utilization (which automatically yields higher efficiency).

The clustered design of the 2021 paper again shows the tendency that clustered sharing across at most a single shader engine yields the best results. You could now augment it with DynEB and selectively add a "quasi-private" path through the crossbar / via a separate path (reduced latency), and/or not reduce the peak bandwidth (e.g. an 8x8 crossbar instead of an 8x4 crossbar; this adds a little area and power, but according to the paper not much). P-GEMM benefited very much from the 2020 paper's solution but not from the 2021 one, so there are certainly some tradeoffs, but maybe combining both yields the best of both worlds. Some of the P-GEMM differences probably result from the different base interconnect (2020 = 6x6 mesh, 2021 = 80x32 crossbar) and L2$ sizes (2020 = 1 MB, 2021 = 4 MB). P-GEMM benefited a lot from a 16x L1$ in the 2020 paper but not in the 2021 one, although the L1 cache sizes are the same.

Something between C5 and C10 (16 or 8 CUs per cluster) seemed to yield the best results regarding performance, chip area and power efficiency. The rumored 12 CUs per RDNA5 SE seem to be the sweet spot. The last few not-optimally-performing applications can be covered with bigger L1 caches overall (where e.g. C1, i.e. 80 CUs per cluster with one huge shared cache, yielded the best results), DynEB (reduced latencies) and increased peak bandwidth (a wider crossbar).

The bigger L2$ of RDNA5 would for sure help as well, and its interconnect could be used as a secondary / hierarchical and truly chip-global cache sharing network: not writing back to L2$ but sharing directly between SE L1 cache clusters. I'm not sure the L2 interconnect & bandwidth pressure will allow that (the 2021 paper shows some form of that as well), but it would reduce data duplication between L1$ and L2$.

You have many knobs to tune here, but I think such a shared L1$ topology has many benefits. It is a bummer that no gaming workloads were tested in these two papers. In the end it would not matter if power and area stayed the same with all the additions and compromises, as long as average application performance gets boosted by a good amount. But I assume AMD would aim for a balanced approach, where power and area get reduced for all applications while gaining performance for most applications. Gaming would benefit from reduced power draw, ML/AI from increased performance.
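To visualize why a mid-sized cluster tends to win, here's a deliberately crude toy model: the hit-rate benefit of sharing saturates with cluster size, while crossbar cost grows roughly with the square of the port count. Every constant below is invented purely to show the shape of the curve, not to predict RDNA 5.

```python
from math import exp

# Toy model: perf-per-area of an L1 sharing cluster vs cluster size.
# Saturating speedup from sharing vs quadratically growing crossbar overhead.
# All constants are invented; only the shape of the tradeoff is the point.

def perf_per_area(cus_per_cluster: int) -> float:
    speedup = 1.0 + 0.5 * (1.0 - exp(-cus_per_cluster / 8.0))  # diminishing returns
    area    = 1.0 + 0.002 * cus_per_cluster ** 2               # crossbar area/power
    return speedup / area

for cus in (1, 4, 8, 12, 16, 32, 80):
    print(f"{cus:>2} CUs/cluster: {perf_per_area(cus):.3f}")
```

The exact optimum obviously moves with the constants; the takeaway is just that a saturating benefit set against a superlinear interconnect cost naturally produces a sweet spot at a moderate cluster size, which is at least consistent with the papers' C5-C10 result.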

It does not help with traditional graphics workloads. It can be nifty for the kind of workloads that do graphics as compute, which a lot of people are experimenting with.
Would be interesting to see, yes. All new console generations resulted in new optimizations and paradigm shifts. Such a shared L1$ with Work Graphs and Neural Rendering might just be the next thing.
 
  • Like
Reactions: Tlh97 and MrMPFR

adroc_thurston

Diamond Member
Jul 2, 2023
7,782
10,486
106
Would be interesting to see, yes. All new console generations resulted in new optimizations and paradigm shifts. Such a shared L1$ with Work Graphs and Neural Rendering might just be the next thing.
Cross-gen periods kill whatever you think is gonna happen.
 

MrMPFR

Member
Aug 9, 2025
139
278
96
Thanks for the detailed paper analysis @basix.
Here's an older 2019 paper that looks related to AMD's DSMEM implementation: https://adwaitjog.github.io/docs/pdf/remote-core-pact19.pdf

Another difference between papers is 28 CUs vs 80.
On YT the researchers compared 2020 (private and shared) vs 2021 (shared only) as the difference between one building where everyone has their own kitchen but can share ingredients, and another that is one big shared kitchen. Indeed not apples to apples.

Something between C5 and C10 (16 or 8 CUs per cluster) seemed to yield the best results regarding performance, chip area and power efficiency. The rumored 12 CUs per RDNA5 SE seem to be the sweet spot.
There's actually a patent for adaptive CU clustering: https://patents.google.com/patent/US11360891B2
Also, the simulations are pretty much based on a GCN CU. RDNA's split L0s merged into one would align with PR40 in the 2021 paper, except for the cache size differences. RDNA 5 SE = 12/24 CUs, except for AT02. Clustering could be C6, C12, C8, maybe even irregular clustering, perhaps based on some clever heuristic or an ML block to maximize performance.

Maybe the P-GEMM speedup in the 2020 paper was a result of reaping the benefits from private and global at the same time?

The last few not-optimally-performing applications can be covered with bigger L1 caches overall
If the shared GFX12.5+ LDS+L0 slab is configurable, then the L0 size could be massive as long as this doesn't conflict with the LDS.

The bigger L2$ of RDNA5 would for sure help as well, and its interconnect could be used as a secondary / hierarchical and truly chip-global cache sharing network: not writing back to L2$ but sharing directly between SE L1 cache clusters. I'm not sure the L2 interconnect & bandwidth pressure will allow that
In the 2021 paper there seems to be an inverse correlation between L2 reply BW and shared L1 communication. In other words, the more the application benefits from the shared L1 (fewer L1 misses), the fewer requests are forwarded to the L2. Maybe the freed-up BW in the NoC is enough to satisfy the additional L1-to-L1 communication.

But I assume AMD would aim for a balanced approach, where power and area get reduced for all applications while gaining performance for most applications. Gaming would benefit from reduced power draw, ML/AI from increased performance.
Sounds about right. BTW here's another interesting patent for dataflow execution: https://patents.google.com/patent/US11947487B2.
As previously stated, there are many more patents that alter data management, cache/memory, execution, and scheduling characteristics, at least some of which should find their way into actual products. Perhaps as soon as RDNA 5, if we're lucky.

Interesting optimisations
Another new cache system optimisation paper (2021): https://jbk5155.github.io/publications/MICRO_2021.pdf
- Idle, unused I-cache and LDS capacity can now be used for the TLB. There's a massive IPC boost (>100%) in some applications and 30.1% on average. The paper also highlights the current issues with irregular memory access patterns and how this new approach counters them. I assume this would be beneficial to ray tracing and other irregular applications? If not in RDNA 5 then in later µarchs. The increased flexibility reminds me of another patent I shared: https://patents.google.com/patent/US12265484B2
This one sounds a lot like the flexible cache in M3 and later: the VRF is a cache, and everything can be reconfigured into a different cache type depending on the specific needs of the workloads being executed.

And why stop there? Apparently even registers can bypass the L2 for inter-register data requests, beyond the WGP takeover mode employed in the PS5 Pro. This looks like a GPR analogue to DSMEM (not an AMD patent): https://patents.google.com/patent/US20210398339A1/en
Though I'm not sure if true LDS + VGPR global sharing à la the 2020 and 2021 L1 papers (replication mitigation) is even possible; maybe some hybrid private/global implementation?

With SRAM scaling pretty much dead, business as usual won't work. So I'm really not surprised that AMD seems to be actively investigating basically every avenue to squeeze as much performance as possible out of limited SRAM capacities. It'll be interesting to see how this materializes in RDNA 5 and later.
 