Discussion RDNA 5 / UDNA (CDNA Next) speculation


MrMPFR

Member
Aug 9, 2025
The combo of nanite with RT has wrecked many games

Plus Lumen implementation produces a result that is very hard to optimize for low end (SVOGI - Voxel Cone Tracing gives much better bang for buck)
Nanite (AC Shadows Geo =/= Nanite) is only used in UE5 games, and most UE5 games don't bother with RTXGI, PTGI or HW Lumen. RT is not the problem. For example, DDGI is blazing fast, but only covers diffuse lighting. SW Lumen is an inferior, slow software solution that tries to do everything (GI and reflections); the HW version looks much better but is still heavy.

Yeah, but SVOGI only covers diffuse lighting, and the lighting in KCD 2 or Crysis Remastered doesn't come close to SW Lumen in UE5 titles or DDGI in titles such as Metro Exodus Enhanced Edition. DDGI > SVOGI.

They heavily customize UE5 for their two games, unlike some other devs. AFAICT no Lumen and Nanite in Arc Raiders or The Finals, some serious in-house engine tweaking, and 2016 midrange GPU on min specs. Not surprising they run well. IIRC both use DDGI (RTXGI) and a fallback GI solution.

As a rule of thumb, the more customized a game's UE5 build is, the better it'll run. Engineers always wanna tweak and optimize stuff. TW4 will prob be the first UE5 game using the latest tech AND basically running flawlessly. TurboTECH FTW!

They must do two things:
1) get to UE6 real quick, because UE5 is now more or less a toxic keyword; new games that are well made with it are better off not saying which engine they use
2) fix the upgrade situation - devs who start making a game on major version X should be able to upgrade seamlessly to a minor version, otherwise it's total BS
1) Not gonna happen; UE6 isn't launching anytime soon. Back in the spring Sweeney said a (preview) release is a few years out, so a ~2030 release, 8 years after UE5's. And does the UE5 stigma really carry over to the average PC gamer and console gamers? He also said they're going to abandon the old code completely and rewrite everything to be multithreaded, and that's not easy. Similar to what Unity did with DOTS, but I suspect more profound changes, including leveraging Work Graphs (a big deal for PS6 and RDNA 5) for virtually everything, based on Epic's public statements in 2024. UE6 mass adoption in the early-to-mid 2030s could be when RDNA 5 really begins to shine.
2) That is impossible considering how much they change with each release. But every serious AA and AAA dev should commit to all the UE5.6+5.7 experimental stuff right away: UAF, FastGeo, Nanite Foliage...

This is an early-adopter phase, UE4 all over again. Give it a few more years: by 2028, post PS6 launch, a lot of new UE5 games will leverage all the experimental UE5.6 tech to eliminate traversal stutters and just run much better overall. By then the HW will be more capable (fingers crossed RDNA 5 and Rubin are good) and Advanced Shader Delivery will be pervasive.

Not specific to RDNA 5 or any graphics card, but just the current trajectory pushed by the incumbent powers that be - read: Epic, Nvidia, and graphics built on Unreal Engine.
This is an RDNA 5 thread, so please post this somewhere else in the future, or don't.
 

MrMPFR

Member
Aug 9, 2025
Found a proposed explanation for why Sony bothered with Neural Arrays here:

"As I mentioned in another thread, it appears like the Neural Arrays solution likely is a means of providing groups of CUs that have additional circuits that can either passively work like the PS5/Pro or can treat the array of CU registers as sharing a memory address space, so that tensor tiles bigger than an individual CU register memory's L1 cache can be spanned across the CU's by a higher level Neural Array controller and eliminate a lot of the 40-70% wasted tile border processing (TOPs) that PSSR on PS5 Pro suffers from in the PS5 Pro technical seminar video at 23:54.

By allowing for much larger tiles via Neural Arrays, the hardware could either be retasked to a Transformer model like DLSS4, or would already be operating on such large tiles at lower-resolution tensors that the holistic benefits of Transformers would already be achieved by the CNNs.

Assuming a Neural Array tile were already big enough for a full 360x240 tensor to fit, and the Array worked the way I'm guessing, it would effectively be processing an entire mip of the whole scene all at once."
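For intuition on the border-waste figure in that quote: if the upscaler needs a halo of B pixels of context around each T x T tile it processes, the fraction of TOPS spent on the halo is roughly

\[ \text{waste} = 1 - \frac{T^2}{(T+2B)^2} \]

With a small per-CU tile of T = 64 and a halo of B = 16 (my own guess, not a Sony figure), that's 1 - 4096/9216 ≈ 56%, right inside the quoted 40-70% band. Quadruple the tile to T = 256 with the same halo and the waste drops to 1 - 65536/82944 ≈ 21%, which is the whole pitch for spanning one tile across a Neural Array.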



As per the SIE Road to PS5 Pro video, PS5 Pro has a WGP takeover mode to process one tile per WGP. With RDNA 5, AMD took the next logical step: implement takeover mode at the SE level and process tiles not on a per-WGP/CU basis but on a per-Neural-Array basis.
I think this is the patent for WGP takeover mode: https://patents.google.com/patent/US12033275B2
Wonder if this takeover mode is a PS5 Pro customization or present in RDNA 4 as well?

Unhinged speculation, but if scheduling, synchronization, and control logic is relegated to a higher level anyway (the Shader Engine), AMD could decouple the ML logic completely from the current four SIMDs within a WGP and merge it into one giant systolic array per WGP/CU. With AMDFP4 (they need their own answer to NVFP4) and doubled FP8 throughput (4X/WGP), that's prob a 16 times larger FP4 WGP-level systolic array than RDNA 4's FP8 SIMD-level systolic array. In effect something like the systolic array found in a DC-class Tensor core or an NPU.
Doing this SIMD decoupling would require massive cache system changes. Perhaps with some tweaks to this patent AMD could implement a scheme where the systolic array gobbles up most of, or the entire, LDS+L0 and VGPR and allocates it as a giant shared Tensor memory, or a combination of this and private data stores. RDNA 4 has 4 x 192kB VGPR + 1 x 128kB LDS + 2 x 32kB L0 = 960kB maximum theoretical Tensor memory per WGP/CU. Possibly RDNA 5 is even larger if the VRF and LDS+L0 get bigger with GFX13.
To connect it all together, implement the relevant SE-level logic AND an inter-WGP/CU fabric, and process enormous FSR5 tiles on a per-Neural-Array basis.

Sounds cool but prob not happening.

Whatever ends up happening, it's still a shame DF's latest video didn't press Cerny on this. Some clarification would've been nice. All we got was:
"Neural Arrays will allow us to proces a large chunk of the screen in one go, and the efficiencies that come from that are going to be a game changer as we begin to develop the next generation of upscaling and denoising technologies together."
 

MrMPFR

Member
Aug 9, 2025
gfx13 has it because gfx1250 has it.
2022: Hopper Adds DSMEM + TBC
2022: Ada Lovelace ignores^
2025: Blackwell consumer ignores ^^

AMD could've chosen to cut it from consumer like NVIDIA (Kepler confirmed it's not on consumer) but opted to include it anyway.
Kepler is wrong. @adroc_thurston is correct. Blackwell consumer has DSMEM + TBC.

Cerny already said the point was larger tiles. I assume this is mostly targeting the CNN portion of FSR5, assuming it sticks with FSR4's hybrid CNN+ViT design. Working on a larger tile is effectively a larger "context window" = improved fidelity + less wasted tile-border processing.
That's the idea. How it carries over to the actual FSR5 implementation, who knows.
 

adroc_thurston

Diamond Member
Jul 2, 2023
2022: Ada Lovelace ignores^
because it was sm89.
2025: Blackwell consumer ignores ^^
it does have dsmem actually.
NVIDIA (Kepler confirmed it's not on consumer)
It is on consumer, see the CUDA cc12 feature compatibility matrix.
[attached screenshot: CUDA compute capability feature compatibility matrix]
Cerny already said the point was larger tiles. I assume this is mostly targeting the CNN portion of FSR5, assuming it sticks with FSR4's hybrid CNN+ViT design. Working on a larger tile is effectively a larger "context window" = improved fidelity + less wasted tile-border processing.
That's the idea. How it carries over to the actual FSR5 implementation, who knows.
?
the point is that you get accelerated shmem transfers.
Without hammering the L2.
 

MrMPFR

Member
Aug 9, 2025
Thanks for the screenshot of the table. Really impressive they managed to cram all this new tech into a GB206 die that's still smaller than AD106. Tons of low-level optimizations, and/or the silicon overhead is just minimal.

?
the point is that you get accelerated shmem transfers.
Without hammering the L2.
Yeah my explanation isn't great. Maybe someone else can explain it better.

Sure but you can't do the massive image processing tiles Cerny talked about without that. Just not feasible.

A related quote in case anyone is interested: "DSMEM enables more efficient data exchange between SMs, where data no longer must be written to and read from global memory to pass the data. The dedicated SM-to-SM network for clusters ensures fast, low latency access to remote DSMEM. Compared to using global memory, DSMEM accelerates data exchange between thread blocks by about 7x."
- From https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
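For anyone curious what that looks like from the software side, here's a minimal CUDA sketch using thread block clusters (sm_90+, launched with a grid that's a multiple of the cluster size). The kernel, sizes, and names are my own illustration, but this_cluster() and map_shared_rank() are the actual cooperative-groups API:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block fills its own shared memory, then reads a neighbour block's
// shared memory directly over the SM-to-SM network (DSMEM) instead of
// bouncing the data through global memory and the L2.
__global__ void __cluster_dims__(4, 1, 1) exchange(const float* in, float* out) {
    __shared__ float smem[256];
    cg::cluster_group cluster = cg::this_cluster();

    unsigned rank = cluster.block_rank();              // this block's index in the cluster
    smem[threadIdx.x] = in[blockIdx.x * 256 + threadIdx.x];
    cluster.sync();                                    // all smem in the cluster is now valid

    // Map the neighbour block's smem into our address space and read it.
    unsigned neighbour = (rank + 1) % cluster.num_blocks();
    const float* remote = cluster.map_shared_rank(smem, neighbour);
    out[blockIdx.x * 256 + threadIdx.x] = remote[threadIdx.x];

    cluster.sync();                                    // don't exit while others still read our smem
}
```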
 

Panino Manino

Golden Member
Jan 28, 2017
Won't argue about Threat Interactive, but I have to say, it's hard for me to understand something.
Why? With the flow of time, even as more and more processing power and memory becomes available, why do game graphics seem to involve more and more compromises?
It always seems that more and more is required to do the same things that were perfected before.
 

itsmydamnation

Diamond Member
Feb 6, 2011
Won't argue about Threat Interactive, but I have to say, it's hard for me to understand something.
Why? With the flow of time, even as more and more processing power and memory becomes available, why do game graphics seem to involve more and more compromises?
It always seems that more and more is required to do the same things that were perfected before.
I think the only real regression has been deferred rendering; hopefully at some point we can get back to some form of forward rendering that supports large numbers of light sources. Then we can have real AA again.
 

soresu

Diamond Member
Dec 19, 2014
And the fact modern games look bad is just me imagining things?
Look bad?

Doom Dark Ages looks great IMHO, a significant step up from Doom Eternal.

Though clearly good art direction and/or cinematography is a big part of it, as the later addition of path tracing makes little difference to the cut scene visual fidelity.

Merely having a great rendering engine isn't nearly so good as having a director who actually knows what they are doing with it.

I doubt that merely putting Doom Eternal assets in the id Tech 8 engine would be anywhere near as effective.
 

MrMPFR

Member
Aug 9, 2025
The post was too long, so I pushed the in-depth reporting to this Google Doc. Unlike my AMD patent docs, I promise this one will stay up forever:

Intro and Disclaimer (Skip if you like)

My last major post discussed RDNA 5's scheduling and dispatch changes. This one selectively addresses RDNA 5's potential RT changes, and significantly expands on info disclosed in my previous posts here and elsewhere, as well as the stuff disclosed by reputable leakers such as Kepler_L2. The scope of proposed changes is massive, which explains the length of the linked Doc. It probably takes 10-15 minutes to read, but I've summarised the most important insights here for your convenience.

The analysis is patent-derived, so the usual caveats apply. Nothing is confirmed yet, but it's all very likely all things considered, and while the exact implementations may not mirror the patents 1:1, the impacts should be roughly the same. We still need confirmation from a reputable leaker like Kepler_L2 to be certain what is and isn't in RDNA 5.

I'm sorry, Kepler, for my old ignorant comments; I've tried my best to address these in the docs, and I'll link them below.

The Good Stuff:

I know I've talked about these RDNA 5 RT changes before, even at length, but after reading the patents again, properly this time, the changes now appear much more profound in effect and scope, easily enough to warrant a summary:
  1. Pre-Filtering Pipeline: Implements very wide, parallel, low-precision (integer) intersection testers (pre-filtering) that mass-cull triangles in DGF/pre-filtering nodes and DMMs. These units have a tiny area overhead and low latency, so perf/area can be massively increased. They also reduce cachemem load for many reasons, including less control circuitry (integers are much easier to process than floating point) and a reduced-precision scheme of roughly half-precision integer math (~INT16, or more precisely Q+3) vs the full-precision FP math (FP32) used today.
  2. Integer dominates FP: By default INT tests handle all boxes/tris/primitives in a node. If one of these is inconclusive, a single traditional FP test is used to confirm the result (see the sketch after this list). In some instances, when the increased fidelity isn't needed or can't be appreciated (too far off in the distance), FP tests are never required, providing an even further speedup.
  3. Very wide BVH: Since the INT pipeline has a tiny cachemem load and area cost per ray intersection, very wide and shallow BVHs can be used. BVH8-16 is discussed in the patents, but maybe it will be even wider.
  4. Versatile Pre-filtering: Pre-filtering can be used for ray intersection tests against all sorts of primitives, including linear swept spheres, quads, and bounding boxes. Only limited by the HW used.
  5. Benefits of DGF and DMM: DGF and DMMs both have lower cachemem overhead (footprint, BW, and circuitry load). Both boost performance on their own, without pre-filtering.
  6. Always cache aligned: GFX13's RT pipeline strives to bundle geometry into cache-aligned, fixed-size data structures whether DGF is used or not. It uses these to reduce memory transactions and load on the cachemem system.
  7. DGF Fallback method: When DGF hasn't been implemented by a dev for an asset, a fallback method called pre-filtering nodes is used. I call it DGF Lite since it only compresses vertices.
  8. Less decode and data prefetching: With pre-filtering, the decoding overhead from DGF and pre-filtering nodes can be reduced, since less data needs to be fetched and stored. Full-precision data is never prefetched and only fetched when the pipeline requires a floating-point test.
  9. Novel DMM encoding: The DMM encoding scheme replaces 64 subdivided triangles with 14 and can be evaluated with a single traversal step and a BVH14 node, in contrast to previous methods reliant on three traversal steps using BVH4 or two traversal steps using BVH8.
  10. Prism Volume HW: Dedicated Bounding Circuitry in RT cores constructs prism volumes to accelerate DMM evaluation.
  11. Math enabling pre-filtering: Various precomputations that enable the low-precision tests are run at BVH build and ray setup instead of at runtime, which finally makes pre-filtering feasible.
  12. Quantized OBBs: Oriented bounding boxes (OBBs) are quantized using platonic solids, which enables pre-filtering of ray/box intersections against OBBs.
  13. Goated CBLAS: DGF and pre-filtering nodes are basically made for a compacted CBLAS BVH architecture. With DMM on top this is even more insane, as we can see up to a ~16,400 times reduction in the number of leaf nodes compared to conventional methods. It achieves more than an order of magnitude reduction in BVH footprint vs RTX Mega Geometry at iso-geometric complexity.
  14. Less redundant math: Configurable inside/outside ray/edge test sharing reduces redundant math; it directly benefits from DGF adoption but can work with all kinds of geometry. Pre-filtering provides a further speedup here.
  15. The Holy Grail - Partial Ray Coherency Sorting: Ray coherency sorting for leaf nodes is achieved by coalescing rays against the same DGF/pre-filtering node. The pipeline then executes them all at once within an RT core before switching to the next node. It exploits spatial coherency to deliver unprecedented (except for PowerVR Photon) scheduling and data coherency, allowing for superior data locality and reuse. In other words, a massive cachemem load reduction and speedup for RT leaf node evaluations.
  16. DGF + DMM Ray Coherency Sorting: With DGF/pre-filter nodes + DMMs, the scope of ray coherency sorting can be expanded to cover more of the BVH, and it may be applied an additional time at the DMM base triangle to minimize load on the cachemem system even further. It also eases the load on the Bounding Circuitry for Prism Volumes by avoiding duplicative builds.
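To make points 1-2 concrete, here's a toy CUDA sketch of the tri-state "integer first, FP32 only when inconclusive" flow. The quantization and error-bound math is my own simplification; the patents' actual Q+3 scheme and conservative rounding are precisely the clever parts I'm not reproducing:

```cuda
#include <cstdint>

// Tri-state verdict of the cheap test: exact misses get culled, everything
// else is either a guaranteed hit or needs the single FP32 confirm.
enum Verdict : uint8_t { CULL, ACCEPT, NEEDS_FP32 };

// Box corners quantized to 8-bit cell coordinates inside the parent node's
// local grid (DGF-style), widened outward at BVH build time so CULL is safe.
struct QBox { uint8_t lo[3], hi[3]; };

// Fixed-point slab test. org/invdir are the ray origin and 1/direction,
// precomputed once per ray in node-local 16.8 fixed point at ray setup.
// err is a precomputed bound on quantization + rounding error.
__device__ Verdict prefilter_box(QBox b, const int32_t org[3],
                                 const int32_t invdir[3], int32_t err) {
    int32_t tmin = -(1 << 30), tmax = (1 << 30);
    for (int a = 0; a < 3; ++a) {
        int64_t d0 = ((int32_t)b.lo[a] << 8) - org[a];  // plane minus origin, 16.8
        int64_t d1 = ((int32_t)b.hi[a] << 8) - org[a];
        int32_t t0 = (int32_t)(d0 * invdir[a] >> 8);
        int32_t t1 = (int32_t)(d1 * invdir[a] >> 8);
        tmin = max(tmin, min(t0, t1));
        tmax = min(tmax, max(t0, t1));
    }
    if (tmin - err > tmax + err) return CULL;     // miss even in the worst case
    if (tmin + err <= tmax - err) return ACCEPT;  // hit even in the worst case
    return NEEDS_FP32;                            // too close to call cheaply
}
```

A wide node would run 8-16 of these tiny compare chains in parallel, and per point 2 only the NEEDS_FP32 survivors ever touch the full-precision tester.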

Implications for Nextgen

With RDNA 5 it appears AMD has managed to design a novel and groundbreaking ray tracing pipeline. It's a monumental leap over RDNA 4's pipeline that easily qualifies as a clean slate. Note this conclusion was derived from an incomplete analysis; there are many more public and likely soon-to-be-published patents that will further expand the scope of changes, further solidifying this excellent architecture.

This shows AMD's architectural team is extremely talented. The changes are not about brute-forcing things by mindlessly throwing more logic and cache at the problem; they are about redesigning the entire pipeline from scratch with ingenious optimisations derived from first-principles thinking. The results are as expected: RT in RDNA 5 appears mighty impressive.

Compared against the competition, GFX13 RT is well ahead of the 50 series in architectural sophistication; likely multiple gens of leapfrogging at NVIDIA's usual cadence. So unless Rubin is a massive leap as well, AMD will easily have the architectural upper hand. But in the end this is just one side of the coin, since area investment is equally important, so Rubin remains a joker. And if NVIDIA loses in ray tracing, they absolutely need to find a new thing to chase.

Addressing previous ignorant comments

- Then lists the three patents related to an AMD DMM implementation as beyond current µArch, when DMM has been supported on RTX 40 series since 2022.
How is AMD's implementation beyond Ada's DMM decompression engine (Blackwell removed it)?
I'm sorry Kepler for not bothering to actually read the patent. The leapfrogging is obvious and significant.

It's just old boring DGF. Really hoped for more in RDNA 5 even if it's still beyond Blackwell.
I'm taking that back; DGF is actually amazing, especially when you build an architecture around it. Cerny was 100% correct when he said that DGF enables flexible and efficient data structures. It keeps as much data as possible wrapped in cache-aligned, fixed-size packages.
I was just expecting additional changes related to data structures like overlay trees and delta instances in HW.

Edit: @vinifera found another patent that is actually related to the new geo encoding scheme beyond DMMs (see Docs).
Second Edit: Formatting, rephrasing for better reading experience, improved info in point list and summary + moved in-depth to Docs.
 

MrMPFR

Member
Aug 9, 2025
@vinifera you're free to steal this formatting and include it in your post. Makes it easier to reference later. I'll delete this when it's moved:
END
---------------------------

Reporting and analysis

#1: Looks like deadlock and long-latency-stall mitigation that can make GPUs more versatile (i.e. supporting more application types). Introduces fine-grained context saves and restores on the GPU, down to the wavefront level.
Might be related to this patent, which sounds a lot like Volta's Independent Thread Scheduling: https://patents.google.com/patent/US20250306946A1
- Guessing this is CDNA 5 related.

#2: A Shader Engine level payload sorting circuit coupled to the Work Graph Scheduler. Might also be implemented at CU level. It is a specific HW optimization for work graphs independent of compute units. It "...improves coherency recovery time by sorting payloads to be consumed by the same consumer compute unit(s) into the same bucket(s). The producer compute units are able to perform processing while the sorting operations are being performed by the sorting circuit in parallel."
While the main target is work graphs the technique "...applicable to other operations, such as raytracing or hit shading, and other objects, such as rays and material identifiers (IDs)." Complementary to the Streaming Wave Coalescer.
- Since they mention rays, it's very possible this unit is responsible for the ray coalescing against DGF nodes that I described earlier. Very likely a RDNA 5 patent. Chajdas is involved, and once again this optimization is crucial for Work Graphs.
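A toy software analogue of what that sorting circuit would do, assuming my reading of the patent is right (all names and structures invented for illustration):

```cuda
#include <cstdint>

struct Payload { uint32_t consumer_id; uint32_t data; };  // e.g. material/node ID + args

// Pass 1 (not shown): count payloads per consumer, exclusive-scan the counts
// into per-bucket base offsets, and copy those offsets into `cursor`.
// Pass 2: scatter every payload into its consumer's contiguous bucket, so a
// consumer CU later drains one coherent bucket instead of a random interleave.
__global__ void bin_payloads(const Payload* in, int n, Payload* out, int* cursor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int slot = atomicAdd(&cursor[in[i].consumer_id], 1);  // claim a slot in the bucket
    out[slot] = in[i];
}
```

The hardware presumably does this without round trips through memory, and, per the quote, producers keep running while the sorting happens in parallel.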

#3: This allows a resource for a second task to be assessed in advance without interfering with the first task. It works as follows: execute the first task, then initiate the second task but pause before accessing said resource; if the resource for the second task is ready after completion of the first task, the second task gets executed. Looks like this is implemented at the Shader Engine level. The patent states: "...sequential tasks can be executed more quickly and/or GPU resources can be utilized more fully and/or efficiently."
- Not sure about this one, but could end up in RDNA 5 or perhaps CDNA 5.

#4: A method of animating compressed geometry based on curved surface patches. This is related to the beyond-DMM patent I discussed in my previous post.
- Looks too novel to be in RDNA 5 + no HW blocks specified. Gruen is the sole originator.

#5: A method of deferring any-hit shader execution, which makes it "...possible to group the execution of an any hit shader for multiple work-items together, thereby reducing divergence."
- This is a big deal, possibly even bigger than SER, if they can make the any-hit shader evaluation very coherent. NVIDIA said this at the launch of SER: "With increasingly complex renderer implementations, more workloads are becoming limited by shader execution rather than the tracing of rays." Until fairly recently I thought SER was for coalescing ray tracing operations. Yeah, I know, it's stupid.
- This patent has McAllister listed alongside many researchers. It has to be in RDNA 5, since not including it would be asinine.
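A software-flavoured sketch of what #5's deferral might amount to (structures invented; the real mechanism would live in HW/driver): instead of each lane invoking the any-hit shader the moment it finds a candidate hit, candidates are queued per shader and drained later in full waves:

```cuda
#include <cstdint>

struct AnyHitCandidate { uint32_t ray_id; uint32_t prim_id; float t; };

// Append a candidate to the queue of its any-hit shader instead of calling
// the shader immediately (which would diverge the wavefront).
__device__ void defer_any_hit(AnyHitCandidate c, uint32_t shader_id,
                              AnyHitCandidate* queues, int* tails, int capacity) {
    int slot = atomicAdd(&tails[shader_id], 1);   // claim a slot in that shader's queue
    if (slot < capacity)
        queues[shader_id * capacity + slot] = c;
    // A later dispatch runs one shader over one full queue at a time, so every
    // lane in a wavefront executes the same any-hit code: minimal divergence.
}
```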

#6: This looks like the technique behind the Animated DMM GPUOpen paper unveiled at Eurographics 2024 and shared by @basix.
- I don't see specific HW mentions of logic for the animated DMMs beyond the basic DMM HW pipeline, but AMD needs this or a better approach, because the paper stated that on RDNA 3 it has "...∼ 2.3−2.5× higher ray tracing costs than regular ray tracing of static DMMs." Gruen is the sole originator.

What can we expect?

#2 and #5 are the most important and will almost certainly end up in RDNA 5, on top of what I discussed in my last comment. It strongly implies their GFX13 RT implementation is leapfrogging NVIDIA Blackwell by several gens, at least in sophistication. AMD could decide to just gimp the RT cores to save on die space, but overall it looks like AMD might turn the tables on NVIDIA in RT nextgen. Rubin is still a joker, so anything could happen; we'll see.
If they lose, NVIDIA will prob go: "RT is for console peasants, now here's a selection of generated AI games that can run on the new 6090 at 20 frames per second. We use DLSS and MFG to run it at 120 FPS xD." or "Now our tensor cores are so powerful that we can replace most of the ray tracing pipeline and it looks better."

Regardless, I'm not surprised AMD and Sony are openly talking about path tracing on future HW when the pipeline looks this capable. Hope they resist the temptation of offsetting architectural sophistication with less HW, cutting it down because it's "good enough". It can be amazing if they let it shine.

hey that's normal, Intel GFX R&D guys got swallowed whole by AMD.
Think we're beginning to see the results of that in patent filings rn.

Looks like RDNA5 def won't be short of paradigm shifts and novel ideas.
 

soresu

Diamond Member
Dec 19, 2014
Guessing this is CDNA 5 related
Going forward it would seem CDNA -> RDNA -> CDNA -> RDNA.

At least as far as the CU goes, so unless the patent specifically talks about matrix cores you can assume it's likely going to end up in RDNA too if it ends up in anything.
 

marees

Golden Member
Apr 28, 2024
Going forward it would seem CDNA -> RDNA -> CDNA -> RDNA.

At least as far as the CU goes, so unless the patent specifically talks about matrix cores you can assume it's likely going to end up in RDNA too if it ends up in anything.
I am excited about the CCU patent, but it looks like CDNA stuff for now. At least it doesn't seem to be in PS6.
 

soresu

Diamond Member
Dec 19, 2014
If they lose, NVIDIA will prob go: "RT is for console peasants, now here's a selection of generated AI games that can run on the new 6090 at 20 frames per second. We use DLSS and MFG to run it at 120 FPS xD." or "Now our tensor cores are so powerful that we can replace most of the ray tracing pipeline and it looks better."
There's nothing probable about it.

They are already laying the foundation for such a pivot with all their neural rendering language in PR for RTX 50.
 

tsamolotoff

Senior member
May 19, 2019
And the fact modern games look bad is just me imagining things?
No, it's simply that some people don't really see all the noise and blur, and praise TAA and its derivatives despite the fact that they destroy image clarity and small details and make your head spin with ghosting in motion. This alone makes me not play modern games if there is no way to disable the temporal stuff. What is the point of fancy 'realistic' (tm) (C) lighting (which means fractions of a ray sample per pixel, temporally smeared and accumulated) if you can't see anything on the screen and your eyes start bleeding after a minute or so? (Talos Principle 2 is an egregious example of this.)
 

tsamolotoff

Senior member
May 19, 2019
Doom Dark Ages looks great IMHO, a significant step up from Doom Eternal.
At a cost of literally 10x the fps (and 5x if I enable ray tracing in this scene), is it really worth it? For us users, that is; it's obvious that full RT mode allowed id to save lots of bux on the development process.

[attached image: doom.jpg]