Discussion RDNA 5 / UDNA (CDNA Next) speculation


MrMPFR

Member
Aug 9, 2025
The combo of nanite with RT has wrecked many games

Plus Lumen implementation produces a result that is very hard to optimize for low end (SVOGI - Voxel Cone Tracing gives much better bang for buck)
Nanite (AC Shadows Geo =/= Nanite) is only used in UE5 games, and most UE5 games don't bother with RTXGI, PTGI or HW Lumen. RT is not the problem. For example DDGI is blazing fast, but only covers diffuse lighting. For GI, SW Lumen is an inferior and slow software solution that tries to do everything (GI and reflections). The HW version looks much better but is still heavy.

Yeah but SVOGI only covers diffuse lighting, and the lighting in KCD 2 or Crysis Remastered doesn't come close to SW Lumen in UE5 titles or DDGI in titles such as Metro Exodus Enhanced Edition. DDGI > SVOGI

They heavily customize UE5 for their two games, unlike some other devs. AFAICT no Lumen and Nanite in Arc Raiders or The Finals, some serious in-house engine tweaking, and 2016 midrange GPU on min specs. Not surprising they run well. IIRC both use DDGI (RTXGI) and a fallback GI solution.

Probably as a rule of thumb the more customized the UE5 version in a game is the better it'll run. Engineers always wanna tweak and optimize stuff. TW4 prob first UE5 game using the latest tech AND basically running flawlessly. TurboTECH FTW!

They must do two things:
1) get to UE6 real quick because UE5 is now more or less a toxic keyword; new games that are well made with it are better off not saying which engine they're on
2) they need to fix the upgrade situation - devs who start making a game on major version X should be able to upgrade seamlessly to a newer minor version, otherwise it's total BS
1) Not gonna happen; UE6 isn't launching anytime soon. Sweeney said in Spring that a release (preview) is a few years out, so ~2030 release, 8 years after UE5's release. Yeah, but does that really carry over to the average PC gamer and console gamers? He also said they're going to abandon the old code completely and rewrite everything to be multithreaded, and that's not easy. Similar to what Unity did with DOTS, but I suspect more profound changes, including leveraging Work Graphs (a big deal for PS6 and RDNA 5) for virtually everything, based on Epic's public statements in 2024. UE6 mass adoption in the early to mid 2030s could be when RDNA 5 really begins to shine.
2) That is impossible considering how much they change with each release. But every single serious AA and AAA dev should commit to all the UE5.6+5.7 experimental stuff right away. UAF, FastGeo, Nanite Foliage...

This is an early adopters phase. UE4 all over again. Give it a few more years and by 2028 post PS6 launch a lot of new UE5 games will leverage all the experimental UE5.6 tech to eliminate traversal stutters and just run much better overall. By then the HW will be more capable (fingers crossed RDNA 5 and Rubin are good) and Advanced Shader Delivery will be pervasive.

Not specific to RDNA 5 or any graphics card, but just the current trajectory pushed by the incumbent powers that be - Read Epic, Nvidia, and graphics built on Unreal Engine.
This is an RDNA 5 thread so please post this somewhere else in the future, or don't.
 

MrMPFR

Member
Aug 9, 2025
Found a proposed explanation for why Sony bothered with Neural Arrays here:

"As I mentioned in another thread, it appears like the Neural Arrays solution likely is a means of providing groups of CUs that have additional circuits that can either passively work like the PS5/Pro or can treat the array of CU registers as sharing a memory address space, so that tensor tiles bigger than an individual CU register memory's L1 cache can be spanned across the CU's by a higher level Neural Array controller and eliminate a lot of the 40-70% wasted tile border processing (TOPs) that PSSR on PS5 Pro suffers from in the PS5 Pro technical seminar video at 23:54.

By allowing for much larger tiles via Neural Arrays the hardware could either be retasked to a Transformer model like DLSS4 or would already be operating on such large titles at lower resolution tensors that the holistic benefits of Transformers would already be achieved by the CNNs.

Assuming a Neural array tile was already big enough for a full 360x240 tensor to fit. If the Array was able to work like I'm guessing it would effectively be processing an entire Mip of the whole scene all at once."
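To put rough numbers on the quoted tile-border claim, here's a toy calculation of my own (the halo widths and tile sizes are my assumptions, not Sony's figures):

```python
# Back-of-envelope for the quoted 40-70% tile-border waste: a CNN that needs a
# halo of `h` pixels around each tile does useful work on T*T pixels but
# actually processes (T + 2h) * (T + 2h).
def border_waste(tile, halo):
    useful = tile * tile
    processed = (tile + 2 * halo) ** 2
    return 1.0 - useful / processed

for tile in (32, 64, 128, 512):
    for halo in (8, 16):
        print(f"tile {tile:4d}, halo {halo:2d}: {border_waste(tile, halo):5.1%} wasted")

# Small per-CU tiles with an 8-16 px halo land roughly in the quoted 40-70%
# range; a Neural-Array-sized tile spanning many CUs drops the waste to ~10% or less.
```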



As per the SIE Road to PS5 Pro vid, it has a WGP takeover mode to process one tile per WGP. With RDNA 5 AMD took the next logical step, which was to implement takeover mode at the SE level and process tiles not on a per-WGP/CU basis but on a per-Neural-Array basis.
I think this is the patent for WGP takeover mode: https://patents.google.com/patent/US12033275B2
Wonder if this takeover mode is a PS5 Pro customization or in RDNA 4 as well?

Unhinged speculation, but if scheduling, synchronization, and control logic is relegated to a higher level anyway (Shader Engine), AMD could decouple the ML logic completely from the current four SIMDs within a WGP and merge it into one giant systolic array per WGP/CU. With AMDFP4 (they need their own answer to NVFP4) and doubled FP8 throughput (4X/WGP), prob a 16 times larger FP4 WGP-level systolic array than RDNA 4's FP8 SIMD-level systolic array. In effect something like the systolic array found in a DC-class Tensor core or an NPU.
Doing this SIMD decoupling would require massive cache system changes. Perhaps with some tweaks to this patent AMD could implement a scheme where the systolic array gobbles up most of or the entire LDS+L0 and VGPR and allocates it as a giant shared Tensor memory, or a combination of this and private data stores. RDNA 4 has 4 x 192kB VGPR + 1 x 128kB LDS + 2 x 32kB L0 = 960kB maximum theoretical Tensor Memory per WGP/CU. Possibly RDNA 5 is even larger if the VRF and LDS+L0 get bigger with GFX13.
To connect it all together, implement the relevant SE-level logic AND an inter-WGP/CU fabric, and process enormous FSR5 tiles on a per-Neural-Array basis.
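Quick sanity check of the 960 kB figure, just summing the RDNA 4 capacities listed above:

```python
# Per-WGP "theoretical Tensor Memory" sum, using the RDNA 4 capacities listed
# in the post (kB = kilobytes).
vgpr = 4 * 192   # four SIMD32 vector register files
lds  = 1 * 128   # shared LDS
l0   = 2 * 32    # two L0 vector caches
print(vgpr + lds + l0, "kB per WGP/CU")   # -> 960 kB
```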

Sounds cool but prob not happening.

Whatever ends up happening, it's still a shame DF's latest vid didn't press Cerny on this. Some clarification could've been nice. All we got was:
"Neural Arrays will allow us to process a large chunk of the screen in one go, and the efficiencies that come from that are going to be a game changer as we begin to develop the next generation of upscaling and denoising technologies together."
 

MrMPFR

Member
Aug 9, 2025
gfx13 has it because gfx1250 has it.
2022: Hopper Adds DSMEM + TBC
2022: Ada Lovelace ignores^
2025: Blackwell consumer ignores ^^

AMD could've chosen to cut it from consumer like NVIDIA (Kepler confirmed it's not on consumer) but opted to include it anyway.
Kepler is wrong. @adroc_thurston is correct. Blackwell consumer has DSMEM + TBC.

Cerny already said the point was larger tiles. I assume this is targeting mostly the CNN portion of FSR5 assuming it sticks with FSR4 Hybrid CNN+ViT design. Working on a larger tile is effectively a larger "context window" = improved fidelity + also less wasted tile border processing.
That's the idea. Then how it carries over to the actual FSR5 implementation who knows.
 

adroc_thurston

Diamond Member
Jul 2, 2023
2022: Ada Lovelace ignores^
because it was sm89.
2025: Blackwell consumer ignores ^^
it does have dsmem actually.
NVIDIA (Kepler confirmed it's not on consumer)
It is on consumer, see the CUDA cc12 feature compatibility matrix.
[Screenshot: CUDA compute capability 12 feature compatibility matrix]
Cerny already said the point was larger tiles. I assume this is targeting mostly the CNN portion of FSR5 assuming it sticks with FSR4 Hybrid CNN+ViT design. Working on a larger tile is effectively a larger "context window" = improved fidelity + also less wasted tile border processing.
That's the idea. Then how it carries over to the actual FSR5 implementation who knows.
?
the point is that you get accelerated shmem transfers.
Without hammering the L2.
 

MrMPFR

Member
Aug 9, 2025
Thanks for the screenshot of the table. Really impressive they managed to cram all this new tech into a GB206 die that's still smaller than AD106. Tons of low-level optimizations and/or the silicon overhead is just minimal.

?
the point is that you get accelerated shmem transfers.
Without hammering the L2.
Yeah my explanation isn't great. Maybe someone else can explain it better.

Sure but you can't do the massive image processing tiles Cerny talked about without that. Just not feasible.

A related quote in case anyone is interested: "DSMEM enables more efficient data exchange between SMs, where data no longer must be written to and read from global memory to pass the data. The dedicated SM-to-SM network for clusters ensures fast, low latency access to remote DSMEM. Compared to using global memory, DSMEM accelerates data exchange between thread blocks by about 7x."
- From https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
 

Panino Manino

Golden Member
Jan 28, 2017
Won't argue about Threat Interactive, but I have to say, it's hard for me to understand something.
Why? As time passes, even as more and more processing power and memory becomes available, why do game graphics seem to involve more and more graphical compromises?
It always seems that more and more is required to do the same things that were perfected before.
 

itsmydamnation

Diamond Member
Feb 6, 2011
Won't argue about Threat Interactive, but I have to say, it's hard for me to understand something.
Why? As time passes, even as more and more processing power and memory becomes available, why do game graphics seem to involve more and more graphical compromises?
It always seems that more and more is required to do the same things that were perfected before.
I think the only real regression has been deferred rendering; hopefully at some point we can get back to some form of forward rendering that supports a large number of light sources. Then we can have real AA again.
 

soresu

Diamond Member
Dec 19, 2014
And the fact modern games look bad is just me imagining things?
Look bad?

Doom Dark Ages looks great IMHO, a significant step up from Doom Eternal.

Though clearly good art direction and/or cinematography is a big part of it, as the later addition of path tracing makes little difference to the cut scene visual fidelity.

Merely having a great rendering engine isn't nearly so good as having a director who actually knows what they are doing with it.

I doubt that merely putting Doom Eternal assets in the id Tech 8 engine would be anywhere near as effective.
 


MrMPFR

Member
Aug 9, 2025

TL;DR:​

Can be skipped: The last info dump was mainly scheduling related. Now I'll be addressing some additional potential RT related changes in RDNA 5 and correcting my old ignorant comments. This expands and complements previous posts by others and myself.
The scope of the proposed changes is massive, which explains the length of the post. It probably takes 15 minutes to read it all, but I've summarised the most important insights here.

The following post contains patent-derived info and the usual caveats apply. Not confirmed yet, but it's very likely all things considered. The exact implementations may not mirror the patents 1:1, but the impacts should be roughly the same.
Finally, we still need confirmation from an official source or a reputable leaker to be certain which parts are in RDNA 5.

The Good Stuff: The scope of the patent-derived RT core changes related to DMMs, DGF and prefiltering for RDNA 5 is on its own easily enough to qualify as a clean-slate ray tracing pipeline.
I've talked about these changes before but reading the patents again, properly this time, reveals additional info that expands the scope of changes significantly, which will be summarized here:
  1. Pre-Filtering Pipeline: Very wide, low-precision intersection testers (pre-filtering) mass cull triangles in DGF/prefiltering nodes and DMMs; they're incredibly fast, low latency, and carry a tiny area overhead. They reduce cachemem load for many reasons, including a reduced-precision scheme of almost half-precision integer math vs full-precision FP math.
  2. Integer dominates FP: INT tests mostly replace traditional FP tests, and in some instances FP tests are never required at all.
  3. Versatile Pre-filtering: Wide BVH (BVH8-16 or higher) pre-filtering can be used for ray intersection tests against all primitives, including linear swept spheres and bounding boxes.
  4. DGF and DMM benefits: DGF and DMMs have lower cachemem overhead and boost performance on their own, even without pre-filtering.
  5. Fallback method and always cache aligned: Whenever possible the RT pipeline strives to bundle geometry into cache-aligned, fixed-size data structures and uses them to reduce memory transactions and load on the cachemem system. A fallback called prefiltering nodes is used when DGF isn't implemented; I call it DGF Lite since it only compresses vertices.
  6. Less decode and data prefetching: With prefiltering, the decoding overhead from DGF and prefiltering nodes can be reduced. Less data also needs to be fetched, since full-precision data is never prefetched and only fetched when needed.
  7. Novel DMM encoding: The DMM encoding scheme is novel, replacing 64 subdivided triangles with 14, and can be evaluated in a single traversal step using a BVH14 node instead of two to three.
  8. Prism Volume HW: Dedicated Bounding Circuitry in the RT cores constructs prism volumes to accelerate the above.
  9. Math enabling prefiltering: Various precomputations at ray setup enable the low-precision tests.
  10. Quantized OBBs: OBBs are quantized using Platonic solids, which enables prefiltering of ray/box intersections.
  11. Goated CBLAS: DGF and prefiltering nodes are basically made for a compacted CBLAS BVH architecture. With DMM on top this is even more insane, with up to a ~16,400x reduction in the number of leaf nodes compared to the conventional method. In addition there's more than an order of magnitude reduction in BVH footprint compared to RTX Mega Geometry at iso-geometric complexity.
  12. Less redundant math: Configurable inside/outside ray/edge test sharing removes redundant calculations; it directly benefits from DGF adoption but can work with all kinds of geometry. Pre-filtering provides a further speedup here.
  13. The Holy Grail - Partial Ray Coherency Sorting: Ray coherency sorting for leaf nodes: rays coalesced against the same DGF/prefiltering node are executed all at once within an RT core before switching to the next node. Massive cachemem load reduction and improved coherency.
  14. DGF + DMM ^: With DGF/prefilter nodes + DMMs the scope of ray coherency sorting can be expanded to cover more of the BVH, and it may be applied a second time at the DMM base triangle level to minimize load on the cachemem system and the Bounding Circuitry for prism volumes.
Take away:
The new RT core implementation in RDNA 5 appears mighty impressive. It's without a shadow of a doubt well beyond 50 series on many fronts. This is multiple gens of leapfrogging in architectural sophistication. NVIDIA better be cooking something good with Rubin or find a new thing to chase.

These architectural changes are not about brute forcing the problem by mindlessly throwing more logic and cache at it. They're about redesigning the entire pipeline from the ground up with ingenious optimisations derived from first-principles thinking. The results are as expected.

Displaced Micro-Maps (DMMs)​

- Then lists the three patents related to an AMD DMM implementation as beyond current µArch, when DMM has been supported on RTX 40 series since 2022.
How is AMD's implementation beyond Ada's DMM decompression engine (Blackwell removed it)?
I'm sorry Kepler for not bothering to actually read the patent. The leapfrogging is obvious and significant.

AMD uses a special method to replace the traditional three-level subdivision of one base DMM triangle (64 tris) with 14 triangles derived from three tetrahedra + two triangles. In other words a 4.6x reduction in triangle count.
The standard technique would describe the DMM bounding volume using two or three lower-level BVH levels within it, corresponding to BVH8 or BVH4 node implementations respectively. But with a wide, parallel, low-precision integer based (pre-filtering) pipeline, the three subdivision levels can be described with just one BVH14 node. In other words, once the ray hit at the DMM base triangle is confirmed, just one additional traversal step (the BVH14 node) is needed, where everything is evaluated in one go. In addition, any floating-point tests and data are only fetched if an inconclusive result is obtained from one of the triangles, so in many instances they'll be skipped entirely.

And if you wonder whether that's possible, here's a patent that mentions an exemplary implementation in which "...each group of low-precision testers 306 can test a ray against 8 to 16 triangles at once.” So BVH14 doesn't sound impossible, albeit it's an odd number and a very wide BVH for ray/tri evaluations.
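For reference, here's how I read those numbers (the 3 tetrahedra x 4 faces + 2 triangles = 14 breakdown is my interpretation, not a patent quote):

```python
# My reading of the DMM encoding numbers discussed above.
level3_microtris = 4 ** 3               # classic 3-level subdivision: 64 micro-triangles
amd_encoding     = 3 * 4 + 2            # 3 tetrahedra (4 faces each) + 2 triangles = 14
print(round(level3_microtris / amd_encoding, 1))   # ~4.6x fewer triangles, as stated

# Traversal steps below the DMM base triangle:
#   conventional: BVH8 over 64 leaves -> 2 extra steps (3 with BVH4)
#   proposed:     a single BVH14 node tested by 8-16-wide low-precision testers -> 1 step
```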

There's more as well, such as the Tessellation Circuitry that decodes the subdivisions and prepares DMMs, probably aligning with the improved scheme AMD came up with, and a Bounding Circuitry that generates the respective prism volumes on demand, which helps shrink BVH size significantly.

I doubt NVIDIA does their Prism Volume builds on GPU, but if they do AMD's pre-filtering, very wide BVH14 format, and their new DMM triangle encoding scheme are still major changes.

Prefiltering Computational Efficiency​

I've already explained what this is before, so won't go into too much detail, but will provide additional info. A prefiltering pipeline is a set of smaller and simpler integer based ray intersection testers for triangles, bounding boxes or other primitives (linear swept spheres...) that do these evaluations in parallel across a very wide BVH with 8-16 child nodes/primitives in one example.
This seems counterintuitive given how demanding ray/tri evaluations especially are, but the problem with RT is not calculations, it's cachemem bottlenecks. Thus it's desirable to make the BVH as wide and shallow as possible. In addition the reduced-precision integer based intersection testers are tiny in comparison: "Performing arithmetic on reduced precision values is significantly simpler and therefore requires significantly less silicon area (i.e., multiple times less area) than arithmetic on full precision, and we expect that it should also use less power and reduce latency."

I don't think 2-3x intersections/area is enough here, as you're not just going from an FP32 to a ~FP16 pipeline but to ~INT16, which removes a lot of the overhead associated with FP, as described in greater detail here: "Generating the quantization box 602 and overlaying the quantization box 602 with the grid advantageously enables an intersection testing circuitry to perform tests using arithmetic with just small Q-bit integer values (or slightly larger, e.g., Q+3 bit values). By comparison, conventional testers use single-precision floating point arithmetic that perform computations on 23-bit mantissas and need additional computational resources for 8-bit exponents, sign bits, Not a Number (NaN), denormalized numbers, etc. Using fixed-point circuitry for testing allows for much smaller arithmetic logical unit (ALU) circuitry.”
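To make the idea concrete, here's a minimal sketch of such a conservative reduced-precision reject test, assuming a cubic power-of-two quantization box and a Q value of my own choosing; Python floats stand in for the fixed-point HW, and none of the names come from the patent:

```python
# Toy "prefilter" reject test: child boxes are stored as small Q-bit integers
# inside a power-of-two quantization box, and a conservative integer-style slab
# test rejects most children so the full-precision FP test only runs on survivors.
Q = 12                                   # grid bits ("Q-bit" values in the patent quote)

def to_grid(x, qmin, qscale):
    """Map a world coordinate into grid units (in HW this is fixed-point, done
    once at ray setup; floats here are just for brevity)."""
    return (x - qmin) / qscale * (1 << Q)

def prefilter_ray_box(ray_o, ray_inv_d, child_lo, child_hi, qmin, qscale):
    """False = certain miss, True = possible hit -> fetch full data, run FP test.
    child_lo/child_hi are the child's bounds in Q-bit grid units; qmin/qscale
    describe the (cubic) quantization box, so t comparisons stay valid."""
    tmin, tmax = 0.0, float("inf")
    for a in range(3):
        o = to_grid(ray_o[a], qmin[a], qscale)
        lo, hi = child_lo[a], child_hi[a] + 1   # widen by one grid cell to stay conservative
        t0 = (lo - o) * ray_inv_d[a]
        t1 = (hi - o) * ray_inv_d[a]
        tmin = max(tmin, min(t0, t1))
        tmax = min(tmax, max(t0, t1))
    return tmin <= tmax

# A BVH8-16 node would run this for all children in parallel; only the few
# inconclusive ones ever touch the FP pipeline or fetch more data.
```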

I was foolish enough to attempt to get a viable answer from CoPilot, but it was all over the place: anywhere from 2x to 10x, constantly flip-flopping, and as always severe sycophancy. If someone smarter than me can reasonably guesstimate the increase in intersections per area/watt/kB of cache at the same process node, I don't think I'm the only one who would appreciate it.
Otherwise I guess we just have to wait for the official AMD breakdowns where intersections/mm^2 and cache overhead reduction will surely be discussed.

As previously mentioned, an FP test is still needed for an inconclusive triangle, but it's just one, and inconclusive results only arise when the INT tests can't confirm a hit or miss, which in many cases they will. In some cases FP tests can be avoided entirely:
"...stand-alone low-precision test may be used for coarse level-of-detail cases, e.g., where an object is tiny, or far off in the distance (and so is only a handful of pixels in the image), or otherwise may be rendered crudely. In such cases, t-values and barycentric coordinates can be calculated as described above, and the full-precision testing can be entirely avoided by using the low-precision test results as the only parameter in the calculation"
I see no reason why this can't be implemented through the compiler, so significant additional speedups can be expected in ALL ray traced games.

To summarize, in all likelihood AMD will shrink the footprint of the FP testing circuitry significantly in RDNA 5, since it won't be used very often. They'll more than offset it with a humongous increase in intersection rate from tons of parallel INT testers, which could result in intersection testing logic at iso-area on N3P vs N4 being massively more potent than what is possible with FP tests.

Prefiltering Memory Efficiency​

A large reduction in cache overhead per BVH node can be expected as well, but this will depend on how well the data is compressed more so than raw ray/tri intersection rate. Just like in prior implementations this will be the bottleneck determining BVH traversal rate.

As for BVH data, all you need to know is that the data required for computations gets compressed for integer math instead of floating point. Since we're not dealing with decimals, precision can be reduced from single-precision (FP32) to almost half-precision (INT16): “The power-of-two box dimensions can provide computational benefits in that the bounding box can be compactly stored, e.g., by only storing the bfloat16 value of a minimum corner and exponent byte of the power-of-two box dimensions.”
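Rough byte counting under assumed values (Q = 8 bits per child coordinate and an 8-wide node; neither number is from the patent) to show why this matters:

```python
# Illustrative node storage comparison: full-precision child boxes vs a
# quantization-box header plus Q-bit integer child bounds.
fp32_box  = 6 * 4                # 24 B: min/max corner, 3 x FP32 each
header    = 3 * 2 + 1            # bfloat16 min corner (x,y,z) + exponent byte
Q         = 8
child_box = 6 * Q // 8           # 6 B: six Q-bit grid coordinates per child
children  = 8                    # e.g. a BVH8 node

full   = children * fp32_box
packed = header + children * child_box
print(full, "B vs", packed, "B ->", round(full / packed, 1), "x smaller")   # 192 vs 55 -> ~3.5x
```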

Prefiltering Additional Info​

Special precomputations are required at ray setup to enable the pre-filtering pipeline, but AMD has another patent for that, making it viable at runtime. I won't go into it here, but these are significant since FP and INT ray evaluations are quite different.

Most of the benefits are still possible without DGF itself. They have a fallback method using prefiltering nodes that only quantize vertices vs DGF's more comprehensive scheme, but this can still enable CBLAS, significant BVH compression, and massive speedups as a result of pre-filtering.

The new quantized OBBs that use Platonic solids are now the default encoding scheme (prefetched) for Oriented Bounding Boxes and are built for mass pre-filtering of ray/box intersections. This should help reduce HW complexity and lower the OBB BVH footprint: "...provides several benefits, such as reducing the complexity of the hardware for applying the orientation of the oriented bounding box to the ray. The amount of data required to be stored is reduced as compared with an implementation that uses more orientation data because the reduced number of possible rotations can be represented with a smaller amount of data.”

Will it be as good as RDNA 4's OBBs? No, but it'll be superior to regular AABBs, and once again the bottleneck isn't computations, it's cachemem.
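For a sense of scale, the rotation groups of the Platonic solids are small, so a quantized orientation costs only a few bits (the quaternion comparison is my own illustration, not from the patent):

```python
import math

# Platonic-solid rotation group sizes -> bits needed to index a quantized orientation.
rotation_group = {"tetrahedron": 12, "cube/octahedron": 24, "icosahedron/dodecahedron": 60}
for solid, n in rotation_group.items():
    print(f"{solid:26s}: {n:2d} orientations -> {math.ceil(math.log2(n))} bits")

# versus a free orientation: a quaternion is 4 x FP32 = 128 bits, and the ray
# rotation hardware gets simpler too when only a handful of rotations exist.
```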

Dense Geometry Format​

It's just old boring DGF. Really hoped for more in RDNA 5 even if it's still beyond Blackwell.
I'm taking that back. DGF is great, and Cerny is 100% correct in that DGF enables flexible and efficient data structures by keeping as much data as possible wrapped into cache-aligned, fixed-size packages. I was just expecting additional changes related to data structures like overlay trees and delta instances in HW.
Once again, bothering to read the entirety of the related patents reveals that DGF + prefiltering has synergies that go well beyond what you would expect, and it's definitely a clean-slate RT pipeline in itself.

Pre-filtering with DGF is also much faster on the decode side, as data doesn't need to be de-quantized, enabling faster DGF/prefiltering node decompression compared to full-precision (FP). And since there's a dedicated DGF/prefilter decompressor we can expect this decompression to be sufficiently quick.

DGF has superior cache mem characteristics on all architectures, but really needs pre-filtering to shine: "...parallel rejection testing of large groups of triangles enables a ray tracing circuitry to perform many ray-triangle intersections without fetching additional node data (since the data can be simply decoded from the DGF node, without the need of duplicating data at multiple memory locations). This improves the compute-to-bandwidth ratio of ray traversal and provides a corresponding speedup.”
In addition it "...removes the need for fine-grained spatial partitioning at the triangle level, resulting in a lower memory footprint and lower memory bandwidth during traversal.”

Another benefit of DGF pre-filtering is that it doesn't need to fetch high-precision triangle data unless a floating point test is necessary. So keeping everything fixed-point by default should debloat the private RT core caches significantly.

Otherwise all previous characteristics of pre-filtering nodes are still in effect. The only difference is that DGF decoding takes longer, but has much improved compression.

Sharing results of inside/outside ray/edge testing removes redundant calculations and, from what I can gather, exploits the triangle strips in DGF. It can even be done using the low-precision parallel intersection testers, just like with DGF.
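A toy sketch of how I read the edge-sharing idea, assuming a plain triangle strip; it skips the t-value and watertight tie-breaking, and the function names are mine:

```python
# Triangle i and i+1 in a strip share edge (v[i+1], v[i+2]), so its ray/edge
# test is computed once and reused: 2 edge tests per triangle instead of 3.

def edge_test(a, b, o, d):
    """Signed triple product det(a-o, b-o, d): the usual ray-vs-edge orientation test."""
    ax, ay, az = a[0]-o[0], a[1]-o[1], a[2]-o[2]
    bx, by, bz = b[0]-o[0], b[1]-o[1], b[2]-o[2]
    return ((ay*bz - az*by) * d[0] +
            (az*bx - ax*bz) * d[1] +
            (ax*by - ay*bx) * d[2])

def strip_candidates(verts, o, d):
    """Strip triangle indices whose interior the ray line crosses
    (double-sided test: all three edge signs agree)."""
    hits = []
    e_shared = edge_test(verts[0], verts[1], o, d)
    for i in range(len(verts) - 2):
        e01 = e_shared                                   # reused from the previous triangle
        e12 = edge_test(verts[i+1], verts[i+2], o, d)
        e20 = edge_test(verts[i+2], verts[i],   o, d)
        if min(e01, e12, e20) >= 0 or max(e01, e12, e20) <= 0:
            hits.append(i)
        e_shared = e12                                   # edge shared with triangle i+1
    return hits
```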

So there's a lot more to DGF than the apparent first impression. It's really a cascade of optimizations derived from the original idea of cache-aligned, coherent, fixed-size geometry blocks. Someone at AMD prob had a Eureka moment regarding prefiltering during the early development of DGF.

Leaf Node Ray Coherency Sorting​

Yes, that's right, AMD actually mentions this in one of the patents. This is really the last thing I expected but it totally makes sense: since we have fixed-size leaf nodes with DGF and prefiltering nodes, why not leverage that for ray coherency sorting. I suspect this ray coalescing will be comprehensive and could benefit from dedicated sorting HW.
Whether RDNA 5 includes additional mechanisms for coherency sorting beyond this and the SWC, who knows, but at least this implies a level 4 implementation for leaf nodes.

Sure, this isn't as comprehensive as Imagination Technologies' Packet Coherency Gather (sorry for providing an incorrect patent link, fixed) which covers the entire RT pipeline, but AMD did find their own workaround for the leaf node triangle tests, which are by far the most problematic and divergent. This is another step up in cache system efficiency: from one or multiple memory transactions per triangle for each ray, to one transaction per DGF block for each ray, and now just one to a few transactions for all rays coalesced against that node (containing 1-4 DGF blocks). Yes, FP tests will still be needed, but again this only has to be done one or a few times and not on a per-ray basis.

Ray coalescing requires deferred execution to work, and this will be a major change vs the current paradigm that races through the entire BVH and, when a hit is determined, immediately executes an any hit shader. In addition there's another patent for deferred any hit shader evaluation that coalesces execution items against an any hit shader, so this is possibly a recurring theme with RDNA 5.

Anyways what we can expect is probably a pipeline that executes as you would normally expect from the TLAS down to the DGF node (prefiltering node if you substitute 1:1), where the proposed RT pipeline could be as follows: #1 wait for all rays to reach a DGF node, #2 sort rays by target DGF node, #3 schedule all rays hitting one DGF node on a single RT core and fetch the data once, and #4 only when all rays have been processed switch to the next DGF node.

For anyone doubting this, here's the quote:
"The intersection scheduler pipelines data decoding requests to the decompressor 500 for multiple rays coalesced against the same node. All the rays coalesced against the same node are sent first before switching to another node. The decompressor 500 switches to the next ray only after decoding the last batch test for the current ray. Further, the decompressor 500 switches to the next node only after decoding the last batch test of the last ray for current node it is processing.”

DGF + DMM = Goated CBLAS​

The wild potential synergies between DGF and DMM arise when, yet again, bothering to read the entire patent. This patent mentions data nodes corresponding to a DGF Array consisting of multiple DGF blocks: "...each data block can store data for up to 64 triangles, whereas each array can store data pertaining to 256 triangles, 256 vertices, and 64 materials.” Other patents mention clusters of 65-128 tris in a BVH data node so that seems more likely, but we'll see.
Then with the DMM multiplier on top we get 256 x 64 tris, or up to 16,384 triangles per data node or CBLAS in a BVH. With the proposed 65-128 tri clusters this is still 4,160-8,192 tris/CLAS. Compared to DGF CBLAS that's 64X fewer leaf nodes. The leap over the conventional approach is even greater, with a ~3-4 order of magnitude reduction in the number of BVH leaf nodes.

For detailed micro-polygon geometry such as Nanite, the speedup in BVH build times would make RTX Mega Geometry look pathetic in comparison, not to mention the BVH footprint. With DGF, AMD only says the BVH can be compressed 2.7-3X, but that's with full-precision data. The DMM compression factor depends on the asset mesh, but it's ~6.5-20X per the Ada Lovelace whitepaper. Regardless, this is easily more than an order of magnitude reduction in BVH footprint compared to RTX Mega Geometry.
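Spelling out the arithmetic above (the compression factors are the ones quoted in the post; multiplying them together is my own rough assumption):

```python
# Leaf-node count per data node.
dgf_array_tris = 256                 # triangles per DGF array / data node (patent example)
dmm_subtris    = 64                  # micro-triangles per DMM base triangle
print(dgf_array_tris * dmm_subtris)  # 16,384 tris per leaf (4,160-8,192 for 65-128 tri clusters)

# BVH footprint, naively combining the stated compression factors
# (real numbers depend heavily on the mesh).
dgf_compression = (2.7, 3.0)         # AMD's DGF figure, full-precision data
dmm_compression = (6.5, 20.0)        # Ada Lovelace whitepaper range
lo = dgf_compression[0] * dmm_compression[0]
hi = dgf_compression[1] * dmm_compression[1]
print(f"~{lo:.0f}x to ~{hi:.0f}x smaller BVH")   # ~18x to ~60x
```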

Other Impacts of DGF + DMM​

I have to caution that this kind of setup causes a big shift in the ratio between ray/tri and ray/box intersections and may not be practical in all situations. But the biggest benefit is related to the increased impact of ray coalescing. The implementation would make the non-leaf part of the BVH shallower and minimize the computational overhead from the ray/box intersections while increasing the amount of triangles exposed to ray coalescing and ray/tri intersections. Given how large these data nodes would be, two ray coalescing events seem like the way to go: one at the DGF block and another at the DMM base triangle.
This would expose three traversal steps to the impact of ray coalescing: two to reach the DMM base triangles and one to locate the intersected DMM sub-triangle. For a 128 tri DGF Array/Node + DMM that would be BVH8 -> BVH16 -> BVH14. For a 256 tri DGF Array/Node + DMM the pipeline would be BVH16 -> BVH16 -> BVH14.

The DGF pipeline would be as described earlier and the DMM pipeline would closely mirror that: #1 traverse all DGF nodes for all rays until they hit DMM base triangles, #2 sort rays by target DMM base triangle, #3 schedule all rays hitting one DMM base triangle to be run together within the RT core and fetch the data once, and #4 when all rays have been processed switch to the next DMM base triangle. It would still run within the higher DGF node step, but coalesce ray tests on a per-DMM-base-triangle basis instead of a per-DGF-node basis. In doing so, prism volume related overhead at the Bounding Circuitry and additional DMM base triangle data fetches can be massively reduced; the effects here are similar to the ones observed with DGF nodes.

What lies beyond DMMs?​

I found this novel Holger Gruen patent, but I haven't seen any AMD research papers on it or additional patents, so I don't think this one is making its way into RDNA 5. Edit: Ignore this, see the later comment.
It encodes a high resolution mesh into a low resolution mesh by "...determining points on curved surfaces of curved surface patches defined for one of triangles and bi-linear quadrangles of the low resolution version of the high resolution mesh."

Instead of ray tracing triangles, which is the case with DMM and most other techniques, this one traces rays against entire curved surface patches with interpolated normals from the points along the patches. It's possible the ray evaluation is too complex or something else, IDK, but it achieves "...higher compression ratios than conventional techniques" and is superior to DMM in terms of processing overhead by being able to "...encode the offsets without evaluating data".

Edit: @vinifera found another patent that is actually related. It pertains to animated geometry using curved surface patches: "The circuitry generates a curved surface patch for each of the multiple base triangles of the first low LOD object. The circuitry divides the base triangles into multiple sub-triangles and generates modified displacements at the vertices of the base triangles and the sub-triangles. The circuitry generates, using the modified displacements, another low LOD object that is a representation of the high LOD object and conveys it to the display device." This patent is newer and was filed alongside the new DMM patents, but like with the previous patent no HW blocks are discussed, so this might still be too novel to implement in HW.
 

MrMPFR

Member
Aug 9, 2025
@vinifera you're free to steal this formatting and include it in your post. Makes it easier to reference later. I'll delete it when it's moved:
END
---------------------------

Reporting and analysis​

#1: Looks like deadlock and long-latency stall mitigation that can make GPUs more versatile (i.e. supporting more application types). Introduces fine-grained context saves and restores on the GPU down to the wavefront level.
Might be related to this patent, which sounds a lot like Volta's Independent Thread Scheduling: https://patents.google.com/patent/US20250306946A1
- Guessing this is CDNA 5 related.

#2: A Shader Engine level payload sorting circuit coupled to the Work Graph Scheduler. Might also be implemented at CU level. It is a specific HW optimization for work graphs independent of compute units. It "...improves coherency recovery time by sorting payloads to be consumed by the same consumer compute unit(s) into the same bucket(s). The producer compute units are able to perform processing while the sorting operations are being performed by the sorting circuit in parallel."
While the main target is work graphs the technique "...applicable to other operations, such as raytracing or hit shading, and other objects, such as rays and material identifiers (IDs)." Complementary to the Streaming Wave Coalescer.
- Since they mention rays it's very possible that this unit is responsible for the ray coalescing against DGF nodes that I described earlier. Very likely a RDNA 5 patent. Chajdas is involved and once again this optimization is crucial for Work Graphs.

#3: This allows a resource for a second task to be assessed in advance without interfering with the first task. It's as follows: execute the first task, then initiate the second task but pause before accessing said resource, and if the resource for the second task is ready after completion of the first task then it gets executed. Looks like this is implemented at the Shader Engine level. The patent states: "...sequential tasks can be executed more quickly and/or GPU resources can be utilized more fully and/or efficiently."
- Not sure about this one, but could end up in RDNA 5 or perhaps CDNA 5.

#4: A method of animated compressed geometry that's based on curved surface patches. This is related to the beyond DMM patent I discussed in prev post.
- Looks too novel to be in RDNA 5 + no HW blocks specified. Gruen is the sole originator.

#5: A method of deferring any hit shader execution, which makes it "...possible to group the execution of an any hit shader for multiple work-items together, thereby reducing divergence." (A toy sketch of the grouping idea follows after this list.)
- This is a big deal, possibly even bigger than SER if they can make the any hit shader evaluation very coherent. NVIDIA said this at the launch of SER: "With increasingly complex renderer implementations, more workloads are becoming limited by shader execution rather than the tracing of rays." Until fairly recently I thought SER was for coalescing ray tracing operations. Yeah I know it's stupid.
- This patent has McAllister listed alongside many researchers. Has to be in RDNA 5 since not including it would be asinine.

#6: This looks like the technique behind the Animated DMM GPUOpen paper unveiled at Eurographics 2024 and shared by @basix.
- I don't see specific mentions of HW logic for the animated DMMs beyond the basic DMM HW pipeline, but AMD needs this or a better approach because the paper stated that on RDNA 3 it has "...∼ 2.3−2.5× higher ray tracing costs than regular ray tracing of static DMMs." Gruen is the sole originator.
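And the sketch promised under #5: the same bucketing trick as the DGF coalescing sketch earlier, applied to any-hit shader invocations; grouping by a shader/material ID is my assumption, not a detail from the patent:

```python
from collections import defaultdict

# Instead of firing an any-hit shader per work-item as hits trickle in, defer
# them and run each shader once over a whole coherent group.
def run_deferred_any_hit(deferred_hits, any_hit_shaders):
    """deferred_hits: list of (shader_id, work_item) recorded during traversal."""
    groups = defaultdict(list)
    for shader_id, item in deferred_hits:
        groups[shader_id].append(item)        # coalesce work-items per any-hit shader
    for shader_id, items in groups.items():
        any_hit_shaders[shader_id](items)     # one coherent dispatch, less divergence
```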

Conclusion

#2 and #5 are the most important and will almost certainly end up in RDNA 5, on top of what I previously discussed in the last comment. Check out the new TL;DR if you don't fancy reading it all. Then add the additional public patents and undisclosed patents on top. This strongly implies their GFX13 RT implementation is leapfrogging NVIDIA Blackwell by several gens, well at least in sophistication, as they could decide to just gimp the RT cores to save on die space. It looks like AMD might turn the tables against NVIDIA in RT next gen, but Rubin is still a wildcard so anything could happen.
If they lose, NVIDIA will prob go: "RT is for console peasants, now here's a selection of generated AI games that can run on the new 6090 at 20 frames per second. We use DLSS and MFG to run it at 120 FPS xD." or "Now our tensor cores are so powerful that we can replace most of the ray tracing pipeline and it looks better."

Regardless, not surprised AMD and Sony are openly talking about path tracing on future HW when the pipeline looks this capable. Hope they resist the temptation to offset the architectural sophistication by cutting the HW down because it's "good enough". It can be amazing if they let it shine.

hey that's normal, Intel GFX R&D guys got swallowed whole by AMD.
Think we're beginning to see the results of that in patent filings rn.

Looks like RDNA5 def won't be short of paradigm shifts and novel ideas.
 