Discussion RDNA 5 / UDNA (CDNA Next) speculation


marees

Golden Member
Apr 28, 2024
DGF + prefiltering related patents:
AMD blog on DGF (from February) — a Nanite accelerator for RT use cases

A custom compression format can’t be consumed directly by the current raytracing API. Instead, we need to decode it into something which the APIs understand. This increases memory pressure and adds latency to the BVH build, both of which can lead to unstable, stuttery frame rates. Even if the API restrictions were lifted, the existing hardware acceleration structures are much too large to support future content, which will be authored with the lower data rates in mind.

Dense Geometry Format (DGF) is a block-based geometry compression technology developed by AMD, which will be directly supported by future GPU architectures. It aims to solve these problems by doing for geometry data what formats like DXT, ETC and ASTC have done for texture data.

Native GPU support for DGF will close the geometric complexity gap between raster and ray tracing

By moving geometry compression outside the driver, the compressor can process the data in ways that would either be too slow to perform at runtime or would violate the API specifications.


DGF is engineered to meet the needs of hardware by packing as many triangles as possible into a cache aligned structure. This enables a triangle to be retrieved using one memory transaction, which is an essential property for ray tracing, and also highly desirable for rasterization.


AMD Dense Geometry Compression Format SDK

AMD has released the Dense Geometry Compression Format (DGF) SDK to encourage developers to experiment with geometry compression, and provide a reference toolchain for integration into content pipelines.

If you have questions about the data format and encoding algorithms, or want more data points, you can also read our HPG 2024 paper.



Our AMD DGF SDK was recently updated with bug fixes, improvements, and new features. One of our most exciting new features is the addition of an animation-aware encoding pipeline. You can refer to the release notes for a full list of changes.

In this blog post, we take a look at animations and how they work with DGF.
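To make the "one memory transaction per triangle" property concrete, here's a rough C++ sketch of a cache-line-sized, DGF-style block. To be clear, this is not the real DGF layout (actual DGF bit-packs up to 64 triangles into a 128-byte block with variable-width fields; see the HPG 2024 paper), so all field names and counts below are illustrative:

```cpp
#include <cstdint>

// Hypothetical sketch of a DGF-style block: everything needed to decode a
// small cluster of triangles lives in one 128-byte, cache-aligned chunk, so
// fetching any triangle in the cluster costs a single memory transaction.
// Real DGF fits far more triangles per block via variable-width bit packing.
struct alignas(128) DgfStyleBlock {
    float    anchor[3];        // cluster-space origin the offsets are relative to
    float    scale;            // dequantization scale factor
    uint8_t  vertexCount;      // <= 12 in this fixed-width toy layout
    uint8_t  triangleCount;    // <= 12 in this fixed-width toy layout
    uint16_t geomIdFlags;      // geometry ID / opacity (OMM) metadata
    uint16_t vertices[12][3];  // quantized offsets from the anchor
    uint8_t  indices[12][3];   // local indices into the vertex array
};
static_assert(sizeof(DgfStyleBlock) == 128, "must stay one cache line");

// Decode one triangle using only data resident in the block.
inline void decodeTriangle(const DgfStyleBlock& b, int tri, float out[3][3]) {
    for (int v = 0; v < 3; ++v) {
        const uint16_t* q = b.vertices[b.indices[tri][v]];
        for (int a = 0; a < 3; ++a)
            out[v][a] = b.anchor[a] + b.scale * static_cast<float>(q[a]);
    }
}
```

The point of the layout is that a ray/tri test never chases a pointer out of the block: one aligned fetch brings in the anchor, the quantized vertices, and the connectivity at once.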

 

MrMPFR

Member
Aug 9, 2025
I thought RX and MI wouldn't become more µarch-aligned, given Kepler's earlier statements on UDNA.
Nvm, I didn't recall things correctly. Here's the explanation from Kepler:
People misunderstood what "UDNA" meant. It's not a single architecture across gaming and datacenter, but a unification of the development pipeline.

CDNA1/2/3/4 have many architecture advancements that are not in RDNA2/3/4 because they are in a completely different architecture branch.

With "UDNA" strategy, development follows a gaming->datacenter->gaming pattern, where advancements from one type of architecture can be re-integrated into the next if they make sense, but the architectures are still different as they don't need to have the same features (i.e gaming doesn't need strong FP64 or extremely large matrix cores, datacenter doesn't need RT/Texture/Geometry/Raster features).
 
  • Like
Reactions: Tlh97 and marees

MrMPFR

Member
Aug 9, 2025
AMD blog on DGF (from February) — a Nanite accelerator for RT use cases
DGF isn't Nanite; it's really just AMD's take on NVIDIA's now-deprecated Displaced Micro-Meshes introduced with Ada Lovelace, albeit with fewer drawbacks and some extra functionality, for example an OMM header (see patent). I don't recall DMM supporting animation either.
It's a great fit for a clustered BLAS architecture, but that doesn't make it a Nanite accelerator; it's just a geometry compression format. You're going to need a complete BVH management overhaul for it to actually work with Nanite, like NVIDIA's RTX Mega Geometry. AMD still has a long way to go :(

IDK about the DMM patents Kepler posted ~2 months ago, but perhaps those could be leveraged for faster ray/tri evaluation of Catmull-Clark subdivision surfaces. That's an important component considering how heavily RTX Mega Geometry relies on tessellation and subdivision (read the nvpro-samples GitHub documentation), so any AMD implementation is likely to lean heavily on this to reduce rebuilds.

Before I go off on a tangent, I have to preface that this is just speculation and suggestions, so don't take it seriously until we have actual confirmation. Lots of details are omitted as well, given all the proposed changes across the six linked patents.
I can see you linked the related patents. Lots of unanswered questions remain regarding those DGF and prefiltering patents. For example, is the stuff beyond DGF decompression in HW actually in RDNA 5, or just speculation at this point? If true, then precomputations in preparation for a quantized BVH + loading data for prefiltering = INT-based bulk testing of triangles, and likely even OBBs, to reduce FP tests. INT allows multiple times more RT intersections at iso-power and iso-area. There are still some FP evaluations for uncertain results, which scales the speedup down somewhat, but it should still deliver ludicrous gains for RT if implemented.
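For a sense of what INT-based bulk testing could look like, here's a hedged sketch (my own construction, not from the patents): boxes are conservatively quantized to an integer grid, a cheap integer overlap test culls in bulk, and only the survivors get the exact FP test.

```cpp
#include <cmath>
#include <cstdint>

// Box with coordinates quantized onto a uint16 grid. Quantization is
// conservative (min rounds down, max rounds up), so the integer box always
// encloses the FP box and the prefilter can never cull a true hit.
struct QBox  { uint16_t lo[3], hi[3]; };
struct FpBox { float    lo[3], hi[3]; };

// Assumes the box lies inside the grid spanned from `origin`.
QBox quantize(const FpBox& b, const float origin[3], float invCell) {
    QBox q;
    for (int a = 0; a < 3; ++a) {
        q.lo[a] = static_cast<uint16_t>(std::floor((b.lo[a] - origin[a]) * invCell));
        q.hi[a] = static_cast<uint16_t>(std::ceil ((b.hi[a] - origin[a]) * invCell));
    }
    return q;
}

// Integer-only overlap prefilter: int16 compares are far cheaper in area and
// power than FP slab tests, so hardware can afford many of these per cycle.
inline bool mayOverlap(const QBox& a, const QBox& b) {
    for (int i = 0; i < 3; ++i)
        if (a.hi[i] < b.lo[i] || a.lo[i] > b.hi[i]) return false;
    return true;  // "maybe": survivors still get the exact FP test
}
```

The asymmetry is the whole trick: a "no" from the INT pass is final, a "yes" is only a "maybe", and the fewer maybes there are, the closer you get to that iso-power/area multiplier.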

In the animation DGF post they mentioned loading an entire DGF block into LDS, probably with just one memory transaction. This includes animated geometry data and OMMs (not confirmed in the blog, but included in the linked patent), all neatly organized into one coherent cache-aligned package. From https://patents.google.com/patent/US20250131640A1:
"In an implementation, the intersection scheduler pipelines data decoding requests to the decompressor 500 for multiple rays coalesced against the same node. All the rays coalesced against the same node are sent first before switching to another node. The decompressor 500 switches to the next ray only after decoding the last batch test for the current ray. Further, the decompressor 500 switches to the next node only after decoding the last batch test of the last ray for current node it is processing."

No idea how DGF on RDNA 4 cards differs (SW vs HW): load the entire DGF block, decode it in parser storage (sounds like an RT cache), then do cache-coherent (fewer evictions and reloads) mass ray/tri testing. Sounds like another HW feature boosting perf.
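As a software analogy of that scheduling (function names and stubs are mine; this is not how the hardware pipeline is actually built), the coalescing from the patent quote might look like:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal stand-ins so the sketch compiles; real code would decode actual
// DGF blocks and run proper ray/triangle intersection tests.
struct Ray { float origin[3], dir[3], tMax; };
struct Tri { float v[3][3]; };
std::vector<Tri> decodeDgfBlock(uint32_t /*nodeId*/) { return {}; }  // stub
bool intersect(const Ray&, const Tri&) { return false; }             // stub

// Batch rays by the node they need, decode that node's DGF block once, drain
// every coalesced ray against it, and only then switch nodes, so the decoded
// block stays hot instead of being evicted and re-fetched per ray.
void coalescedIntersect(std::unordered_map<uint32_t, std::vector<Ray>>& byNode) {
    for (auto& [nodeId, rays] : byNode) {
        const std::vector<Tri> tris = decodeDgfBlock(nodeId);  // one decode per node
        for (const Ray& r : rays)   // all rays against the current node first
            for (const Tri& t : tris)
                intersect(r, t);
    }
}
```

Same data, same tests, just reordered so the expensive decode amortizes over every waiting ray.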
 

Mopetar

Diamond Member
Jan 31, 2011
He. Said. It.

The guy responsible for GPUs at AMD.

:mad:

Fine. Make him unsay it then.

Tomorrow. On news sites. He apologizes and says that the word UDNA was a slip of the tongue.

Maybe the code name for the next version of CDNA is UDNA just to confuse the ever loving hell out of everyone. I proposed Project Opposite Day once, just so that when it was terminated the following week no one was quite sure of the actual status of the project. It didn't sow quite as much confusion as Project Withajee, but for some odd reason I've been banned from coming up with project names.
 

MrMPFR

Member
Aug 9, 2025
Interesting: co-compute units linked to L3 to avoid thrashing the L1 cache (for memory-intensive loads such as RT)

This seems like it would need a lot more die area 🤔
Or is this just a renaming of the RT core to a more generic co-compute core? 🤔🤔

So it seems, but the patent specifies it can be any non-local cache, so they could be coupled to the Shader Engine private cache, L2, or MALL.

If the reduced L2 (AT2 = 24MB L2 vs Navi 48 = 64MB MALL) is accurate and the CCU is leveraged for RDNA 5, then coupling CCUs to L2 would shrink the L2 available to everything else, since they require a sizeable dedicated slice. Wonder how AMD engineers would tackle this. An SE cache implementation could happen as well; it would require a much bigger SE cache, but brings other benefits like superior cache latency, closer integration with the CUs (routing and latency), etc. The latter seems more likely given the whole supposedly (not confirmed IIRC) overhauled autonomous SE scheduling and dispatch (WGS and ADC). In that case, CCUs outside the SEs would complicate things a lot.
There's also a virtual CCU implementation. The patent doesn't clearly specify the difference, but it sounds like pooled CCU resources managed by a central scheduler instead of one private CCU for each CU. This directly conflicts with the WGP local launch patent, a major HW optimization for work graphs, so I still lean towards a non-virtual CU-CCU direct link.

Remember the CCU offloads work from the CU, so the overhead might be lower than it seems. No need to duplicate instructions, but yeah, still some overhead. How much, though?

Highly doubt that. But some BW-heavy RT instructions could be offloaded to CCUs.
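If it helps, here's a toy dispatch heuristic for the thrashing argument (entirely my own illustration; the patent describes no such policy): work whose streaming footprint would blow out the CU's L1 is exactly the work you'd route to a CCU sitting next to a big non-local cache.

```cpp
#include <cstddef>

enum class ExecTarget { CuLocal, CoComputeUnit };

constexpr std::size_t kL1Bytes = 32 * 1024;  // assumed per-CU L1 size

// Toy policy: small, private working sets stay on the CU; streaming sets that
// would thrash L1, or data shared across many waves (BVH nodes, weights),
// run on the co-compute unit attached to the larger shared cache.
ExecTarget pickTarget(std::size_t workingSetBytes, bool sharedAcrossWaves) {
    if (workingSetBytes <= kL1Bytes / 2 && !sharedAcrossWaves)
        return ExecTarget::CuLocal;
    return ExecTarget::CoComputeUnit;
}
```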
 

marees

Golden Member
Apr 28, 2024
So it seems, but the patent specifies it can be any non-local cache, so they could be coupled to the Shader Engine private cache, L2, or MALL.

If the reduced L2 (AT2 = 24MB L2 vs Navi 48 = 64MB MALL) is accurate and the CCU is leveraged for RDNA 5, then coupling CCUs to L2 would shrink the L2 available to everything else, since they require a sizeable dedicated slice. Wonder how AMD engineers would tackle this. An SE cache implementation could happen as well; it would require a much bigger SE cache, but brings other benefits like superior cache latency, closer integration with the CUs (routing and latency), etc. The latter seems more likely given the whole supposedly (not confirmed IIRC) overhauled autonomous SE scheduling and dispatch (WGS and ADC). In that case, CCUs outside the SEs would complicate things a lot.
There's also a virtual CCU implementation. The patent doesn't clearly specify the difference, but it sounds like pooled CCU resources managed by a central scheduler instead of one private CCU for each CU. This directly conflicts with the WGP local launch patent, a major HW optimization for work graphs, so I still lean towards a non-virtual CU-CCU direct link.

Remember the CCU offloads work from the CU, so the overhead might be lower than it seems. No need to duplicate instructions, but yeah, still some overhead. How much, though?

Highly doubt that. But some BW-heavy RT instructions could be offloaded to CCUs.
Is this CCU restricted only to RT cores, or can it do the matrix*vector multiplication of the tensor cores too??
 

MrMPFR

Member
Aug 9, 2025
Is this CCU restricted only to RT cores, or can it do the matrix*vector multiplication of the tensor cores too??
The patent is vague, so it could probably apply to whatever instructions AMD thinks they need to offload: ML, RT, and other cache-greedy instructions.

The biggest implications might not be for RDNA 5 but for MI500. Maybe offload GMV to the AID and take advantage of that massive MALL slab? Or perhaps a clever cache hierarchy bypass mode (not in the patent), allowing cores to bypass local registers and XCD cache and store everything in the AID MALL slab? But all this could be less important if HBM4 with PIM gets rolled out in 2027.
A superficial description of PIM for GPT workloads: https://www.servethehome.com/samsung-processing-in-memory-technology-at-hot-chips-2023/
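Assuming GMV here means general matrix*vector work (as in the tensor-core question above), a quick back-of-envelope shows why it's such a natural near-memory/PIM target. All figures below are illustrative placeholders:

```cpp
#include <cstdio>

// GEMV moves ~2 bytes (FP16) per matrix element and does only 2 flops with
// it (one multiply, one add), so it is purely bandwidth-bound: the compute
// units mostly wait on DRAM. Doing the MAC next to the memory (PIM) or in a
// big nearby SRAM slab attacks exactly that bottleneck.
int main() {
    const double bytes = 8192.0 * 8192.0 * 2.0;  // FP16 weight matrix, ~134 MB
    const double flops = 8192.0 * 8192.0 * 2.0;  // one MAC per element
    const double bw    = 3.0e12;                 // ~3 TB/s HBM-class bandwidth
    const double secs  = bytes / bw;             // runtime if purely BW-limited
    std::printf("arithmetic intensity: %.1f flop/byte\n", flops / bytes);
    std::printf("GEMV at 3 TB/s: %.1f us, i.e. %.1f GFLOP/s\n",
                secs * 1e6, flops / secs / 1e9);
}
```

At ~1 flop/byte, a GPU with hundreds of FP16 TFLOPS sustains only about 3 TFLOP/s on this, which is why pushing it toward the memory (AID MALL or HBM-PIM) is attractive.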
 
  • Like
Reactions: marees

adroc_thurston

Diamond Member
Jul 2, 2023
The biggest implications might not be for RDNA 5 but for MI500. Maybe offload GMV to the AID and take advantage of that massive MALL slab? Or perhaps a clever cache hierarchy bypass mode (not in the patent), allowing cores to bypass local registers and XCD cache and store everything in the AID MALL slab? But all this could be less important if HBM4 with PIM gets rolled out in 2027.
there is no MALL there
 

MrMPFR

Member
Aug 9, 2025
Hmm. So MI400 and MI500 also deprecate MALL?

Kepler doesn't seem certain here, but perhaps that only applies to true chiplet consumer GPUs, not DC.
For actual chiplet stuff (SED+AID+MID) I think so, but for the quasi-monolithic GPUs like ATx with just GMD+MID it's gone.
 

ToTTenTranz

Senior member
Feb 4, 2021
Yeah why not.

MALL's dead because SRAM is kinda expensive.
SRAM should start getting less expensive because N2 and lower finally show a substantial increase in transistor density for memory cells.

It seems to me that AMD looked at GDDR7's big bandwidth boost over GDDR6 (and LP6 over LP5X) and thought they could reduce the cache footprint on their GPU dies, probably with an increase in L2 that gets a better hit rate per MB than L3 (or maybe it's just a lot faster, which gets them higher effective bandwidth).
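The tradeoff is easy to sanity-check with the usual effective-bandwidth formula, hit_rate * cache_BW + (1 - hit_rate) * DRAM_BW. All numbers in this sketch are made-up placeholders, purely to show the shape of the argument:

```cpp
#include <cstdio>

// Effective bandwidth as the hit-rate-weighted mix of cache and DRAM speeds.
// A smaller but faster L2 can beat a bigger, slower MALL once the DRAM
// generation underneath it gets a large raw bandwidth bump.
int main() {
    auto effective = [](double hitRate, double cacheBw, double dramBw) {
        return hitRate * cacheBw + (1.0 - hitRate) * dramBw;  // TB/s
    };
    // Illustrative only: big MALL over GDDR6 vs smaller, faster L2 over GDDR7
    std::printf("64MB MALL + GDDR6: %.2f TB/s\n", effective(0.55, 2.0, 0.64));
    std::printf("24MB L2   + GDDR7: %.2f TB/s\n", effective(0.45, 4.0, 1.28));
}
```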
 

MrMPFR

Member
Aug 9, 2025
SRAM should start getting less expensive because N2 and lower finally show a substantial increase in transistor density for memory cells.
It's only roughly matching the pathetic ~20% logic scaling on N2, after SRAM was completely stagnant for one gen (N5->N3) and underwhelming the gen before (N7->N5). Things don't look better for the Angstrom-era nodes. A complete joke.
Given +50% wafer premiums, I really don't see any GPUs besides ultra-high-end and DC moving beyond N3 anytime soon. On-chip photonics and other paradigm shifts can't come soon enough.
 

soresu

Diamond Member
Dec 19, 2014
Maybe the code name for the next version of CDNA is UDNA just to confuse the ever loving hell out of everyone. I proposed Project Opposite Day once, just so that when it was terminated the following week no one was quite sure of the actual status of the project. It didn't sow quite as much confusion as Project Withajee, but for some odd reason I've been banned from coming up with project names.
Oh lol you delicious troll 🤣😂