Discussion RDNA 5 / UDNA (CDNA Next) speculation

ToTTenTranz · Sep 30, 2025

poke01 said:
you don't need to improve your NPU anymore if your GPU is very good for AI

This is honestly the best outcome, as long as they don’t take away the render backend units...

marees · Sep 30, 2025

MrMPFR said:
DGF + prefiltering related patents:

PRE-FILTERING NODES FOR BOUNDING VOLUME HIERARCHY,

From <https://www.patents-review.com/a/20250111586-pre-filtering-nodes-bounding-volume-hierarchy.html>

Intersection Testing on Dense Geometry Data using Triangle Prefiltering

From <https://www.patents-review.com/a/20250131640-intersection-testing-dense-geometry-data-triangle.html>

Dense Geometry Format

From <https://www.patents-review.com/a/20250131639-dense-geometry-format.html>

SIMPLIFIED LOW-PRECISION RAY INTERSECTION THROUGH ACCELERATED HIERARCHY STRUCTURE PRECOMPUTATION

From <https://www.patents-review.com/a/20...w-precision-ray-intersection-accelerated.html>

System and Method for Low-precision Ray Tests

From <https://www.patents-review.com/a/20250209723-system-method-low-precision-ray-tests.html>

DISCRETE ROTATIONS FOR ORIENTED BOUNDING BOXES BASED ON PLATONIC SOLIDS

From <https://www.patents-review.com/a/20...s-oriented-bounding-boxes-based-platonic.html>

AMD blog on DGF (from feb) — NANITE accelerator for RT use cases

A custom compression format can’t be consumed directly by the current raytracing API. Instead, we need to decode it into something which the APIs understand. This increases memory pressure and adds latency to the BVH build, both of which can lead to unstable, stuttery frame rates. Even if the API restrictions were lifted, the existing hardware acceleration structures are much too large to support future content, which will be authored with the lower data rates in mind.

Dense Geometry Format (DGF) is a block-based geometry compression technology developed by AMD, which will be directly supported by future GPU architectures. It aims to solve these problems by doing for geometry data what formats like DXT, ETC and ASTC have done for texture data.

Native GPU support for DGF will close the geometric complexity gap between raster and ray tracing

By moving geometry compression outside the driver, the compressor can process the data in ways that would either be too slow to perform at runtime or would violate the API specifications.

DGF is engineered to meet the needs of hardware by packing as many triangles as possible into a cache aligned structure. This enables a triangle to be retrieved using one memory transaction, which is an essential property for ray tracing, and also highly desirable for rasterization.

AMD Dense Geometry Compression Format SDK

AMD has released the Dense Geometry Compression Format (DGF) SDK to encourage developers to experiment with geometry compression, and provide a reference toolchain for integration into content pipelines.

If you have questions about the data format and encoding algorithms, or want more data points, you can also read our HPG 2024 paper.

Solving the Dense Geometry Problem - AMD GPUOpen

Discover how AMD's Dense Geometry Compression Format (DGF) revolutionizes graphics by compressing complex models for efficient real-time rendering, bridging the gap between rasterization and ray tracing.

gpuopen.com

Our AMD DGF SDK was recently updated with bug fixes, improvements, and new features. One of our most exciting new features is the addition of an animation-aware encoding pipeline. You can refer to the release notes for a full list of changes.

In this blog post, we take a look at animations and how they work with DGF.

Animating geometry with AMD DGF - AMD GPUOpen

The AMD DGF SDK has been updated with improvements and new features, including the addition of an animation-aware encoding pipeline.

gpuopen.com

MrMPFR · Sep 30, 2025

MrMPFR said:
Thought RX and MI wouldn't become more µarch aligned given Kepler's earlier statements on UDNA.

Nvm I didn't recall things correctly. Here's the explanation from Kepler:

Kepler_L2 said:
People misunderstood what "UDNA" meant. It's not a single architecture across gaming and datacenter, but a unification of the development pipeline.

CDNA1/2/3/4 have many architecture advancements that are not in RDNA2/3/4 because they are in a completely different architecture branch.

With "UDNA" strategy, development follows a gaming->datacenter->gaming pattern, where advancements from one type of architecture can be re-integrated into the next if they make sense, but the architectures are still different as they don't need to have the same features (i.e gaming doesn't need strong FP64 or extremely large matrix cores, datacenter doesn't need RT/Texture/Geometry/Raster features).

igor_kavinski · Sep 30, 2025

adroc_thurston said:
What's UDNA?

I wonder how long you will try to act dumb and keep asking this question over and over again

AMD was the first one to talk about it. No one else made it up.

AMD announces unified UDNA GPU architecture — bringing RDNA and CDNA together to take on Nvidia's CUDA ecosystem

Two become one.

www.tomshardware.com

adroc_thurston · Sep 30, 2025

igor_kavinski said:
I wonder how long you will try to act dumb and keep asking this question over and over again

AMD was the first one to talk about it. No one else made it up.

AMD announces unified UDNA GPU architecture — bringing RDNA and CDNA together to take on Nvidia's CUDA ecosystem

Two become one.

www.tomshardware.com

View attachment 131270

nothing called 'UDNA' exists.

igor_kavinski · Sep 30, 2025

adroc_thurston said:
nothing called 'UDNA' exists.

He. Said. It.

The guy responsible for GPUs at AMD.

Fine. Make him unsay it then.

Tomorrow. On news sites. He apologizes and says that the word UDNA was a slip of the tongue.

MrMPFR · Sep 30, 2025

marees said:
AMD blog on DGF (from feb) — NANITE accelerator for RT use cases]

DGF isn't Nanite it's really just AMD's take NVIDIA's now deprecated Displaced Micro-meshes introduced with Ada Lovelace, albeit with fewer drawbacks and some extra functionality, for example an OMM header (see patent) and I don't recall support for animation in DMM either.
It's a great fit for a clustered BLAS architecture, but that doesn't make it a Nanite accelerator. It's just a geometry compression format. You're gonna need a complete BVH management overhaul for it to actually work with Nanite like NVIDIA's RTX Mega Geometry. AMD still has a long way to go

IDK about the DMM patents Kepler posted ~2 months ago, but perhaps that could be leverage for faster ray/tri evaluation for Catmull Clark subdivisions. An important component considering how heavily RTX Mega Geometry relies on tesselation and subdivisions (read the NVPro-samples github documentation), so any AMD implementation is likely to heavily on this to reduce rebuilds.

Before I go out on a tangent have to preface that this is just speculation and suggestions, so don't take it seriously until we have actual confirmation. Lots of details omitted as well given all the proposed changes from the six linked patents.
Can see you linked the related patents. Lots of unaswered questions remain regarding those DGF and prefiltering patents. For example is the stuff beyond DGF decompression in HW actually in RDNA 5 or just speculation at this point? If true then precomputations in preparation for Quantized BVH + load data for prefiltering = INT based bulk testing of triangles and likely even OBBs to reduce FP tests. INT allows multiple times more RT intersections at iso-power and area. Still some FP evaluations for uncertain results which scales down the speedup somewhat, but should still deliver ludicrous gains to RT if implemented.

In the Animation DGF post they mentioned loading an entire DGF block into LDS, probably with just one memory transaction. This includes animating geo data and OMM, not confirmed in blog but included in linked patent, all neatly organized into one coherent cache aligned package. From here https://patents.google.com/patent/US20250131640A1:
"In an implementation, the intersection scheduler pipelines data decoding requests to the decompressor 500 for multiple rays coalesced against the same node. All the rays coalesced against the same node are sent first before switching to another node. The decompressor 500 switches to the next ray only after decoding the last batch test for the current ray. Further, the decompressor 500 switches to the next node only after decoding the last batch test of the last ray for current node it is processing."

No idea how DGF on RDNA 4 cards differs (SW vs HW): Load entire DGF block, decode in parser storage (sounds like RT cache) then do cachemem coherent (less evictions and reloads) mass ray/tri testing. Sounds like another HW feature boosting perf.

Mopetar · Sep 30, 2025

igor_kavinski said:
He. Said. It.

The guy responsible for GPUs at AMD.

Fine. Make him unsay it then.

Tomorrow. On news sites. He apologizes and says that the word UDNA was a slip of the tongue.

Maybe the code name for the next version of CDNA is UDNA just to confuse the ever loving hell out of everyone. I proposed Project Opposite Day once, just so that when it was terminated the following week no one was quite sure of the actual status of the project. It didn't sow quite as much confusion as Project Withajee, but for some odd reason I've been banned from coming up with project names.

MrMPFR · Sep 30, 2025

marees said:
Interesting: co-compute linked to L3 to avoid cache-thrashing of L1 cache (for memory intensive loads such as RT)

This seems like would need lots of more die area 🤔
Or is this just renaming of RT core to a more generic co-compute core 🤔 🤔

So it seems, but patent specifies it can be any non-local cache, so they could be coupled to Shader Engine private cache, L2, or MALL.

If the reduced L2 (AT2 = 24MB L2 vs Navi 48 = 64MB MALL) is accurate and CCU is leveraged for RDNA 5 then those being coupled to L2 would shrink L2 for other processes since they require sizeable dedicated. Wonder how AMD engineers would tackle this. A SE cache implementation could happen as well, but it would require a much bigger SE cache, but some other benefits like superior cache latency, closer integration with CUs (routing and latency) etc... Latter seems more likely given the entire thing about the supposedly (not confirmed IIRC) overhaul with autonomous SE scheduling and dispatch (WGS and ADC). In this case CCUs outside SEs would complicate things alot.
There's also a virtual CCU implementation. The patent doesn't clearly specify the difference but sounds like pooled CCU ressources managed by a central scheduler instead of one private CCU for each CU. This one directly conflicts with the WGP Local launch patent, a major HW optimization for workgraps, so still lean towards a non virtual CU-CCU direct link.

Remember CCU offloads work from CU, so overhead might be lower than what it seems. No need to duplicate instructions, but yeah still some overhead, but how much?

Highly doubt that. But some BW heavy RT instructions could be offloaded to CCUs.

marees · Sep 30, 2025

MrMPFR said:
So it seems, but patent specifies it can be any non-local cache, so they could be coupled to Shader Engine private cache, L2, or MALL.

If the reduced L2 (AT2 = 24MB L2 vs Navi 48 = 64MB MALL) is accurate and CCU is leveraged for RDNA 5 then those being coupled to L2 would shrink L2 for other processes since they require sizeable dedicated. Wonder how AMD engineers would tackle this. A SE cache implementation could happen as well, but it would require a much bigger SE cache, but some other benefits like superior cache latency, closer integration with CUs (routing and latency) etc... Latter seems more likely given the entire thing about the supposedly (not confirmed IIRC) overhaul with autonomous SE scheduling and dispatch (WGS and ADC). In this case CCUs outside SEs would complicate things alot.
There's also a virtual CCU implementation. The patent doesn't clearly specify the difference but sounds like pooled CCU ressources managed by a central scheduler instead of one private CCU for each CU. This one directly conflicts with the WGP Local launch patent, a major HW optimization for workgraps, so still lean towards a non virtual CU-CCU direct link.

Remember CCU offloads work from CU, so overhead might be lower than what it seems. No need to duplicate instructions, but yeah still some overhead, but how much?

Highly doubt that. But some BW heavy RT instructions could be offloaded to CCUs.

Is this CCU restricted only to RT cores or can it do the matrix*vector multiplication of the tensor cores too ??

itsmydamnation · Sep 30, 2025

igor_kavinski said:
He. Said. It.

The guy responsible for GPUs at AMD.

Fine. Make him unsay it then.

Tomorrow. On news sites. He apologizes and says that the word UDNA was a slip of the tongue.

You see that in the same interview he calls udna a strategy when asked about merging of cnda and rnda.

So make him unsay what exactly.

MrMPFR · Sep 30, 2025

marees said:
Is this CCU restricted only to RT cores or can it do the matrix*vector multiplication of the tensor cores too ??

The patent is vague so it could probably apply to whatever instructions AMD thinks they need to offload: ML, RT and other cache greedy instructions.

Biggest implications might not be for RDNA 5 but actually MI500. Maybe offload GMV to AID and take advantage of that massive MALL slab? Or perhaps a clever cache hierarchy bypass mode (not in patent), allowing cores to bypass local registers and XCD cache and store everything in AID MALL slab? But all this could be less important if HBM4 with PIM gets rolled out in 2027.
A superficial description of PIM for GPT workoads: https://www.servethehome.com/samsung-processing-in-memory-technology-at-hot-chips-2023/

adroc_thurston · Sep 30, 2025

MrMPFR said:
Biggest implications might not be for RDNA 5 but actually MI500. Maybe offload GMV to AID and take advantage of that massive MALL slab? Or perhaps a clever cache hierarchy bypass mode (not in patent), allowing cores to bypass local registers and XCD cache and store everything in AID MALL slab? But all this could be less important if HBM4 with PIM gets rolled out in 2027.

there is no MALL there

MrMPFR · Sep 30, 2025

Hmm. So MI400 and MI500 also deprecates MALL?

Kepler doesn't seem certain here, but perhaps that only applies to true chiplet consumer GPUs not DC.

Kepler_L2 said:
For actual chiplet stuff (SED+AID+MID) I think so, but for the quasi-monolithic GPUs like ATx with just GMD+MID it's gone.

adroc_thurston · Sep 30, 2025

MrMPFR said:
So MI400 and MI500 also deprecates MALL?

Yeah why not.

MrMPFR said:
Kepler doesn't seem certain here, but perhaps that only applies to true chiplet consumer GPUs not DC.

MALL's dead because SRAM is kinda expensive.

ToTTenTranz · Sep 30, 2025

adroc_thurston said:
Yeah why not.

MALL's dead because SRAM is kinda expensive.

SRAM should start getting less expensive because N2 and lower finally show a substantial increase in transistor density for memory cells.

It seems to me that AMD looked at GDDR7's big bandwidth boost over GDDR6 (and LP6 over LP5X) and thought they could reduce the cache footprint on their GPU dies, probably with an increase in L2 that gets a better hitrate-per-MB than L3 (or maybe it's just a lot faster which gets them higher .effective bandwidth).

MrMPFR · Sep 30, 2025

ToTTenTranz said:
SRAM should start getting less expensive because N2 and lower finally show a substantial increase in transistor density for memory cells.

It's only roughly matching the pathetic 20% logic scaling with N2 after being completely stagnant for one gen (N5->N3) and underwhelming for another gen (N7->N5). Things don't look better for Angstrom era nodes. Complete joke.
Given +50% wafer premiums I really don't see any GPUs besides ultra high end and DC moving beyond N3 anytime soon. On-chip photonics and other paradigm shifts can't come soon enough.

soresu · Sep 30, 2025

Mopetar said:
Maybe the code name for the next version of CDNA is UDNA just to confuse the ever loving hell out of everyone. I proposed Project Opposite Day once, just so that when it was terminated the following week no one was quite sure of the actual status of the project. It didn't sow quite as much confusion as Project Withajee, but for some odd reason I've been banned from coming up with project names.

Oh lol you delicious troll 🤣😂

RnR_au · Oct 1, 2025

itsmydamnation said:
You see that in the same interview he calls udna a strategy when asked about merging of cnda and rnda.

So make him unsay what exactly.

And I think that interview is the only place where AMD have uttered the word 'UDNA'. It wasn't mentioned in a later finance day... they just talked about CDNA-Next.

Joe NYC · Oct 1, 2025

adroc_thurston said:
Yeah why not.

In Mi400 as well

adroc_thurston said:
MALL's dead because SRAM is kinda expensive.

Are they still doing the stacking? MALL must have been a small percentage of the die...

Win2012R2 · Oct 1, 2025

ToTTenTranz said:
SRAM should start getting less expensive because N2 and lower finally show a substantial increase in transistor density for memory cells.

Offset by higher wafer costs

adroc_thurston · Oct 1, 2025

Joe NYC said:
Are they still doing the stacking? MALL must have been a small percentage of the die...

Their cache hierarchy is just different now.

ToTTenTranz · Oct 1, 2025

Win2012R2 said:
Offset by higher wafer costs

Not if you go with Samsung's foundries.

adroc_thurston · Oct 1, 2025

ToTTenTranz said:
Not if you go with Samsung's foundries.

Yes I too love sub-3% yield on >300mm^2 die.

igor_kavinski · Oct 2, 2025

That just shows me that Nvidia has more to lose and AMD more to gain when the tides turn.

Discussion RDNA 5 / UDNA (CDNA Next) speculation

Senior member

Platinum Member

AMD Dense Geometry Compression Format SDK​

Member

Lifer

Diamond Member

Lifer

Member

Diamond Member

Member

Platinum Member

Diamond Member

Member

Diamond Member

Member

Diamond Member

Senior member

Member

Diamond Member

Platinum Member

Diamond Member

Golden Member

Diamond Member

Senior member

Diamond Member

Lifer

AMD Dense Geometry Compression Format SDK