Discussion RDNA 5 / UDNA (CDNA Next) speculation

MrMPFR · Oct 18, 2025

Tuna-Fish said:
workloads that are doing graphics as compute

You mean compute shaders and mesh shaders?

basix said:
Would be interesting to see, yes. All new console generations resulted in new optimizations and paradigm shifts. Such a shared L1$ with Work Graphs and Neural Rendering might just be the next thing.

No doubt more changes are coming. Work Graphs encompasses so many things (foundational API change). But outside of limited first party lineup truly nextgen is AFTER crossgen unfortunately.
Anyone want to guess how long crossgen will last nextgen?

adroc_thurston said:
Crossgen periods kill whatever you think gonna happen.

MLP based neural rendering just builds on top of RT/PT crossgen pipeline but yeah foundational changes enabled by work graphs not coming anytime soon unfortunately. Really the same thing with MLPs in a lot of instances I fear unless is plug and play SDKs like FSR or sponsored games.

Tuna-Fish · Oct 18, 2025

MrMPFR said:
You mean compute shaders and mesh shaders?

To be clear, not all such implementations in general, you have to choose to make use of it. But if you do anything complicated to a large dataset in compute shaders, there's probably some optimizations you can apply using dsmem.

basix · Oct 22, 2025

adroc_thurston said:
Crossgen periods kill whatever you think gonna happen.

Sure, crossgen will make transitions slow. But having it in place with PS6 might bring wonders with the next-next crossgen (PS6 and PS7), where neural rendering and work graphs are not the new shit but available with wide hardware and software support.

PS5 and PS6 crossgen will bring:
- Fast SSD loading and streaming as basis
- Virtualized geometry everywhere (nanite, meshlets)
- RT/PT graphics everywhere
- Temporal Upscalers & Frame Generation (worse quality on PS5)
- Some lightweight DNN acceleration and neural rendering stuff which can run on PS5 as well (animations, NPC behavior, physics, audio, ...)

Even without extensively using neural rendering and work graphs, that are already many neat things. Fast loading, seamless streaming and immersion enhancers (animations etc.) are more important to me than a little bit better raytracing or textures.
Neural Rendering and Work Graphs are potentially backwards compatibility breaking, depending on their usage and developer workflows (e.g. neural textures might not be the first thing we will see on PS6)

MrMPFR · Oct 23, 2025

basix said:
RT/PT graphics everywhere

Yeah and if a priority devs can integrate neural rendering MLPs alongside GATE across PS5 Pro and PS6 on top of RT/PT.

basix said:
(e.g. neural textures might not be the first thing we will see on PS6)

NTC Inference on load is totally doable on PS5. 6-7X storage GB and IO GB/s multiplier for textures = big incentive for mass adoption.
But NTC on sample is prohibitively expensive and ~~rn NTC destroys fine detail:~~ One of the NTC devs (you can find him in the YT comment section) confirmed the reference material lacks specular detail in vid. The NTC material is how it's supposed to look. If there's an issue it's due to STF, not NTC.

I remember reading about some NVIDIA research papers earlier this year for improved texture filtering (STF) for NTC. Don't know if they included in the demo ~~and don't think it'll be enough to solve artifacting~~. Ignore

NTC needs more time in the oven. Yes but for different reasons.

basix said:
Neural Rendering and Work Graphs are potentially backwards compatibility breaking, depending on their usage and developer workflows (e.g. neural textures might not be the first thing we will see on PS6)

Unless devs move on to NeRFs (rn only static content) and diffusion based rendering they will always require an RT/PT foundation. Devs can leverage that to prolong crossgen and maximize adressable market.

Work graphs is more tricky. MS said new and old version can be used alongside each other. Likely repeat of DX11 + DX12 modes in the early days of DX12. Optimizations could be group material shaders by node, recursion for multi-bounce GI etc... Build a game around it = PS6 only.

basix · Oct 24, 2025

MrMPFR said:
NTC Inference on load is totally doable on PS5. 6-7X storage GB and IO GB/s multiplier for textures = big incentive for mass adoption.
But NTC on sample is prohibitively expensive and rn NTC destroys fine detail:

Well, NTC inference on sample adds +1ms on a 4060 in 1080p. Otherwise (NTC inference on load) you get by far less compression ratio (lower than standard block compression with 4x or 8x, but increased quality nevertheless).

MrMPFR · Oct 24, 2025

basix said:
Well, NTC inference on sample adds +1ms on a 4060 in 1080p. Otherwise (NTC inference on load) you get by far less compression ratio (lower than standard block compression with 4x or 8x, but increased quality nevertheless).

~~I didn't know that compression ratio was inferior. Makes tech completely irrelevant for now.~~ Ignore NTC is transcoded to BCn at runtime when using on load. Compression ratio is identical.

Was going to write a reply but there's too many unknowns about NTC. Compression ratio vs BCn including whether or not to believe NVIDIA's claims at NTC github page. Is there support for tiled textures and culling to limit VRAM usage and inference. It seems like it's either a full inference on sample or inference on load by decoding small texture tiles based on whether they're occluded or not using sampler feedback. Tech is still in beta but how far is it from being production ready. Damn what an impressive word salad. Again ignore.

Should prob ignore until it's production ready. I just hope tech is ready by nextgen and that additional SW side optimizations and the ML perf uplift of RDNA 5 and 60 series will make the ms overhead insignificant.

soresu · Oct 24, 2025

MrMPFR said:
I remember reading about some NVIDIA research papers earlier this year for improved texture filtering (STF) for NTC

I think you mean this.

Improved Stochastic Texture Filtering Through Sample Reuse

Abstract:

Stochastic texture filtering (STF) has re-emerged as a technique that can bring down the cost of texture filtering of advanced texture compression methods, e.g., neural texture compression. However, during texture magnification, the swapped order of filtering and shading with STF can result in aliasing. The inability to smoothly interpolate material properties stored in textures, such as surface normals, leads to potentially undesirable appearance changes. We present a novel method to improve the quality of stochastically-filtered magnified textures and reduce the image difference compared to traditional texture filtering. When textures are magnified, nearby pixels filter similar sets of texels and we introduce techniques for sharing texel values among pixels with only a small increase in cost (0.04--0.14~ms per frame). We propose an improvement to weighted importance sampling that guarantees that our method never increases error beyond single-sample stochastic texture filtering. Under high magnification, our method has >10 dB higher PSNR than single-sample STF. Our results show greatly improved image quality both with and without spatiotemporal denoising.

basix · Oct 25, 2025

MrMPFR said:
I didn't know that compression ratio was inferior. Makes tech completely irrelevant for now.

It is difficult to assess. Yes, it takes more memory. But seems to be of higher quality. You could then reduce texture resolution but in the end "on load" is a performance focused compromise. It would at least allow to use the same production pipeline for textures for systems with different capabilities. You do not want to have two different texture workflows.

There was another NTC related denoising paper in June 2025, which showed very good results:

https://research.nvidia.com/labs/rtr/publication/akeninemoller2025collaborative/collaborative_texfilt.pdf

soresu · Oct 25, 2025

basix said:
There was another NTC related denoising paper in June 2025

This doesn't appear to be NTC specific, it seems more about filtering after decompression of the texture*, as it also covers block based DCT decompression.

*specifically under magnification - close up shots?

MrMPFR · Oct 25, 2025

soresu said:
I think you mean this.

Yes and the CTF paper shared by @basix.

basix said:
It is difficult to assess. Yes, it takes more memory. But seems to be of higher quality. You could then reduce texture resolution but in the end "on load" is a performance focused compromise. It would at least allow to use the same production pipeline for textures for systems with different capabilities. You do not want to have two different texture workflows.

The NTC github page lists on load = BCn in VRAM footprint but IDK. Asked the NTC dev about it so we'll see.

Might be possible. The Intel NTBC HPG paper method def looks much better, but that is also ahead of NTC implementation, well at least for now.

Indeed. For DX12U compliant systems with on feedback you can get some of the benefits of on sample without the downsides. Only decompress what hasn't been culled = big VRAM savings, just not as massive as on sample. PS5 doesn't support it but can still benefit from smaller game file sizes and less IO traffic (even faster loading).

soresu said:
This doesn't appear to be NTC specific, it seems more about filtering after decompression of the texture*, as it also covers block based DCT decompression.

*specifically under magnification - close up shots?

It is. As per latest RTXTF version log earlier wave comm. STF pipeline joined by the new CTF pipeline with box sampling + the one-tap STF as fallback. RTXTF is also mentioned on the RTXNTC Github page.

The HPG talk here also mentions using NTC alongside CTF and explains how it works.

soresu · Oct 25, 2025

MrMPFR said:
It is. As per latest RTXTF version log earlier wave comm. STF pipeline joined by the new CTF pipeline with box sampling + the one-tap STF as fallback. RTXTF is also mentioned on the RTXNTC Github page.
You can also watch the HPG talk here which explains how it works.

I swear to god if nVidia name one more damn thing RTXblahblah I'm not going to buy their stuff for another 16 years 🤣😂

It adds nothing and just contributes to confusion between gfx hw and sw.

basix · Oct 25, 2025

MrMPFR said:
The NTC github page lists on load = BCn in VRAM footprint but IDK. Asked the NTC dev about it so we'll see.

Ah it seems disk size with "on load" is the same as with regular NTC, but VRAM requirements are on par with BCn. Didn't know that. Maybe the effective BCn compression ratio is <4x because of metadata or whatever.

"on load" is still a net-win then. But your 8GB card will not benefit from that.

adroc_thurston · Oct 25, 2025

basix said:
"on load" is still a net-win then. But your 8GB card will not benefit from that.

Ergo pointless.
NAND is cheap. DRAM ain't.

basix · Oct 25, 2025

Tell that to 200 GByte downloads to disk

adroc_thurston · Oct 25, 2025

basix said:
Tell that to 200 GByte downloads to disk

NAND is cheap.
DRAM isn't.

basix · Oct 26, 2025

Nice, re-iterate your stuff. Helps fruitful discussions. Not.

NAND might be much cheaper, nobody is denying that. But nevertheless you waste money on stuff you do not need, when you can compress to a higher degree. Buying a smaller SSD saves money for everyone, you included.
Why did PS5 in Europe go back this fall from 1000GB to 825GB? Because NAND is a$$-cheap or because it still costs some $$$?

And then the point I made about production pipelines across platforms. "On-load" is a fallback mechanism for platforms which can't execute "on-sample" fast enough. But you still use NTC in your game and engine.
And regarding the 8GB comment: At least Nvidia GPUs since Lovelace should be able to manage "on-sample". So I would love to see NTC as soon as possible for those card owners (4060 etc.). "On-load" is for older GPUs or potentially current-gen consoles.
And not everybody has super fast internet connections. Reducing the package from let's say 200GB to 50GB would be welcomed by many people.

Then OK, re-iterate your "on-load scheme is usless because NAND is cheap". But that is simply wrong.

adroc_thurston · Oct 26, 2025

basix said:
Nice, re-iterate your stuff. Helps fruitful discussions. Not.

My point is absolute and really trivial to understand.

basix said:
Then OK, re-iterate your "on-load scheme is usless because NAND is cheap". But it is simply wrong.

NAND is cheap. Compute cycles on GPU are not. DRAM is definitely not.
If you're doing frametime-expensive texture compression, it has to be actually worth it in terms of DRAM footprint to matter.
Like CoD installs aren't 200gigs for kicks, they're all uncompressed assets to free up precious-precious CPU cycles.

basix · Oct 26, 2025

Much better this time

It is easy to understand, yes. But you do not need to repeat it, I can manage to understand it the first time you say it. Giving some additional context around it helps much more

The frametime-expensive part is a valid point. But how expensive is it actually? 1ms on a 4060 in 1080p or tripled forward pass time. Intel shows roughly doubled forwared pass time compared to BCn. Either spend that time or get crippled by 8GB? I know what I would choose.
Compute cycles on the GPU are quasi-free for now, because Tensor Cores get barely utilized today. So using them, when your GPU features them anyways, reduces dark silicon. Yes, frametime gets worse. But I believe that the performance of NTC will get improved in the future to close the biggest part of the performance gap to BCn.

For platforms without enough processing power to do NTC with "on-sample" you get something else: Compared to BCn no performance regression with "on-load". So no drawback in that regard (but also no benefit regarding VRAM). But you still get the benefit of reduced disk storage space. So why oppose that scheme, when you do not have any performance or quality drawbacks with "on-load" compared to BCn?

adroc_thurston · Oct 26, 2025

basix said:
The frametime-expensive part is a valid point. But how expensive is it actually? 1ms on a 4060 in 1080p or tripled forward pass time. Intel shows roughly doubled forwared pass time compared to BCn. Either spend that time or get crippled by 8GB? I know what I would choose.

Consoles won't be crippled by the 8G fbuf limit and they're the baseline h/w target.

basix said:
Compute cycles on the GPU are quasi-free for now, because Tensor Cores get barely utilized today

GEMM engines shred your VRF b/w unless you're DC blackwell and have TMEM to feed them.

basix · Oct 26, 2025

adroc_thurston said:
Consoles won't be crippled by the 8G fbuf limit and they're the baseline h/w target.

Sure. But they won't do "on-sample" NTC, therefore use more VRAM for textures (roughly same as today) but not suffering from a performance hit. And still benefitting from reduced data size on disk. Isn't that a win for consoles like PS5? Even if it might be just a small win?

adroc_thurston said:
GEMM engines shred your VRF b/w unless you're DC blackwell and have TMEM to feed them.

"Free" was not meant regarding performance, but from a HW cost perspective. Tensor Cores are already baked into the GPU. And because they are barely utilized today, you do not eat into other Tensor Core use-cases (which means you do not to have to add additional Tensor Cores for NTC). Therefore, adding a new NTC use-case to existing HW units makes it "quasi-free" regarding HW cost.
So you can spend some compute- and frametime on GPUs like 4060 to mitigate crippling effects of the small 8GB framebuffer. Extends the lifetime of such cards at the cost of somewhat reduced framerates (which might get recovered with a higher upsampling ratio from DLSS etc.).

Not ideal, but still a win in my books.

soresu · Oct 26, 2025

basix said:
Sure. But they won't do "on-sample" NTC, therefore use more VRAM for textures (roughly same as today) but not suffering from a performance hit. And still benefitting from reduced data size on disk. Isn't that a win for consoles like PS5? Even if it might be just a small win?

"Free" was not meant regarding performance, but from a HW cost perspective. Tensor Cores are already baked into the GPU. And because they are barely utilized today, you do not eat into other Tensor Core use-cases (which means you do not to have to add additional Tensor Cores for NTC). Therefore, adding a new NTC use-case to existing HW units makes it "quasi-free" regarding HW cost.
So you can spend some compute- and frametime on GPUs like 4060 to mitigate crippling effects of the small 8GB framebuffer. Extends the lifetime of such cards at the cost of somewhat reduced framerates (which might get recovered with a higher upsampling ratio from DLSS etc.).

Not ideal, but still a win in my books.

What confuses the hell out of me is why VRAM seems so limited in capacity compared to main system RAM.

16 GB is less than a single DIMM even in the consumer space for more than half a decade.

My own AM4 set up has 4x32 GB DIMMs, so frankly I'm just at a loss to explain what is so different about gfx that they can't adopt a similar strategy like this (as yet unverified) Bolt Graphics claims to be doing with SODIMMs.

jpiniero · Oct 26, 2025

soresu said:
What confuses the hell out of me is why VRAM seems so limited in capacity compared to main system RAM.

16 GB is less than a single DIMM even in the consumer space for more than half a decade.

GDDR capacity increases in general has moved pretty slowly.

soresu · Oct 26, 2025

jpiniero said:
GDDR capacity increases in general has moved pretty slowly.

Certainly by comparison to consumer/client system memory ye.

When you compare to server RAM it just becomes a joke.

Tuna-Fish · Oct 26, 2025

Partly, the higher-throughput interface takes up more die space on the chip, which means that for the ideal-sized die you can fit less caps and so you have less capacity.

Partly that is about what the GPU manufacturers demand from the memory makers. If they wanted, they could always ask for narrower bitwidth chips, which let them use more of them for a given interface width. Channels on GDDR7 are 8-bit wide, with a typical chip supporting 4 of them for a total width of 32 bits. If you used 24Gbit chips but implemented them with one channel per chip, you could use up to 48GB of ram on a 128-bit card.

The memory would of course cost more, unless you were NVidia because they are the only company with sufficient demand to justify such a boutique part.

soresu · Oct 26, 2025

Tuna-Fish said:
Partly, the higher-throughput interface takes up more die space on the chip, which means that for the ideal-sized die you can fit less caps and so you have less capacity.

Caps = DRAM cell capacitors?

So hypothetically future capacitor-less DRAM could be a big boon for VRAM?

Discussion RDNA 5 / UDNA (CDNA Next) speculation

Member

Golden Member

Senior member

Member

Senior member

Member

Diamond Member

Improved Stochastic Texture Filtering Through Sample Reuse​

Senior member

Diamond Member

Member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Lifer

Diamond Member

Golden Member

Diamond Member

Improved Stochastic Texture Filtering Through Sample Reuse