Well, maybe more memory bandwidth simply isn't required:
- Revamped CUs and their respective low-level caches (bigger capacity)
- Out-of-order execution (increase hardware utilization of ALUs and cache)
- Maybe L0 cache sharing across multiple CUs (reduce wasted SRAM capacity, reduce LLC & DRAM bandwidth requirements)
- Universal compression (smaller memory footprint, reduce bandwidth requirements)
- DGF & DMM (smaller memory footprint, reduce bandwidth requirements)
- Neural techniques like NTC, which aim to reduce data fetching from DRAM and instead spend more compute on the matrix engines (whose performance mostly relies on the CUs' low-level caches) to generate or extract data and information
- Work graphs and procedural algorithms with dynamic execution at the CU level (reduces code footprints and bandwidth pressure on higher-level caches and DRAM)
All those things aim to maximize usage of low level CU resources, increase data locality and reduce load on higher level structures like LLC and DRAM.
It seems that there is much going on regarding rethinking GPU architecture as a whole.
This is a great summary but perhaps AMD wants to go even further than this. It really depends on just how clean slate GFX13 is.
#1: Maybe instead of bigger caches, a universal M3-esque flexible cache to maximize compute/area and minimize cachemem overhead.
#2: Hopefully without massive area overhead. The GhOST paper indicated this is plausible.
#3: Yeah, like the 2020 AMD paper. Flexible clustering and global/private management based on compiler and other hints. Combined with dataflow execution this could be a game-changer for ML. Much less pressure on L2 and VRAM. Maybe it could be expanded to other WGP caches and register files. Perhaps the WGP VGPR takeover mode Cerny talked about during the Road to PS5 Pro talk could be extended across multiple WGPs.
#5: This is probably a compute-for-cache tradeoff, but yeah, a considerable benefit to on-die cachemem usage.
#7: Hopefully well beyond that.
8 days ago NVIDIA published this research paper, suggesting it's possible to basically bolt dataflow execution onto existing architectures with only modest adjustments, although it is still far from feature complete (see section 7). Despite this, sizeable speedups and reductions in VRAM BW traffic were achieved for inference and training. And interestingly, section 8 clearly outlines how Kitsune leverages tile programming, mirroring recent moves with CUDA Tile, and also how it's generally far more applicable than Work Graphs, which are mostly limited to shader pipelines.
I'm bringing this up because AMD is already exploring a dataflow API paradigm shift with Work Graphs, so why not go all the way and implement sweeping changes on the HW and compiler side to fundamentally change how workloads are managed on GPUs. While Work Graphs might be a push for next-gen graphics API standardization, even with the impressive patent-derived optimizations I doubt they give anywhere close to the full picture of what GFX13 and later could be capable of in terms of compute and ML perf. They would probably need a brand new API and a clean-slate compiler to fully tap into this.
Some other considerations to reduce memory and cache pressure (far from complete):
Summarizing prev info:
- Decentralized and locally autonomous distributed scheduling and dispatch (less pressure on higher level caches)
- Mapping data accesses to exploit ^ (^)
- Leaf nodes (ray/tri): Prefiltering and DGF nodes = parallel fixed-point testers (increased intersections/kB of cache)
- Payload sorting (^^^)
- Deferred any-hit shaders (increased cache and ALU utilization)
New:
- TLAS and upper BLAS (ray/box): sorting rays into coherent bundles to be tested together to reduce redundant calculations (less cachemem overhead; rough sketch of the idea below)
- Sophisticated lookup tables to reuse expensive (transcendental) math and more general vector calculations (^; toy LUT sketch below)
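On the ray-bundling point, here's a minimal CPU-side C++ sketch of what I mean, purely illustrative (the names Ray, bundle_key, sort_into_bundles and the cell_size parameter are made up, and real RT hardware/drivers would obviously do this very differently): bucket rays by direction octant plus a coarse origin cell so rays that traverse roughly the same TLAS/upper-BLAS boxes get tested back to back and share node fetches.

```cpp
// Toy sketch: bin rays into coherent bundles by direction octant + quantized origin.
// Hypothetical host-side C++ for illustration only; all names are made up.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz; float dx, dy, dz; };

// Key = 3-bit direction octant in the high bits, coarse (wrapping) origin cell below.
// Rays sharing a key tend to hit the same TLAS/upper-BLAS boxes, so one fetched
// node services many rays (less cache thrash, fewer redundant box tests).
static uint32_t bundle_key(const Ray& r, float cell_size) {
    uint32_t octant = (r.dx < 0 ? 1u : 0u) | (r.dy < 0 ? 2u : 0u) | (r.dz < 0 ? 4u : 0u);
    auto cell = [cell_size](float v) {
        return static_cast<uint32_t>(static_cast<int32_t>(std::floor(v / cell_size)) & 0x1FF);
    };
    return (octant << 27) | (cell(r.ox) << 18) | (cell(r.oy) << 9) | cell(r.oz);
}

// Sort ray indices so that consecutive rays form coherent bundles.
std::vector<uint32_t> sort_into_bundles(const std::vector<Ray>& rays, float cell_size) {
    std::vector<uint32_t> order(rays.size());
    for (uint32_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return bundle_key(rays[a], cell_size) < bundle_key(rays[b], cell_size);
    });
    return order; // traverse rays in this order, e.g. a wave-sized bundle at a time
}
```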
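And on the lookup-table point, an equally hand-wavy sketch: precompute a small table for an expensive function (sin as a stand-in) and reuse it via linear interpolation, trading a few KB of on-die SRAM/cache for repeated transcendental evaluations. The SinLUT name, table size, range reduction and choice of function are arbitrary here; a real HW/driver table would pick all of these differently.

```cpp
// Toy sketch: small lookup table + linear interpolation standing in for a
// transcendental (sin over [0, 2*pi)). Illustration only.
#include <cmath>
#include <cstddef>
#include <vector>

class SinLUT {
public:
    explicit SinLUT(std::size_t entries = 1024) : table_(entries + 1) {
        for (std::size_t i = 0; i <= entries; ++i)
            table_[i] = std::sin(2.0f * kPi * static_cast<float>(i) / static_cast<float>(entries));
    }

    // Approximate sin(x) for any x: range-reduce into one period, then lerp
    // between the two nearest precomputed samples.
    float operator()(float x) const {
        float t = x / (2.0f * kPi);
        t -= std::floor(t);                              // wrap to [0, 1)
        float pos = t * static_cast<float>(table_.size() - 1);
        std::size_t i = static_cast<std::size_t>(pos);
        float frac = pos - static_cast<float>(i);
        return table_[i] * (1.0f - frac) + table_[i + 1] * frac;
    }

private:
    static constexpr float kPi = 3.14159265358979f;
    std::vector<float> table_;
};
```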
The tail end of Moore's Law demands that every stone be turned, and I just hope AMD has taken the bold route rather than the cautious one. We'll see in ~2027.