Discussion RDNA 5 / UDNA (CDNA Next) speculation


reaperrr3

Member
May 31, 2024
It’s been a while since we saw those leaked specs. I doubt AT0 is still alive.
lol, why :laughing:
AT0 has better prospects than (desktop) AT3/4, imo.

We got zero leaks for Rubin other than that "CPX" die shot that suggests GR102 (or whatever the codename is) stays at 192 SM. You think that means Rubin is dead?
I think it's wrong to assume the current status of any chip based on a lack of leaks.
If anything, that's usually a good sign; cancellations often get reported earlier and more reliably than exact specs.
 
Reactions: Tlh97

dangerman1337

Senior member
Sep 16, 2010
reaperrr3 said:
lol, why :laughing:
AT0 has better prospects than (desktop) AT3/4, imo.

We got zero leaks for Rubin other than that "CPX" die shot that suggests GR102 (or whatever the codename is) stays at 192 SM. You think that means Rubin is dead?
I think it's wrong to assume the current status of any chip based on a lack of leaks.
If anything, that's usually a good sign; cancellations often get reported earlier and more reliably than exact specs.
Do we know that "CPX" is the next-gen RTX GeForce flagship die? I'm 50/50 on that personally.

I mean, RTX 60/GeForce Rubin leaks are absolutely threadbare; even the venerable Kopite7Kimi hasn't leaked anything. Just feels all... weird?
 

reaperrr3

Member
May 31, 2024
dangerman1337 said:
Do we know that "CPX" is the next-gen RTX GeForce flagship die? I'm 50/50 on that personally.
Technically we don't, no.
But even with all the money they have, I just don't see Nvidia doing both CPX and a separate big gaming-focused chip.
More precisely, I think it would be naive, wishful thinking to believe they would.

Why spend hundreds of millions of dollars extra on a 2nd big die, if you can just make the AI-focused one a bit faster than GB202 in gaming and be done with it?
Gaming cards are too low-margin to bother with an extra chip (from NV's perspective, not ours ofc).
It's more cost-efficient to just make an AI-focused chip that is also capable of running games some 20-30% faster than GB202, and sell the bad salvage bins to gamers.
I could even see them go back to 384-bit for the 6090, just with faster 3GB chips (hey, 4GB extra in total!). Good enough for the gamer peasants (again, their perspective), and it lets them sell dies with some defective memory interfaces or controllers, too.
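Quick back-of-the-envelope on that, just a sketch: it assumes one 32-bit GDDR7 module per 32 bits of bus width on a non-clamshell board, which is how the 5090 is populated.

```cpp
#include <cstdio>

// Back-of-the-envelope VRAM capacity from bus width and module density.
// Assumes one GDDR7 module per 32 bits of bus width (non-clamshell board).
int capacity_gb(int bus_width_bits, int gb_per_module) {
    return (bus_width_bits / 32) * gb_per_module;
}

int main() {
    // GB202 / RTX 5090 today: 512-bit bus with 2GB modules -> 32 GB
    std::printf("512-bit x 2GB = %d GB\n", capacity_gb(512, 2));
    // Hypothetical 384-bit "6090" with 3GB modules -> 36 GB, i.e. 4 GB more
    std::printf("384-bit x 3GB = %d GB\n", capacity_gb(384, 3));
    return 0;
}
```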

If we get lucky and they get generous, they might even upgrade the Gxxx3 chip to 96 SM, clock it high enough to reach 4090 perf, and throw in a whopping 24GB for the 6080 at launch, for only $1,499. /s
 
Reactions: Tlh97 and marees

luro

Member
Dec 11, 2022
reaperrr3 said:
lol, why :laughing:
AT0 has better prospects than (desktop) AT3/4, imo.

We got zero leaks for Rubin other than that "CPX" die shot that suggests GR102 (or whatever the codename is) stays at 192 SM. You think that means Rubin is dead?
I think it's wrong to assume the current status of any chip based on a lack of leaks.
If anything, that's usually a good sign; cancellations often get reported earlier and more reliably than exact specs.
Because it's AMD that has a reputation for not shipping halo cards.
 

MrMPFR

Member
Aug 9, 2025
Well, maybe it is simply not required to have more memory bandwidth:
- Revamped CUs and respective low level caches (bigger capacity)
- Out-of-order execution (increase hardware utilization of ALUs and cache)
- Maybe L0 cache sharing across multiple CUs (reduce wasted SRAM capacity, reduce LLC & DRAM bandwidth requirements)
- Universal compression (smaller memory footprint, reduce bandwidth requirements)
- DGF & DMM (smaller memory footprint, reduce bandwidth requirements)
- Neural techniques like NTC, which aim to reduce data fetching from DRAM and instead use more compute from the matrix engines (whose performance mostly relies on CU low-level caches) to generate or extract data and information
- Work graphs and procedural algorithms with dynamic execution on CU level (reduces code footprints and reduces bandwidth pressure from higher level caches and DRAM)

All those things aim to maximize usage of low level CU resources, increase data locality and reduce load on higher level structures like LLC and DRAM.
It seems that there is much going on regarding rethinking GPU architecture as a whole.
This is a great summary but perhaps AMD wants to go even further than this. It really depends on just how clean slate GFX13 is.

#1: Maybe instead of bigger caches, a universal M3-esque flexible cache to maximize compute/area and reduce cache-memory overhead.
#2: Hopefully without massive area overhead. The GhOST paper indicated this is plausible.
#3: Yeah, like the 2020 AMD paper. Flexible clustering and global/private management based on compiler and other hints. Combined with dataflow execution this could be a game changer for ML. Much less pressure on L2 and VRAM. Maybe it could be expanded to other forms of WGP caches and register files. Perhaps the WGP VGPR takeover mode Cerny talked about during the Road to PS5 Pro talk could be extended across multiple WGPs.
#5: This is probably a compute-to-cache tradeoff, but still a considerable benefit for on-die cache-memory usage.
#7: Hopefully well beyond that.
8 days ago NVIDIA published a research paper suggesting it's possible to basically bolt dataflow execution onto existing architectures with only modest adjustments, although it's still far from feature complete (see section 7). Despite this, sizeable speedups and reductions in VRAM bandwidth traffic were achieved for both inference and training. Interestingly, section 8 clearly outlines how Kitsune leverages tile programming, mirroring recent moves with CUDA Tile, and also how it's far more broadly applicable than Work Graphs, which are largely limited to shader pipelines.
I'm bringing this up because AMD is already exploring a dataflow paradigm shift on the API side with Work Graphs, so why not go all the way and make sweeping changes on the HW and compiler side to fundamentally change how workloads are managed on GPUs? While Work Graphs might be a push toward next-gen graphics API standardization, even with the impressive patent-derived optimizations I doubt they give anywhere close to the full picture of what GFX13 and later could be capable of in terms of compute and ML perf. They would probably need a brand-new API and a clean-slate compiler to fully tap into this.
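To make the dataflow idea a bit more concrete, here's a toy host-side sketch (my own illustration, not anything from the Kitsune paper, CUDA Tile, or AMD): work is described as a graph of nodes, and each node runs as soon as all of its producers have finished, instead of the host enqueuing a fixed command list with barriers between dispatches.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Toy "dataflow graph" executor: a node runs as soon as all of its producers
// have completed, rather than in a host-ordered command-list sequence.
struct Node {
    std::vector<int> consumers;    // indices of nodes that consume our output
    int pending_inputs = 0;        // producers that haven't finished yet
    std::function<void()> work;    // stand-in for "launch this kernel"
};

void run_graph(std::vector<Node>& nodes) {
    std::queue<int> ready;
    for (int i = 0; i < (int)nodes.size(); ++i)
        if (nodes[i].pending_inputs == 0) ready.push(i);

    while (!ready.empty()) {
        int i = ready.front(); ready.pop();
        nodes[i].work();                         // "launch" the node
        for (int c : nodes[i].consumers)         // wake any consumer whose
            if (--nodes[c].pending_inputs == 0)  // last input just arrived
                ready.push(c);
    }
}

int main() {
    // produce -> {filterA, filterB} -> consume: the graph edges express the
    // synchronization, so no global barrier between the two filters is needed.
    std::vector<Node> g(4);
    g[0] = {{1, 2}, 0, [] { std::puts("produce tile"); }};
    g[1] = {{3},    1, [] { std::puts("filter A");     }};
    g[2] = {{3},    1, [] { std::puts("filter B");     }};
    g[3] = {{},     2, [] { std::puts("consume tile"); }};
    run_graph(g);
    return 0;
}
```

On real hardware the scheduling would of course live in the GPU's own front end rather than in a host loop, and the "work" would be actual kernels touching on-chip data; the point is just that dependencies, not a serialized command stream, drive execution.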

Some other considerations to reduce memory and cache pressure (far from complete):
Summarizing prev info
- Decentralized and locally autonomous distributed scheduling and dispatch (less pressure on higher level caches)
- Mapping data accesses to exploit ^ (^)
- Leaf nodes (ray/tri): Prefiltering and DGF nodes = parallel fixed-point testers (increased intersections/kB of cache)
- Payload sorting (^^^)
- Deferred any-hit shaders (increased cache and ALU utilization)

New
- TLAS and upper BLAS (ray/box): Sorting rays into coherent bundles to be tested together to reduce redundant calculations (less cache-memory overhead)
- Sophisticated lookup tables to reuse expensive calculations (transcendental math, more general vector calculations) (^) — sketch below
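A minimal sketch of that lookup-table principle (my own toy, nothing AMD- or patent-specific): pay for the transcendental once while filling a small table, then serve later evaluations with a cheap interpolated read.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

// Toy sin(x) lookup table for x in [0, 2*pi): spend a small block of memory
// once so that repeated transcendental evaluations become cheap table reads.
constexpr int   N      = 256;
constexpr float TWO_PI = 6.28318530718f;

struct SinLUT {
    std::array<float, N + 1> table{};   // one extra entry so lerp never wraps

    SinLUT() {
        for (int i = 0; i <= N; ++i)
            table[i] = std::sin(TWO_PI * i / N);   // expensive math, paid once
    }

    // Cheap query: index + linear interpolation between neighbouring entries.
    float operator()(float x) const {
        float t = (x / TWO_PI) * N;
        int   i = (int)t;
        float f = t - (float)i;
        return table[i] * (1.0f - f) + table[i + 1] * f;
    }
};

int main() {
    SinLUT fast_sin;
    for (float x : {0.1f, 1.0f, 2.5f, 5.0f})
        std::printf("x=%.1f  lut=%.5f  libm=%.5f\n", x, fast_sin(x), std::sin(x));
    return 0;
}
```

The table costs a bit of accuracy and a block of SRAM, which is exactly the kind of compute/area/accuracy trade-off described above.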

The tail end of Moore's Law demands that every stone be turned, and I just hope AMD has taken the bold route rather than the cautious one. We'll see in ~2027.
 

MrMPFR

Member
Aug 9, 2025
Just read the excellent blog post by Sebastian Aaltonen shared by @Gideon last week. Shocking how flawed the "modern APIs" are; new ones can't come soon enough. DX12's legacy bloat with Work Graphs bolted on top would hold back post-crossgen releases.
Compare that with a feature-complete No Graphics API (DX13 and Vulkan 2.0) with accommodations (native design + extensions) for a dataflow execution architecture, as described in my previous comment, which could greatly benefit the 10th generation of gaming. Basically sounds like DX13 + WGs on steroids.
Especially true for developers that can't afford whiz SWEs, as highlighted by @marees' post. The API's design philosophy means it's "...simpler to use than DirectX 11 and Metal 1.0, yet it offers better performance and flexibility than DirectX 12 and Vulkan." Oh, and someone is working on an actual API implementation.


Some hypothetical changes and implications summarized below
- Grain of salt advised, no professional background

Sebbi's No graphics API:
  • Unified memory
  • 64-bit GPU pointers everywhere
  • Shaders = C++ like kernels
  • Bindless everything
  • Raster/RT as libraries and intrinsics
  • No descriptors
  • No PSOs, permutations and pipeline caching
  • No resource types
  • No barriers
  • No stateful driver
  • No heap enumeration
  • No memory type guessing
  • No legacy shader languages

DX12 → DX13 + Dataflow extensions:
- Command buffers → dataflow graphs
- Resource objects → pointers
- Descriptors → bindless
- PSOs → dynamic pipelines
- CPU orchestration → GPU autonomy
- Fixed pipelines → unified compute
- Legacy bloated APIs → sleek modern API
- Bloated driver → thin driver
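To put the "Resource objects → pointers / Descriptors → bindless" rows into code, here's a tiny CPU-side toy (every name here is invented for illustration; it is not Sebbi's API or any real one): "GPU memory" is one flat pool, a resource is nothing more than a 64-bit address into it, and the "shader" follows those addresses directly instead of going through views, descriptor heaps, and typed resource objects.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Toy model of "unified memory + 64-bit pointers everywhere" (names invented).
// "GPU memory" is one flat byte pool and a resource handle is just an offset
// into it, so there are no typed resource objects, descriptor heaps or views.

using GpuAddr = uint64_t;

struct GpuMemory {
    std::vector<uint8_t> pool;

    GpuAddr upload(const void* data, size_t size) {
        GpuAddr addr = pool.size();                 // the "GPU virtual address"
        const uint8_t* p = static_cast<const uint8_t*>(data);
        pool.insert(pool.end(), p, p + size);
        return addr;                                // the only handle anyone needs
    }
    const void* resolve(GpuAddr addr) const { return pool.data() + addr; }
};

// What a "shader" would see: plain structs that embed GPU addresses.
struct Texture  { uint32_t width, height; };
struct Material { GpuAddr albedo; float roughness; };

// "Shader": follows the embedded addresses itself; no per-draw bind calls,
// no root signature, no barriers in this toy.
void shade(const GpuMemory& mem, GpuAddr material_addr) {
    Material m; std::memcpy(&m, mem.resolve(material_addr), sizeof(m));
    Texture  t; std::memcpy(&t, mem.resolve(m.albedo),      sizeof(t));
    std::printf("material: roughness=%.2f albedo=%ux%u\n",
                m.roughness, t.width, t.height);
}

int main() {
    GpuMemory mem;

    Texture tex{1024, 1024};
    GpuAddr tex_addr = mem.upload(&tex, sizeof(tex));

    Material mat{tex_addr, 0.35f};
    GpuAddr mat_addr = mem.upload(&mat, sizeof(mat));

    shade(mem, mat_addr);   // one 64-bit address in; the rest is pointer chasing
    return 0;
}
```

Everything in the left-hand DX12 column (views, heaps, root signatures, barrier/state tracking) simply has no counterpart here; the flip side is that validation and lifetime management become the application's problem rather than the driver's.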


Fingers crossed RDNA5 and PS6 go all the way architecturally, and even if they don't, a hypothetical DX13 still sounds much better than DX12 + WGs. Sounds like we're in for an inevitable programming paradigm shift of a similar magnitude to the pure fixed function → programmable shaders transition of the early 2000s. Add ML and PT on top and the 2030s will be truly next-gen.
 
Reactions: Elfear and marees