Discussion RDNA 5 / UDNA (CDNA Next) speculation

jpiniero · Monday at 3:47 PM

I was thinking more that the busted AT0 would end up being used for AI cards.

adroc_thurston · Monday at 3:59 PM

dangerman1337 said:
I mean AMD could release a 48GB AT0 (or 64 if they feel like doing 4GB modules, or leave those for pro cards) full fat-SKU against a 6090/6090 Ti if they feel it

It ain't gonna win anything with shader core count parity.

luro · Monday at 5:53 PM

It’s been a while since we saw those leaked specs. I doubt AT0 is still alive.

reaperrr3 · Monday at 6:27 PM

luro said:
It’s been a while since we saw those leaked specs. I doubt AT0 is still alive.

lol, why

AT0 has better prospects than (desktop) AT3/4, imo.

We got zero leaks for Rubin other than that "CPX" die shot that suggests GR102 (or whatever the codename is) stays at 192 SM. You think that means Rubin is dead?
I think it's wrong to assume the current status of any chips based on lack of leaks.
If anything, that's usually a good sign; cancellations often get reported more early and reliably than exact specs.

dangerman1337 · Monday at 7:22 PM

reaperrr3 said:
lol, why
AT0 has better prospects than (desktop) AT3/4, imo.

We got zero leaks for Rubin other than that "CPX" die shot that suggests GR102 (or whatever the codename is) stays at 192 SM. You think that means Rubin is dead?
I think it's wrong to assume the current status of any chips based on lack of leaks.
If anything, that's usually a good sign; cancellations often get reported more early and reliably than exact specs.

Do we know that "CPX" is the next-gen RTX GeForce flagship die? I'm 50/50 on that personally.

I mean RTX 60/GeForce Rubin leaks are absolutely threadbare, even the venerable Kopite7Kimi hasn't leaked anything. Just feels all... weird?

reaperrr3 · Monday at 8:09 PM

dangerman1337 said:
Do we know that "CPX" is the next-gen RTX GeForce flagship die? I'm 50/50 on that personally.

Technically we don't, no.
But even with all the money they have, I just don't see Nvidia doing both CPX and a separate big gaming-focused chip.
More precisely, I think it would be naive, wishful thinking to believe they would.

Why spend hundreds of millions of dollars extra on a 2nd big die, if you can just make the AI-focused one a bit faster than GB202 in gaming and be done with it?
Gaming cards are too low-margin to bother with an extra chip (from NV's perspective, not ours ofc).
It's more cost-efficient to just make an AI-focused chip that is also capable of running games some 20-30% faster than GB202, and sell the bad salvage bins to gamers.
I could even see them go back to 384bit for the 6090, just with faster 3GB chips (hey, 4GB extra in total!). Good enough for the gamer peasants (again, their perspective), and lets them sell dies with some defective mem interfaces or -controllers, too.

If we get lucky and they get generous, they might even upgrade the Gxxx3 chip to 96 SM, clock it high enough to reach 4090 perf, and throw in a whopping 24GB for the 6080 at launch, for only 1499$. /s

luro · Monday at 8:58 PM

reaperrr3 said:
lol, why
AT0 has better prospects than (desktop) AT3/4, imo.

We got zero leaks for Rubin other than that "CPX" die shot that suggests GR102 (or whatever the codename is) stays at 192 SM. You think that means Rubin is dead?
I think it's wrong to assume the current status of any chips based on lack of leaks.
If anything, that's usually a good sign; cancellations often get reported more early and reliably than exact specs.

Because it’s AMD who has a fame of not shipping halo cards

adroc_thurston · Monday at 9:18 PM

luro said:
Because it’s AMD who has a fame of not shipping halo cards

oh it's not a halo part at all.
A real halo part would be tiled with like 2x the SM count.

MrMPFR · Tuesday at 12:14 PM

basix said:
Well, maybe it is simply not required to have more memory bandwidth:
- Revamped CUs and respective low level caches (bigger capacity)
- Out-of-order execution (increase hardware utilization of ALUs and cache)
- Maybe L0 cache sharing across multiple CUs (reduce wasted SRAM capacity, reduce LLC & DRAM bandwidth requirements)
- Universal compression (smaller memory footprint, reduce bandwidth requirements)
- DGF & DMM (smaller memory footprint, reduce bandwidth requirements)
- Neural techniques like NTC which aim to reduce data fetching from DRAM but rather use more compute from matrix engines (whose performance mostly rely on CU low level caches) to generate or extract data and information
- Work graphs and procedural algorithms with dynamic execution on CU level (reduces code footprints and reduces bandwidth pressure from higher level caches and DRAM)

All those things aim to maximize usage of low level CU resources, increase data locality and reduce load on higher level structures like LLC and DRAM.
It seems that there is much going on regarding rethinking GPU architecture as a whole.

This is a great summary but perhaps AMD wants to go even further than this. It really depends on just how clean slate GFX13 is.

#1: Maybe instead of bigger caches a universal M3-esque flexible cache to maximize compute/area and cachemem overhead.
#2: Hopefully without massive area overhead. GhOST paper indicated this is plausible
#3: Yeah like the 2020 AMD paper. Flexible clustering and global/private management based on compiler and other hints. Combined with dataflow execution this could be a gamechanger for ML. Much less pressure on L2 and VRAM. Maybe it could be expanded to other forms of WGP caches and register files. Perhaps the WGP VGPR takeover mode Cerny talked about during Road to PS5 Pro talk could be extended across multiple WGPs.
#5: This is probably a compute-to-cache tradeoff but yeah a considerable benefit on-die cachemem usage.
#7: Hopefully well beyond that.
8 days ago NVIDIA published this research paper suggesting it's possible to basically bolt dataflow execution onto existing architectures with only modest adjustments, although still far from being feature complete (see section 7). Despite this sizeable speedups and reductions in VRAM BW traffic were achieved for inference and training. And interestingly at section 8 it clearly outlines how Kitsune is leveraging tile programming, mirroring recent moves with CUDA Tile, and also how it's generally far more applicable than Work Graphs, that's generally limited to shader pipelines.
I'm bringing this up because AMD is already exploring a dataflow API paradigm shift with Work Graphs, and as a result why not go all the way and implement sweeping changes on the HW and compiler side to fundamentally change how workloads are managed on GPUs. While workgraphs might be a push nextgen graphics API standardization, even with the impressive patent derived optimizations I doubt it gives anywhere close to the full picture of what GFX13 and later could be capable off in terms of compute and ML perf. They would prob need a brand new API and clean slate compiler to fully tap into this.

Some other considerations to reduce memory and cache pressure (far from complete):
Summarizing prev info
- Decentralized and locally autonomous distributed scheduling and dispatch (less pressure on higher level caches)
- Mapping data accesses to exploit ^ (^)
- Leaf nodes (ray/tri): Prefiltering and DGF nodes = parallel fixed-point testers (increased intersections/kB of cache)
- Payload sorting (^^^)
- Deferred any-hit shaders (increased cache and ALU utilization)

New
- TLAS and upper BLAS (ray/box): Sorting rays into coherent bundles to be tested together to reduce redundant calculations (less cachemem overhead)
- Sophisticated lookup tables to reuse expensive calculations (trancendental) math, more general vector calculations (^)

The tail end end of Moore's Law demands every stone to be turned and I just hope AMD has taken the bold route rather than the cautious one. We'll see in ~2027.

luro · Wednesday at 7:36 AM

adroc_thurston said:
oh it's not a halo part at all.
A real halo part would be tiled with like 2x the SM count.

You’r right. I still don’t understand why they refuse to ship that. People already paying 3k+ on 5090’s shows that they would pay even more

MrMPFR · Wednesday at 9:07 AM

Just read the excellent blogpost by Sebastian Aaltonen shared by @Gideon last week. Shocking how flawed the "modern APIs" are and new ones can't come soon enough. DX12's legacy bloat with Work Graphs bolted on top would hold back post-crossgen releases.
Compare that with a feature complete No Graphics API (DX13 and Vulkan 2.0) with accommodations (native design + extensions) for dataflow execution architecture, as described in my prev comment, that could greatly benefit the 10th era of gaming. Basically sounds like DX13 + WGs on steroids.
Especially true for developers that can't afford wiz SWEs as highlighted by @marees post. The API's design philosophy mean it's "...simpler to use than DirectX 11 and Metal 1.0, yet it offers better performance and flexibility than DirectX 12 and Vulkan." Oh and someone is working on an actual API implementation.

Some hypothetical changes and implications summarized below
- Grain of salt advised, no professional background

Sebbi's No graphics API:

Unified memory
64-bit GPU pointers everywhere
Shaders = C++ like kernels
Bindless everything
Raster/RT as libraries and intrinsics
No descriptors
No PSOs, permutations and pipeline caching
No resource types
No barriers
No stateful driver
No heap enumeration
No memory type guessing
No legacy shader languages

DX12 → DX13 + Dataflow extensions:
- Command buffers → dataflow graphs
- Ressource objects → pointers
- Descriptors → bindless
- PSOs → dynamic pipelines
- CPU orchestration → GPU autonomy
- Fixed pipelines → unified compute
- Legacy bloated APIs → sleek modern API
- Bloated driver -> thin driver

Fingers crossed RDNA5 and PS6 goes all the way architecturally and even if they don't a hypothetical DX13 still sounds much better than DX12 + WGs. Sounds like we're in for an inevitable programming paradigm shift of a similar magnitude to pure fixed function → programmable shaders in the early 2000s. Add ML and PT on top and the 2030s will be truly nextgen.

marees · Wednesday at 9:10 AM

MrMPFR said:
Bindless everything

Apparently the idea was to bypass cpu latency/overhead by using CUDA like pointers. (No Rust here. O/S needs reliability but Graphics API needs performance)

del42sa · 2025-12-26T15:40:35-0500

marees said:
What is the expectation on the memory situation?
When will it cool down?
Will lpddr5x & lpddr6 be affected?

When will RDNA 5 launch now
What will be the SKUs

Will it be something like this

50xt 12gb (AT4 24 CU) — $300

60 16gb (AT3 40? CU) — $400

60xt 16gb/24gb/32gb (AT3 48 CU) — $500/$550/$600

70 18gb (AT2 56/60 CU) — $650

70xt 18gb/24gb (AT2 72 CU) — $700/$800

80xt 24gb (AT0 128? CU) — $1200+

90xt 36gb (AT0 144? CU) — $1500+

54abf64dd6ea6085d14264d24785851379af5338f14b228fc2ba5df864dd4bda.png

jpiniero · 2025-12-26T15:43:50-0500

del42sa said:
snip

MLID... C'mon man.

ToTTenTranz · 2025-12-27T04:22:30-0500

jpiniero said:
MLID... C'mon man.

IMHO, people should cope better about the fact that MLID has been the primary source of rumors that end up being right on AMD for the past 4+ years.

Up until a couple of days ago we had people here swearing up and down that the 9950X3D2 didn't exist, which MLID leaked. Before that, we had people here swearing AMD wasn't going to launch any high-end RDNA5 SKU until MLID showed the whole range with AT0.

His Intel rumors have been weak and pretty much dead wrong, but AMD's are solid at the moment.

NTMBK · 2025-12-27T08:25:46-0500

MrMPFR said:
Just read the excellent blogpost by Sebastian Aaltonen shared by @Gideon last week. Shocking how flawed the "modern APIs" are and new ones can't come soon enough. DX12's legacy bloat with Work Graphs bolted on top would hold back post-crossgen releases.
Compare that with a feature complete No Graphics API (DX13 and Vulkan 2.0) with accommodations (native design + extensions) for dataflow execution architecture, as described in my prev comment, that could greatly benefit the 10th era of gaming. Basically sounds like DX13 + WGs on steroids.
Especially true for developers that can't afford wiz SWEs as highlighted by @marees post. The API's design philosophy mean it's "...simpler to use than DirectX 11 and Metal 1.0, yet it offers better performance and flexibility than DirectX 12 and Vulkan." Oh and someone is working on an actual API implementation.

Some hypothetical changes and implications summarized below
- Grain of salt advised, no professional background

Sebbi's No graphics API:

Unified memory

64-bit GPU pointers everywhere

Shaders = C++ like kernels

Bindless everything

Raster/RT as libraries and intrinsics

No descriptors

No PSOs, permutations and pipeline caching

No resource types

No barriers

No stateful driver

No heap enumeration

No memory type guessing

No legacy shader languages

DX12 → DX13 + Dataflow extensions:
- Command buffers → dataflow graphs
- Ressource objects → pointers
- Descriptors → bindless
- PSOs → dynamic pipelines
- CPU orchestration → GPU autonomy
- Fixed pipelines → unified compute
- Legacy bloated APIs → sleek modern API
- Bloated driver -> thin driver

Fingers crossed RDNA5 and PS6 goes all the way architecturally and even if they don't a hypothetical DX13 still sounds much better than DX12 + WGs. Sounds like we're in for an inevitable programming paradigm shift of a similar magnitude to pure fixed function → programmable shaders in the early 2000s. Add ML and PT on top and the 2030s will be truly nextgen.

I'm still reading through Sebbi's blog post (it's a big boi), and it definitely sounds interesting. I'm not a graphics programmer, but I've done some CUDA in a past career and had to poke around in Unreal's render code to debug issues, and the mess of shader types, resource types etc in DX12 is pretty daunting. Getting it simplified and cleaned up definitely sounds like a big win for programmer productivity and debugging.

I'm more dubious of any claims of potential performance wins. We've been down this road before with DX12/Mantle, and we didn't see a great deal. Graphics hardware is constantly changing, as is the software running on top of it, and today's beautiful simple API will undoubtedly hold back tomorrow's exciting new idea. I'm sure when we're all doing neural shading and path tracing in 10 years' time we'll all be cursing how unsuitable DX13 is.

dangerman1337 · 2025-12-27T09:33:09-0500

MrMPFR said:
Fingers crossed RDNA5 and PS6 goes all the way architecturally and even if they don't a hypothetical DX13 still sounds much better than DX12 + WGs. Sounds like we're in for an inevitable programming paradigm shift of a similar magnitude to pure fixed function → programmable shaders in the early 2000s. Add ML and PT on top and the 2030s will be truly nextgen.

But games made from RDNA5/PS6 from the ground up are a very long way off. Probably 2031 at the earliest assuming PS6 is in 2027 and even then most developers will want to run on pre-DX13 hardware like your RTX 4090s etc, can see the dumb discourse about developers in 2033 or so "discriminating" against 4090 owners.

Win2012R2 · 2025-12-27T11:59:53-0500

del42sa said:

Is MLID doing Swordfish style online job interview for a janitor position at Nvidia?

P.S. The dog...

marees · 2025-12-27T12:04:49-0500

Wish this were true — but this guy 100% makes up stuff for engagement bait

https://twitter.com/x/status/2004942518732161364

jpiniero · 2025-12-27T13:58:23-0500

20% faster than a 4080 wouldn't be too bad for the top product (assuming no consumer products using AT0 end up shipping)

So it'd be like a 6070 Ti? Figure even without the memory surge that'd be close to the 5080's MSRP... so maybe AMD would try $750-$800?

adroc_thurston · 2025-12-27T14:01:53-0500

jpiniero said:
(assuming no consumer products using AT0 end up shipping)

they gotta keep real bad dies somewhere.

jpiniero · 2025-12-27T14:05:25-0500

adroc_thurston said:
they gotta keep real bad dies somewhere.

Ayy Eye workstation cards?

reaperrr3 · 2025-12-27T14:45:12-0500

jpiniero said:
Ayy Eye workstation cards?

Even if all worst-case circumstances like AI demand, high wafer prices, high memory prices etc. hold up until then, AMD could probably ask $1.5k or more for a 128-144 CU/384bit/24GB garbage SKU and it'd still sell, due to NV's alternatives being either slower (6080) or MUCH more expensive (6090), so I don't see why they shouldn't ship at least one token desktop flagship at modest volume for appearances.

I mean, if they hadn't cancelled N4C a bit prematurely (or if they'd at least had an N48+50% mono replacement above N48), they surely would've shipped that too.
AMD clearly overestimated Blackwell's gaming perf and underestimated how much the AI craze would trickle down to desktop in terms of local LLMs, otherwise they'd have done something above N48.

adroc_thurston · 2025-12-27T15:09:01-0500

jpiniero said:
Ayy Eye workstation cards?

Tiny nonexistent segment that wants fat VRAM and Not Completely Botched dies.

Discussion RDNA 5 / UDNA (CDNA Next) speculation

Lifer

Diamond Member

Member

Member

Senior member

Member

Member

Diamond Member

Member

Member

Member

Platinum Member

Member

Lifer

Senior member

Lifer

Senior member

Golden Member

Platinum Member

Lifer

Diamond Member

Lifer

Member

Diamond Member