Discussion: RDNA 5 / UDNA (CDNA Next) speculation


marees

Platinum Member
Apr 28, 2024
2,207
2,857
96
Realistically, the earliest we can expect to see next-gen Nvidia Rubin or AMD RDNA 5 is 2027 at this point - and if the RTX 50-series Super is delayed (Nvidia will argue that it hasn't announced anything, therefore there is no delay) and arrives in, say, Q3 2026, we would expect that to push Rubin back until much later into 2027.

Graphics innovation drives the PC gaming market, and updates to existing lines could be a long way off, let alone true next-gen upgrades. How long really depends on what 3nm inventory AMD and Nvidia have secured from chip manufacturer TSMC, and on whether memory is available at a reasonable price to make a sizeable roll-out possible.

 
  • Like
Reactions: Tlh97

basix

Senior member
Oct 4, 2024
303
601
96
Being stuck with Clamshell the entire generation would be rough.

I'm sure IO is going to be mega expensive on N3... but I would much rather do 256-bit without clamshell to get the extra memory bandwidth.

Well, maybe more memory bandwidth is simply not required:
- Revamped CUs and their low-level caches (bigger capacity)
- Out-of-order execution (better hardware utilization of ALUs and caches)
- Maybe L0 cache sharing across multiple CUs (less wasted SRAM capacity, lower LLC & DRAM bandwidth requirements)
- Universal compression (smaller memory footprint, lower bandwidth requirements)
- DGF & DMM (smaller memory footprint, lower bandwidth requirements)
- Neural techniques like NTC, which aim to fetch less data from DRAM and instead use more compute on the matrix engines (whose performance mostly relies on the CUs' low-level caches) to generate or extract the data (see the sketch below)
- Work graphs and procedural algorithms with dynamic execution at the CU level (smaller code footprint, less bandwidth pressure on higher-level caches and DRAM)

All of these aim to maximize the use of low-level CU resources, increase data locality, and reduce the load on higher-level structures like the LLC and DRAM.
It seems there is a lot going on in terms of rethinking the GPU architecture as a whole.
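To illustrate the NTC bullet with something concrete, here is a rough CUDA sketch (my own toy example, not any vendor's actual scheme; all names and sizes are made up). Instead of streaming a big BCn texture from DRAM, each thread fetches a small latent vector and decodes the texel with a tiny MLP whose weights live entirely on-chip, trading DRAM bandwidth for ALU work:

```cuda
// Sketch of the neural-texture-compression idea: trade DRAM bandwidth
// (big texture fetch) for on-chip compute (tiny MLP decode from a small
// latent). All names and sizes are illustrative only.
#include <cuda_runtime.h>

constexpr int LATENT_DIM = 8;   // latent values fetched per texel
constexpr int HIDDEN_DIM = 16;  // tiny MLP, weights fit in constant memory

// MLP weights live in constant memory, so decoding costs ALU cycles
// and on-chip cache traffic, not DRAM bandwidth.
__constant__ float W1[HIDDEN_DIM][LATENT_DIM];
__constant__ float b1[HIDDEN_DIM];
__constant__ float W2[4][HIDDEN_DIM];   // 4 outputs = RGBA
__constant__ float b2[4];

__global__ void decode_texels(const float* __restrict__ latents, // [numTexels][LATENT_DIM]
                              float4* __restrict__ out,          // decoded RGBA texels
                              int numTexels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numTexels) return;

    // The only DRAM read per texel: the small compressed latent.
    float z[LATENT_DIM];
    for (int k = 0; k < LATENT_DIM; ++k)
        z[k] = latents[i * LATENT_DIM + k];

    // Hidden layer (ReLU), entirely in registers.
    float h[HIDDEN_DIM];
    for (int j = 0; j < HIDDEN_DIM; ++j) {
        float acc = b1[j];
        for (int k = 0; k < LATENT_DIM; ++k)
            acc += W1[j][k] * z[k];
        h[j] = fmaxf(acc, 0.0f);
    }

    // Output layer: reconstructed RGBA texel.
    float rgba[4];
    for (int c = 0; c < 4; ++c) {
        float acc = b2[c];
        for (int j = 0; j < HIDDEN_DIM; ++j)
            acc += W2[c][j] * h[j];
        rgba[c] = acc;
    }
    out[i] = make_float4(rgba[0], rgba[1], rgba[2], rgba[3]);
}
// Host side (omitted): cudaMemcpyToSymbol() to upload the weights,
// cudaMalloc() for latents/out, then launch decode_texels<<<grid, block>>>.
```

In a real implementation the decode would sit behind the texture unit (and matrix engines would run the MLP), but the bandwidth trade-off is the same: a few latent bytes in, a full texel out.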
 

basix

Senior member
Oct 4, 2024
303
601
96
As 42 Gbps and 48 Gbps are already announced: Wouldn't Rubin CPX benefit from more bandwidth?

For gaming I do not see bandwidth demands at these levels. 32-36 Gbps seems to be fine on e.g. a 512-bit 6090 or AMD's equivalent based on AT0.
 

basix

Senior member
Oct 4, 2024
303
601
96
Sure, no doubt about that. One of the best overviews is in this paper: https://arxiv.org/pdf/2410.18038
[Figure from the paper: POD-Attention - unlocking full prefill-decode overlap for faster LLM inference]


48 Gbps on a 512-bit bus results in 3 TB/s of bandwidth. "Big Rubin" with HBM4 will feature 20 TB/s or even more. That is still a huge difference.
There will be cases where higher bandwidth on Rubin CPX will be beneficial. And when I am paying millions for an NVL144 setup, a few dollars more for faster GDDR7 will not matter to the overall cost.
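For anyone who wants to check the math: effective bandwidth = per-pin data rate × bus width / 8 bits per byte. So 48 Gbps × 512 / 8 = 3072 GB/s ≈ 3 TB/s, and the 32-36 Gbps gaming configs discussed above land at roughly 2.0-2.3 TB/s on the same 512-bit bus - still far short of a 20 TB/s HBM4 part.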
 

marees

Platinum Member
Apr 28, 2024
2,207
2,857
96
Coming back to this roadmap leaked by wccftech a while ago, a few observations:

  1. Absolutely no laptop/mobile discrete GPUs. It is all Medusa Premium/Halo if it comes to that.
  2. No halo desktop part. Does that mean AT0 is strictly for xCloud and professional use cases, not for gaming 🤔
  3. RDNA 5 desktop replaces not only N44 but also N33. A big clue for AT4, then (with AT3 & AT2 positioned above it).

 

ToTTenTranz

Senior member
Feb 4, 2021
914
1,524
136
Coming back to this roadmap leaked by wccftech a while ago, a few observations:

  1. Absolutely no laptop/mobile discrete GPUs. It is all Medusa Premium/Halo if it comes to that.
  2. No halo desktop part. Does that mean AT0 is strictly for xCloud and professional use cases, not for gaming 🤔
  3. RDNA 5 desktop replaces not only N44 but also N33. A big clue for AT4, then (with AT3 & AT2 positioned above it).


I'm guessing this is from before AMD regained a bit of trust in the GPU team thanks to RDNA4 and decided to greenlight AT0 consumer variants.
 
  • Like
Reactions: Win2012R2

basix

Senior member
Oct 4, 2024
303
601
96
I'm guessing this is from before AMD regained a bit of trust in the GPU team thanks to RDNA4 and decided to greenlight AT0 consumer variants.
AT0 could be used as a Rubin-CPX-like SKU. Who knows what the main purpose of that chip is.
 

reaperrr3

Member
May 31, 2024
169
493
96
AMD's unique chance is to launch RDNA 5 next year, hope they don't squander it.
The track record of recent years suggests that new gens launch either around the most pessimistic rumored time or even later.
And the latest, most pessimistic rumors point at Computex 2027 for RDNA5 and H2 2027 for consumer Rubin... so yeah.

Though I don't think AMD is "squandering" anything, because that would suggest they could rake in tons of money/loads of market share if they launched early, and I highly doubt that.
We don't know the wafer price delta of N3P vs. N4P, and we don't know how expensive the G7/LP5X memory configs of the ATx parts will be per GB vs. current G6 solutions, or vs. NV's G7 contracts, for that matter.

Anyway, I think 2026 is largely dead as far as real new (GPU) products go, because neither the memory situation nor N3P maturity is where either IHV wants it to be.
 
  • Like
Reactions: Joe NYC and RnR_au

Gideon

Platinum Member
Nov 27, 2007
2,044
5,103
136
Slightly OT, but people here asked me why I think that current Vulkan and DX12 are ancient and stupid and should have been replaced back in the previous console generation:

Sebbi's fresh blog post is a gem discussing this matter:

 

marees

Platinum Member
Apr 28, 2024
2,207
2,857
96
Slightly OT, but people here asked me why I think that current Vulkan and DX12 are ancient and stupid and should have been replaced back in the previous console generation:

Sebbi's fresh blog post is a gem discussing this matter:

Takeaways:

My prototype API shows what is achievable with modern GPU architectures today, if we mix the best bits from all the latest APIs. It is possible to build an API that is simpler to use than DirectX 11 and Metal 1.0, yet it offers better performance and flexibility than DirectX 12 and Vulkan. We should embrace the modern bindless hardware.
HLSL and GLSL shading languages were designed over 20 years ago as a framework of 1:1 elementwise transform functions (vertex, pixel, geometry, hull, domain, etc.). Memory access is abstracted and array handling is cumbersome, as there's no support for pointers. Despite 20 years of existence, HLSL and GLSL have failed to accumulate a library ecosystem. CUDA, in contrast, is a composable language exposing memory directly and new features (such as AI tensor cores) through intrinsics. CUDA has a broad library ecosystem, which has propelled Nvidia to a $4T valuation. We should learn from it.

Min spec hardware

Nvidia Turing (RTX 2000 series, 2018) introduced ray-tracing, tensor cores, mesh shaders, low-latency raw memory paths, bigger & faster caches, a scalar unit, a secondary integer pipeline and many other future-looking features. Officially, PCIe ReBAR support launched with the RTX 3000 series, but there exist hacked Turing drivers that support it too, indicating that the hardware is capable of it. This 7-year-old GPU supports everything we need. Nvidia just ended GTX 1000 series driver support in fall 2025. All currently supported Nvidia GPUs could be supported by our new API.
AMD RDNA2 (RX 6000 series, 2020) matched Nvidia’s feature set with ray-tracing and mesh shaders. One year earlier, RDNA 1 introduced coherent L2$, new L1$ level, fast L0$, generic DCC read/write paths, fastpath unfiltered loads and a modern SIMD32 architecture. PCIe ReBAR is officially supported (brand name “Smart Access Memory”). This 5 year old GPU supports everything we need. AMD ended GCN driver support already in 2021. Today RDNA 1 & RDNA 2 only receive bug fixes and security updates, RDNA 3 is the oldest GPU receiving game optimizations. All the currently supported AMD GPUs could be supported by our API.
Intel Alchemist / Xe1 (2022) were the first Intel chips with SM 6.6 global indexable heap support. These chips also support ray-tracing, mesh shaders, PCIe ReBAR (discrete) and UMA (integrated). These 3 year old Intel GPUs support everything we need.
Apple M1 / A14 (MacBook M1, iPhone 12, 2020) support Metal 4.0. Metal 4.0 guarantees GPU memory visibility to CPU (UMA on both phones and computers), and allows the user to write 64-bit pointers and 64-bit texture handles directly into GPU memory. Metal 4.0 has a new residency set API, solving a crucial usability issue with bindless resource management in the old useResource/useHeap APIs. iOS 26 still supports iPhone 11. Developers are not allowed to ship apps that require Metal 4.0 just yet. iOS 27 likely deprecates iPhone 11 support next year. On Mac, if you drop Intel Mac support, you have guaranteed Metal 4.0 support. M1-M5 = 5 generations = 5 years.
ARM Mali-G710 (2021) is ARM's first modern architecture. It introduced their new command stream frontend (CSF), reducing the CPU dependency of draw call building and adding crucial features like multi-draw indirect and compute queues. Non-uniform index texture sampling is significantly faster, and the AFBC lossless compressor now supports 16-bit floating point targets. G710 supports the Vulkan BDA and descriptor buffer extensions and is capable of supporting the new 2025 unified image layout extension with future drivers. The Mali-G715 (2022) introduced support for ray-tracing.
Qualcomm Adreno 650 (2019) supports Vulkan BDA, descriptor buffer and unified image layout extensions, 16-bit storage/math, dynamic rendering and extended dynamic state with the latest Turnip open source drivers. Adreno 740 (2022) introduced support for ray-tracing.
PowerVR DXT (Pixel 10, 2025) is PowerVR's first architecture that supports the Vulkan descriptor buffer and buffer device address extensions. It also supports 64-bit atomics, 8-bit and 16-bit storage/math, dynamic rendering, extended dynamic state and all the other features we require.
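To make the HLSL/GLSL-vs-CUDA contrast from the takeaways concrete, here is a toy CUDA fragment (my own sketch, not code from Sebbi's post; the Mesh struct and kernel are made up for illustration). A plain struct carries raw 64-bit device pointers and the kernel chases them freely, which classic HLSL/GLSL cannot express without descriptor plumbing, and which Vulkan only approximates via the buffer_device_address extension:

```cuda
// Toy illustration of "a composable language exposing memory directly":
// the kernel receives raw device pointers inside an ordinary struct --
// no binding slots, no descriptor tables. Names are illustrative only.
#include <cuda_runtime.h>

struct Mesh {                  // ordinary struct, passed to the kernel by value
    const float3* positions;   // raw 64-bit GPU pointers
    const int*    indices;
    int           triangleCount;
};

__global__ void triangle_centroids(Mesh m, float3* out)
{
    int tri = blockIdx.x * blockDim.x + threadIdx.x;
    if (tri >= m.triangleCount) return;

    // Plain pointer arithmetic into GPU memory: index buffer -> vertices.
    const int* idx = m.indices + tri * 3;
    float3 a = m.positions[idx[0]];
    float3 b = m.positions[idx[1]];
    float3 c = m.positions[idx[2]];

    // Write the centroid of each triangle.
    out[tri] = make_float3((a.x + b.x + c.x) / 3.0f,
                           (a.y + b.y + c.y) / 3.0f,
                           (a.z + b.z + c.z) / 3.0f);
}
```

Libraries like Thrust and CUB compose directly on top of this raw-pointer model, which is exactly the ecosystem effect the post credits for CUDA's success.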
 

ToTTenTranz

Senior member
Feb 4, 2021
914
1,524
136
PowerVR DXT (Pixel 10, 2025) is PowerVR's first architecture that supports the Vulkan descriptor buffer and buffer device address extensions. It also supports 64-bit atomics, 8-bit and 16-bit storage/math, dynamic rendering, extended dynamic state and all the other features we require.

Too bad that Pixel 11's Tensor G6 is apparently going back a generation to CXT, though.
 

randomhero

Member
Apr 28, 2020
196
300
136
Slightly OT, but people here asked me why I think that current Vulkan and DX12 are ancient and stupid and should have been replaced back in the previous console generation:

Sebbi's fresh blog post is a gem discussing this matter:

I remember Sebbi being a proponent of "no API" way back in the early 2010s, when the GCN arch came out along with consoles on the same arch. And he could always back it up with deep expertise.
Thank you for the link. I always enjoy Sebbi's deep dives (even though I'm way out of my depth here, pun intended 😀).