Discussion RDNA 5 / UDNA (CDNA Next) speculation

stayfrosty

Member
Apr 4, 2024
26
56
51
Yeah man throw a MI450X at it.
MI450 has like 6-7x the BOM of a CPX. Even with ridiculous Nvidia margins, that'll be too big of a TCO advantage to compete with. AMD will essentially give up the large-context inference market long-term if they don't provide a similar solution... they will...
 

reaperrr3

Member
May 31, 2024
131
374
96
Very odd Nvidia announced it now - they are clearly worried about AMD's threat or perhaps trying to counter custom ASICs.
Maybe Rubin CPX (GR202 or whatever it's called) is simply ready earlier than some people expected.

I mean, the rumor mill said gaming Blackwell was ready about half a year before it launched, and with the AI bubble being a thing, NV might've opted for a shorter dev cycle with Rubin: fewer gaming-related uArch changes, just driving up the tensor capabilities for AI/ML ASAP.

And what's the volume and margin expected for cloud gaming? Probably laughable compared to Nvidia's AI volumes.
As long as it's better than the volume and margins in the discrete desktop market, who are we to argue? Money is money.
Besides, would more competitive PF really make a lot of Rubin demand shift to AT0, despite CUDA clout and all that? Doubt it.

Whilst NV certainly should win with an 800mm^2-ish die, (...)
(...) It won't touch GR202, that's for sure.
That's more of an open question than it's ever been before, in my opinion.

RDNA4 already has better (raster) IPC per CU than Blackwell's SMs, and the 5090 was quite CPU-bottlenecked as it was.
Tensor stuff shouldn't be entirely free in terms of power consumption, either.

In other words, GR202 would have to either a) be on a significantly better process (like N2P vs. N3P), or b) make a bigger jump in SM IPC vs. RDNA5's CUs, in addition to reducing driver CPU bottlenecks. Otherwise, at least if you pitch equal configs against each other, AT0 could actually tie or even beat GR202 (not in every game, but in some, and maybe even on average).
 
  • Like
Reactions: Tlh97

adroc_thurston

Diamond Member
Jul 2, 2023
7,094
9,851
106
MI450 has like 6-7x the BOM of a CPX.
And?
Even with ridiculous Nvidia margins, that'll be too big of a TCO advantage to compete with.
That's not how TCO works.
Upfront hardware costs are a tiny fraction of it.

Honestly, very 3/10 FUD, you should go to Beyond3D forums and have a nice Nvidia jackoff session there.
 

stayfrosty

Member
Apr 4, 2024
26
56
51
Just saying that I think AMD will very much bring something similar to CPX to the market in the next few years.
That's not how TCO works.
Upfront hardware costs are a tiny fraction of it.
It's not a tiny fraction lmfao. It's ~80% of the TCO: capital cost is roughly 4x the operating cost per GPU cluster over its lifetime (SemiAnalysis).
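
To make the split explicit, a minimal sketch that just takes the 4:1 capex:opex claim above at face value:

```python
# If capex is ~4x lifetime opex, its share of TCO is 4 / (4 + 1) = 80%.
capex_to_opex = 4.0  # capital cost as a multiple of lifetime operating cost
capex_share = capex_to_opex / (capex_to_opex + 1.0)
print(f"capex share of TCO: {capex_share:.0%}")  # -> 80%
```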
 
  • Like
Reactions: Joe NYC

adroc_thurston

Diamond Member
Jul 2, 2023
7,094
9,851
106
Just saying that I think AMD will very much bring something similar to CPX to the market in the next few years.
There won't be a market for that in the next few years.
It's not a tiny fraction lmfao. It's ~80% of the TCO: capital cost is roughly 4x the operating cost per GPU cluster over its lifetime (SemiAnalysis)
quoting dolan
3/10 ragebait.
 

SolidQ

Golden Member
Jul 13, 2023
1,501
2,464
106
allegedly from docs
[attached images: alleged leaked slides]

Is RDNA5 much stronger in RT than RDNA4?
 

tsamolotoff

Senior member
May 19, 2019
256
510
136
Maybe it's just me, but to me CPX isn't some innovative new market grab so much as an admission that the VRAM bandwidth chase has become too expensive even for Nvidia and its godzilla $10,000+ offerings
 

basix

Senior member
Oct 4, 2024
249
504
96
Cuz AT0 is only ~7 PF or so, AMD is not doing the same matrix cores on gaming dGPUs and DC GPUs like NVIDIA is doing with Rubin.

I have a few thoughts on that:
  • RDNA5 could double the matrix acceleration ratio, especially on AT0. I think that is already included in your ~7 PF figure?
  • Even with only ~7 PF, such an AT0 + MI450 tandem solution might still be a net win regarding TCO. This depends on the ratio between the context/prefill and generation/decode stages. As far as I understand LLM execution, with fewer TFLOPS during context/prefill your maximum context length is just smaller (at the same performance): maybe just 1 million tokens instead of 4 million tokens on Rubin CPX (see the rough sketch after this list)
  • If MI450 and its rackscale solution are cheap enough compared to Vera Rubin, Rubin CPX does not really matter to customers; just slim down AMD's margins. If MI450 costs less than 5x an AT0 (40 PFLOPS vs. 7 PFLOPS), better to use MI450 than AT0 for very large context lengths.
  • Not all systems and applications will benefit from Rubin CPX (shorter context length). And as the market is huge, there is some space for AMD to grow into
  • Maybe AMD surprises us with an integrated solution:
    • MI450 could feature additional memory controllers for GDDR7 or LPDDR6 (integrated in the MIDs)
    • If you could stack additional chiplets onto the MID, or alternatively onto the AID (CPUs for HPC; NPUs / pure matrix accelerators for context/prefill), then MI450 = (Rubin + Rubin CPX) in one package
    • The context/prefill accelerator chiplets would be the same size as a Zen 6 CCD and feature only low-precision matrix acceleration (FP8/INT8 and smaller). You could pack quite a few TFLOPS onto e.g. 4-6 of those chiplets per MI450 (e.g. 2-3 chiplets per MID -> 250-400mm2 of die space)
    • An XDNA3 NPU in 3nm will probably push 100 TFLOPS FP8 (dense) in roughly 15mm2 (XDNA2 with 50 TOPS is roughly 15mm2 in size). A total of 250-400mm2 would then yield ~1700-2700 TFLOPS FP8 dense, or ~6400-10,800 TFLOPS FP4 incl. sparsity (the arithmetic is sketched after this list). Now optimize XDNA3 for MI450 (remove FP16 etc. support) and that might double.
    • So let's assume ~20 PFLOPS of additional FP4 (incl. sparsity) integrated into MI450. 20/40 PFLOPS is not quite on par with Rubin CPX (2*30 / 50 PFLOPS), but could probably be used more efficiently. A unified memory space with the main GPU chiplets has other benefits, like a bigger overall memory buffer. And the +20 PFLOPS of NPU processing power could also be used for the generation/decode phase if the bandwidth between MID and AID is sufficient.
  • Or another surprise with HBM4:
    • The HBM logic base die might be custom and somewhat akin to Processing-in-Memory (PIM)
    • This could amplify bandwidth for the generation/decode stage, and you could use it more efficiently together with context/prefill (overlapping their execution)
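
On the prefill-vs-decode point in the second bullet: a rough back-of-the-envelope sketch of why prefill is compute-bound and decode is bandwidth-bound. All inputs are placeholders (a hypothetical 70B-parameter FP8 model, 8 TB/s of HBM bandwidth), not real AT0/MI450 specs, apart from the ~7 PF figure from this thread:

```python
# Why prefill stresses compute and decode stresses bandwidth (toy numbers).
params = 70e9                 # hypothetical 70B-parameter model
bytes_per_param = 1           # FP8 weights
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token

compute = 7e15    # ~7 PFLOPS dense (the AT0 figure above)
bandwidth = 8e12  # 8 TB/s HBM, a placeholder

# Prefill: the whole prompt is batched, arithmetic intensity is high,
# so time is set by compute.
prompt_tokens = 100_000
prefill_s = prompt_tokens * flops_per_token / compute

# Decode: each generated token must stream all weights from memory,
# so throughput is set by bandwidth.
decode_tok_s = bandwidth / (params * bytes_per_param)

print(f"prefill: ~{prefill_s:.0f} s for a {prompt_tokens}-token prompt")
print(f"decode:  ~{decode_tok_s:.0f} tokens/s per stream")
```

Halve the prefill compute and, at the same latency target, you can only afford roughly half the context; that's the 1M-vs-4M-token trade-off above.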
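
And to make the chiplet-area arithmetic reproducible (every input is an assumption as stated in the bullets: ~100 TFLOPS FP8 dense per ~15mm2 XDNA3 tile, FP4 = 2x FP8, sparsity = another 2x):

```python
# Chiplet-area arithmetic from the bullets above; all inputs are assumptions.
tile_tflops_fp8 = 100  # assumed XDNA3 FP8 dense TFLOPS per tile
tile_mm2 = 15          # assumed tile size, extrapolated from XDNA2

for total_mm2 in (250, 400):
    fp8_dense = total_mm2 / tile_mm2 * tile_tflops_fp8
    fp4_sparse = fp8_dense * 2 * 2  # FP4 doubles FP8; sparsity doubles again
    print(f"{total_mm2}mm2 -> ~{fp8_dense:.0f} TFLOPS FP8 dense, "
          f"~{fp4_sparse:.0f} TFLOPS FP4 w/ sparsity")
# 250mm2 -> ~1667 / ~6667; 400mm2 -> ~2667 / ~10667 (the ranges quoted above)
```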

I think Rubin CPX is a neat extension of the existing rackscale-solution idea. But AMD's engineers are not stupid, and the whole context/prefill vs. generation/decode topic is known to them too (and not new; you can find public info and best practices about it from 2023). As progress in HW and SW development is very fast, we might see one surprise or another in the coming months and years.
 
Last edited:

MrMPFR

Member
Aug 9, 2025
103
207
71
See the problem is that we're not really limited by box/tri testing for RTRT.
Blackwell was barely an improvement RTRT-wise, and that's because making RTRT faster is hard without going for some mildly unorthodox ways to build a GPU shader core.
What changes would an unorthodox GPU shader core introduce?

They are switching to real time path tracing?
Nope. PT was the 40 series. The 50 series unveiled NRC and Neural Materials (neural rendering), still not production-ready. The 60 series: neural rendering galore. Watch NVIDIA's HPG 2025 keynote. It's pretty obvious they want PT + advanced lighting effects approximated using tensor cores.

Even AMD is trying to offload BLAS to MLPs with LSNIF.
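
To make "offload BLAS to MLPs" concrete: the LSNIF idea is to replace a bottom-level acceleration structure with a small per-object network that maps a ray to a hit prediction. A hypothetical minimal sketch (the real LSNIF's input encoding, architecture, and training differ; this just shows the shape of the idea):

```python
import torch
import torch.nn as nn

# Toy stand-in for a neural intersection function: a per-object MLP that maps
# a ray (origin + direction) to a hit probability and a hit distance, instead
# of traversing that object's BVH. Purely illustrative, not AMD's network.
class NeuralIntersection(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),   # 3 origin + 3 direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # [hit logit, hit distance]
        )

    def forward(self, rays: torch.Tensor):     # rays: (N, 6)
        out = self.net(rays)
        return torch.sigmoid(out[:, 0]), out[:, 1]  # hit probability, t

hit_prob, t = NeuralIntersection()(torch.randn(1024, 6))
```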
 
Last edited:

MrMPFR

Member
Aug 9, 2025
103
207
71
That and using the xtor budget for bigger systolic arrays. Ain’t rocket science.
...And freezing or getting rid of high-precision ML formats to save on die area. Even for LLM training, NVFP4 seems fine.

The Blackwell whitepaper's GB202 specs table mentions FP4, FP8, FP16, TF32 and INT8 instructions. NVIDIA could freeze, gimp or retire pre-Hopper formats, then do 8X INT8 + FP8 and 12X NVFP4 dense (SemiAnalysis) per SM vs. GB202 to align with the Rubin DC tensor core.

what kind of magic sauce is nvidia putting into these chips to 6-8x FP4 Perf while barely increasing area and power

Nvidia advertises 4 PFLOPs for the PRO 6000 vs 30 PFLOPs for the CPX (both FP4 w sparsity)

FYI, it's 10X NVFP4 dense vs. the Pro 6000, with a 3:2 sparse:dense ratio like Rubin DC (SemiAnalysis). No info on effective NVFP4 clocks or design changes, but NVIDIA will have to lower clocks when running 20-30 PFLOPS of NVFP4. Quick ratio math below.
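
For anyone checking the math, a small sketch (4 and 30 PFLOPS are the marketing numbers quoted above; the 2:1 Blackwell and 3:2 Rubin sparse:dense ratios are my assumption based on the SemiAnalysis claim):

```python
# Implied dense NVFP4 uplift, Rubin CPX vs. RTX Pro 6000 (ratios are assumptions).
pro6000_sparse = 4.0  # PFLOPS FP4 w/ sparsity, Blackwell (2:1 sparse:dense)
cpx_sparse = 30.0     # PFLOPS FP4 w/ sparsity, Rubin CPX (3:2 sparse:dense)

pro6000_dense = pro6000_sparse / 2.0  # -> 2 PFLOPS dense
cpx_dense = cpx_sparse / 1.5          # -> 20 PFLOPS dense
print(f"dense uplift: {cpx_dense / pro6000_dense:.0f}x")  # -> 10x
```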
 
Last edited:

Cheesecake16

Member
Aug 5, 2020
31
105
106
Cuz AT0 is only ~7 PF or so, AMD is not doing the same matrix cores on gaming dGPUs and DC GPUs like NVIDIA is doing with Rubin.
As long as you have enough compute for RAG, your compute numbers don't really matter for inference, so 7 PF would be plenty; what matters is memory bandwidth and capacity...
As for AMD not using the same matrix cores for client and server... that very well could bite AMD in the behind something fierce without getting their PTX equivalent, amdgcnspirv, up and running in a non-experimental/beta state beforehand, for a number of reasons...
 

adroc_thurston

Diamond Member
Jul 2, 2023
7,094
9,851
106
that very well could bite AMD in the behind something fierce without getting their PTX equivalent, amdgcnspirv, up and running in a non-experimental/beta state beforehand, for a number of reasons...
Pray that Meta wants that and it'll be done in a matter of months.
 

soresu

Diamond Member
Dec 19, 2014
4,105
3,566
136
On X, Musk has "endorsed" AMD for small to mid-sized AI models. That is not a nothingburger.


Lol, at this point I'm pretty sure Lisa Su would rather he just act as if AMD doesn't exist.

His endorsement certainly isn't going to do them any favors now.