Discussion RDNA 5 / UDNA (CDNA Next) speculation

stayfrosty

Member
Apr 4, 2024
26
56
51
Yeah man throw a MI450X at it.
MI450 has like 6-7x the BOM of a CPX. Even with ridiculous Nvidia margins, that'll be too big of a TCO advantage to compete with. AMD will essentially give up the large-context inference market long-term if they don't provide a similar solution... they will...
 

reaperrr3

Member
May 31, 2024
131
374
96
Very odd Nvidia announced it now - they are clearly worried about AMD's threat or perhaps trying to counter custom ASICs.
Maybe Rubin CPX (GR202 or whatever it's called) is simply ready earlier than some people expected.

I mean, the rumor mill said gaming Blackwell was ready about half a year before it launched, and with the AI bubble being a thing, NV might've opted for a shorter dev cycle with Rubin: fewer gaming-related uArch changes, just driving up the tensor capabilities for AI/ML ASAP.

And what's the volume and margin expected for cloud gaming? Probably laughable compared to Nvidia's AI volumes.
As long as it's better than the volume and margins in the discrete desktop market, who are we to argue? Money is money.
Besides, would more competitive PF really make a lot of Rubin demand shift to AT0, despite CUDA clout and all that? Doubt it.

Whilst NV certainly should win with an 800mm^2-ish die, (...)
(...) It won't touch GR202, that's for sure.
That's more of an open question than it's ever been before, in my opinion.

RDNA4 already has better (raster) IPC per CU than Blackwell's SMs, and the 5090 was quite CPU-bottlenecked as it was.
Tensor stuff shouldn't be entirely free in terms of power consumption, either.

In other words, GR202 would have to either a) be on a significantly better process (like N2P vs. N3P), or b) make a bigger jump in SM IPC vs. RDNA5's CUs, in addition to reducing driver CPU bottlenecks. Otherwise, at least if you pitch equal configs against each other, AT0 could actually tie or even beat GR202 (not in every game, but in some, and maybe even on average).
 
  • Like
Reactions: Tlh97

adroc_thurston

Diamond Member
Jul 2, 2023
7,094
9,851
106
MI450 has like 6-7x the BOM of a CPX.
And?
Even with ridiculous Nvidia margins, that'll be too big of a TCO advantage to compete with.
That's not how TCO works.
Upfront hardware costs are a tiny fraction of it.

Honestly, very 3/10 FUD, you should go to Beyond3D forums and have a nice Nvidia jackoff session there.
 

stayfrosty

Member
Apr 4, 2024
26
56
51
Just saying that I think AMD will very much bring something similar to CPX to the market in the next few years.
That's not how TCO works.
Upfront hardware costs are a tiny fraction of it.
It's not a tiny fraction lmfao. It's ~80% of the TCO: capital cost is roughly 4x the operating cost per GPU cluster over its lifetime (SemiAnalysis).
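
To make the split explicit, a minimal sketch that just takes the 4:1 capex:opex claim above at face value:

```python
# If capex is ~4x lifetime opex, its share of TCO is 4 / (4 + 1) = 80%.
capex_to_opex = 4.0  # capital cost as a multiple of lifetime operating cost
capex_share = capex_to_opex / (capex_to_opex + 1.0)
print(f"capex share of TCO: {capex_share:.0%}")  # -> 80%
```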
 
  • Like
Reactions: Joe NYC

adroc_thurston

Diamond Member
Jul 2, 2023
7,094
9,851
106
Just saying that I think AMD will very much bring something similar to CPX to the market in the next few years.
There won't be a market for that in the next few years.
It's not a tiny fraction lmfao. It's ~80% of the TCO: capital cost is roughly 4x the operating cost per GPU cluster over its lifetime (SemiAnalysis)
quoting dolan
3/10 ragebait.
 

SolidQ

Golden Member
Jul 13, 2023
1,501
2,464
106
allegedly from docs
[attached images: alleged leaked slides]

Is RDNA5 much stronger in RT than RDNA4?
 

tsamolotoff

Senior member
May 19, 2019
256
510
136
Maybe it's just me, but to me CPX isn't some innovative new market grab so much as an admission that the VRAM bandwidth chase has become too expensive even for Nvidia and its godzilla $10,000+ offerings
 

basix

Senior member
Oct 4, 2024
249
504
96
Cuz AT0 is only ~7 PF or so, AMD is not doing the same matrix cores on gaming dGPUs and DC GPUs like NVIDIA is doing with Rubin.

I have a few thoughts on that:
  • RDNA5 could double the matrix acceleration ratio, especially on AT0. I think that is already included in your ~7 PF figure?
  • Even with only ~7 PF, such an AT0 + MI450 tandem solution might still be a net win regarding TCO. This depends on the ratio between the context/prefill and generation/decode stages. As far as I understand LLM execution, with fewer TFLOPS during context/prefill your maximum context length is just smaller (at the same performance): maybe just 1 million tokens instead of 4 million tokens on Rubin CPX (see the rough sketch after this list)
  • If MI450 and its rackscale solution are cheap enough compared to Vera Rubin, Rubin CPX does not really matter to customers; just slim down AMD's margins. If MI450 costs less than 5x an AT0 (40 PFLOPS vs. 7 PFLOPS), better to use MI450 than AT0 for very large context lengths.
  • Not all systems and applications will benefit from Rubin CPX (shorter context length). And as the market is huge, there is some space for AMD to grow into
  • Maybe AMD surprises us with an integrated solution:
    • MI450 could feature additional memory controllers for GDDR7 or LPDDR6 (integrated in the MIDs)
    • If you could stack additional chiplets onto the MID, or alternatively onto the AID (CPUs for HPC; NPUs / pure matrix accelerators for context/prefill), then MI450 = (Rubin + Rubin CPX) in one package
    • The context/prefill accelerator chiplets would be the same size as a Zen 6 CCD and feature only low-precision matrix acceleration (FP8/INT8 and smaller). You could pack quite a few TFLOPS onto e.g. 4-6 of those chiplets per MI450 (e.g. 2-3 chiplets per MID -> 250-400mm2 of die space)
    • An XDNA3 NPU in 3nm will probably push 100 TFLOPS FP8 (dense) in roughly 15mm2 (XDNA2 with 50 TOPS is roughly 15mm2 in size). A total of 250-400mm2 would then yield ~1700-2700 TFLOPS FP8 dense, or ~6400-10,800 TFLOPS FP4 incl. sparsity (the arithmetic is sketched after this list). Now optimize XDNA3 for MI450 (remove FP16 etc. support) and that might double.
    • So let's assume ~20 PFLOPS of additional FP4 (incl. sparsity) integrated into MI450. 20/40 PFLOPS is not quite on par with Rubin CPX (2*30 / 50 PFLOPS), but could probably be used more efficiently. A unified memory space with the main GPU chiplets has other benefits, like a bigger overall memory buffer. And the +20 PFLOPS of NPU processing power could also be used for the generation/decode phase if the bandwidth between MID and AID is sufficient.
  • Or another surprise with HBM4:
    • The HBM logic base die might be custom and somewhat akin to Processing-in-Memory (PIM)
    • This could amplify bandwidth for the generation/decode stage, and you could use it more efficiently together with context/prefill (overlapping their execution)
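
On the prefill-vs-decode point in the second bullet: a rough back-of-the-envelope sketch of why prefill is compute-bound and decode is bandwidth-bound. All inputs are placeholders (a hypothetical 70B-parameter FP8 model, 8 TB/s of HBM bandwidth), not real AT0/MI450 specs, apart from the ~7 PF figure from this thread:

```python
# Why prefill stresses compute and decode stresses bandwidth (toy numbers).
params = 70e9                 # hypothetical 70B-parameter model
bytes_per_param = 1           # FP8 weights
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token

compute = 7e15    # ~7 PFLOPS dense (the AT0 figure above)
bandwidth = 8e12  # 8 TB/s HBM, a placeholder

# Prefill: the whole prompt is batched, arithmetic intensity is high,
# so time is set by compute.
prompt_tokens = 100_000
prefill_s = prompt_tokens * flops_per_token / compute

# Decode: each generated token must stream all weights from memory,
# so throughput is set by bandwidth.
decode_tok_s = bandwidth / (params * bytes_per_param)

print(f"prefill: ~{prefill_s:.0f} s for a {prompt_tokens}-token prompt")
print(f"decode:  ~{decode_tok_s:.0f} tokens/s per stream")
```

Halve the prefill compute and, at the same latency target, you can only afford roughly half the context; that's the 1M-vs-4M-token trade-off above.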
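
And to make the chiplet-area arithmetic reproducible (every input is an assumption as stated in the bullets: ~100 TFLOPS FP8 dense per ~15mm2 XDNA3 tile, FP4 = 2x FP8, sparsity = another 2x):

```python
# Chiplet-area arithmetic from the bullets above; all inputs are assumptions.
tile_tflops_fp8 = 100  # assumed XDNA3 FP8 dense TFLOPS per tile
tile_mm2 = 15          # assumed tile size, extrapolated from XDNA2

for total_mm2 in (250, 400):
    fp8_dense = total_mm2 / tile_mm2 * tile_tflops_fp8
    fp4_sparse = fp8_dense * 2 * 2  # FP4 doubles FP8; sparsity doubles again
    print(f"{total_mm2}mm2 -> ~{fp8_dense:.0f} TFLOPS FP8 dense, "
          f"~{fp4_sparse:.0f} TFLOPS FP4 w/ sparsity")
# 250mm2 -> ~1667 / ~6667; 400mm2 -> ~2667 / ~10667 (the ranges quoted above)
```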

I think Rubin CPX is a neat extension of the existing rackscale-solution idea. But AMD's engineers are not stupid, and the whole context/prefill vs. generation/decode topic is known to them too (and not new; you can find public info and best practices about it from 2023). As progress in HW and SW development is very fast, we might see one surprise or another in the coming months and years.
 
Last edited:

MrMPFR

Member
Aug 9, 2025
103
207
71
See the problem is that we're not really limited by box/tri testing for RTRT.
Blackwell was barely an improvement RTRT-wise, and that's because making RTRT faster is hard without going for some mildly unorthodox ways to build a GPU shader core.
What changes would an unorthodox GPU shader core introduce?

They are switching to real time path tracing?
Nope. PT was the 40 series. The 50 series unveiled NRC and Neural Materials (neural rendering), still not production-ready. The 60 series: neural rendering galore. Watch NVIDIA's HPG 2025 keynote. It's pretty obvious they want PT + advanced lighting effects approximated using tensor cores.

Even AMD is trying to offload BLAS to MLPs with LSNIF.
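
To make "offload BLAS to MLPs" concrete: the LSNIF idea is to replace a bottom-level acceleration structure with a small per-object network that maps a ray to a hit prediction. A hypothetical minimal sketch (the real LSNIF's input encoding, architecture, and training differ; this just shows the shape of the idea):

```python
import torch
import torch.nn as nn

# Toy stand-in for a neural intersection function: a per-object MLP that maps
# a ray (origin + direction) to a hit probability and a hit distance, instead
# of traversing that object's BVH. Purely illustrative, not AMD's network.
class NeuralIntersection(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),   # 3 origin + 3 direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # [hit logit, hit distance]
        )

    def forward(self, rays: torch.Tensor):     # rays: (N, 6)
        out = self.net(rays)
        return torch.sigmoid(out[:, 0]), out[:, 1]  # hit probability, t

hit_prob, t = NeuralIntersection()(torch.randn(1024, 6))
```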
 
Last edited:

MrMPFR

Member
Aug 9, 2025
103
207
71
That and using the xtor budget for bigger systolic arrays. Ain’t rocket science.
...And freezing or getting rid of high-precision ML formats to save on die area. Even for LLM training, NVFP4 seems fine.

The Blackwell whitepaper's GB202 specs table mentions FP4, FP8, FP16, TF32 and INT8 instructions. NVIDIA could freeze, gimp or retire pre-Hopper formats, then do 8X INT8 + FP8 and 12X NVFP4 dense (SemiAnalysis) per SM vs. GB202 to align with the Rubin DC tensor core.

what kind of magic sauce is nvidia putting into these chips to 6-8x FP4 Perf while barely increasing area and power

Nvidia advertises 4 PFLOPs for the PRO 6000 vs 30 PFLOPs for the CPX (both FP4 w sparsity)

FYI, it's 10X NVFP4 dense vs. the Pro 6000, with a 3:2 sparse:dense ratio like Rubin DC (SemiAnalysis). No info on effective NVFP4 clocks or design changes, but NVIDIA will have to lower clocks when running 20-30 PFLOPS of NVFP4. Quick ratio math below.
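
For anyone checking the math, a small sketch (4 and 30 PFLOPS are the marketing numbers quoted above; the 2:1 Blackwell and 3:2 Rubin sparse:dense ratios are my assumption based on the SemiAnalysis claim):

```python
# Implied dense NVFP4 uplift, Rubin CPX vs. RTX Pro 6000 (ratios are assumptions).
pro6000_sparse = 4.0  # PFLOPS FP4 w/ sparsity, Blackwell (2:1 sparse:dense)
cpx_sparse = 30.0     # PFLOPS FP4 w/ sparsity, Rubin CPX (3:2 sparse:dense)

pro6000_dense = pro6000_sparse / 2.0  # -> 2 PFLOPS dense
cpx_dense = cpx_sparse / 1.5          # -> 20 PFLOPS dense
print(f"dense uplift: {cpx_dense / pro6000_dense:.0f}x")  # -> 10x
```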
 
Last edited:

Cheesecake16

Member
Aug 5, 2020
31
105
106
Cuz AT0 is only ~7 PF or so, AMD is not doing the same matrix cores on gaming dGPUs and DC GPUs like NVIDIA is doing with Rubin.
As long as you have enough compute for RAG, your compute numbers don't really matter for inference, so 7 PF would be plenty; what matters is memory bandwidth and capacity...
As for AMD not using the same matrix cores for client and server... that very well could bite AMD in the behind something fierce without getting their PTX equivalent, amdgcnspirv, up and running in a non-experimental/beta state beforehand, for a number of reasons...
 

adroc_thurston

Diamond Member
Jul 2, 2023
7,094
9,851
106
that very well could bite AMD in the behind something fierce without getting their PTX equivalent, amdgcnspirv, up and running in a non-experimental/beta state beforehand, for a number of reasons...
Pray that Meta wants that and it'll be done in a matter of months.
 

soresu

Diamond Member
Dec 19, 2014
4,105
3,566
136
On X, Musk has "endorsed" AMD for small to mid-sized AI models. That is not a nothingburger.


Lol, at this point I'm pretty sure Lisa Su would rather he just act as if AMD doesn't exist.

His endorsement certainly isn't going to do them any favors now.