Discussion RDNA 5 / UDNA (CDNA Next) speculation


Z O X

Junior Member
Oct 31, 2022
23
26
61
RT benefits from stuff that CPUs do like out-of-order execution and branch prediction.

Hmmm... OoO will certainly increase IPC (very excited to see it implemented in a GPU), but I'm not sure if it's crucial for RT.
From my limited understanding, the BVH structure is "fixed" per frame, and once the ray is cast, calculating intersections is not a random task.
Changes to the BVH structure, along with better caches, should bring more benefits for RT, I suppose...
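To illustrate why the discussion keeps coming back to caches and data locality rather than raw ALU throughput: the traversal inner loop is a chain of data-dependent node loads, so each step waits on the previous fetch. A minimal, vendor-agnostic Python sketch of stack-based BVH traversal (node layout and field names are placeholders for illustration, not any real RT core):

# Sketch only: not any vendor's implementation; the node layout is made up.
def slab_test(bmin, bmax, origin, inv_dir, t_max):
    """Ray/AABB test via the slab method; returns True on overlap."""
    t0, t1 = 0.0, t_max
    for axis in range(3):
        near = (bmin[axis] - origin[axis]) * inv_dir[axis]
        far = (bmax[axis] - origin[axis]) * inv_dir[axis]
        if near > far:
            near, far = far, near
        t0, t1 = max(t0, near), min(t1, far)
        if t0 > t1:
            return False
    return True

def traverse(nodes, origin, inv_dir, t_max):
    """Walk a BVH with an explicit stack. Each iteration's node load depends on
    the previous iteration's result, so latency (caches, node layout), not math,
    dominates. Returns candidate primitive indices to intersect exactly."""
    hits, stack = [], [0]           # start at the root node
    while stack:
        node = nodes[stack.pop()]   # the latency-critical, data-dependent load
        if not slab_test(node["bmin"], node["bmax"], origin, inv_dir, t_max):
            continue
        if node["leaf"]:
            hits.extend(node["prims"])
        else:
            stack.append(node["left"])
            stack.append(node["right"])
    return hits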
 

soresu

Diamond Member
Dec 19, 2014
4,101
3,560
136
Hmmm... OoO will certainly increase IPC (very excited to see it implemented in a GPU), but I'm not sure if it's crucial for RT.
There are several different ways to do physically correct light transport and to augment the performance of rendering through techniques like ReSTIR, path guiding and such.

I'd be surprised if OoO compute had no significant impact on them.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,634
5,174
136
Maybe it is just me, but to me CPX isn't some sort of innovative new market grab but rather an admission that the VRAM bandwidth chase has become too expensive even for Nvidia and its Godzilla $10,000+ offerings.

If it is about the ratio of compute / memory bandwidth,

and the big invention of Rubin CPX is to cut (save on) memory bandwidth, then there is another way to affect that ratio.

Namely, increasing the compute. If FP4 is what it's all about, couldn't AMD release an MI400 version with chiplets that are FP4-optimized?

AMD is already planning two versions of MI400, one HPC- and one AI-oriented, so it already has experience turning the knobs to dial FP64 up or down.

So this would be just another version, using the same platform, that is all FP4, or FP4+FP8.
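To put the compute-vs-bandwidth point in numbers: for a bandwidth-bound kernel, halving the bytes per operand (FP8 -> FP4) doubles the math you can sustain from the same memory system, which is the same lever as adding FP4-optimized compute. A back-of-the-envelope Python sketch with placeholder figures (not actual MI400 or Rubin CPX specs):

# Placeholder numbers only; nothing here is a real MI400 or Rubin CPX spec.
def sustained_tflops(mem_bw_gbs, bytes_per_value, flops_per_value):
    """Throughput a bandwidth-bound kernel can sustain, limited by how many
    operands per second the memory system can deliver."""
    values_per_s = mem_bw_gbs * 1e9 / bytes_per_value
    return values_per_s * flops_per_value / 1e12

bw = 8000     # GB/s, hypothetical HBM bandwidth
reuse = 2     # FLOPs per fetched value (low reuse, e.g. decode-phase GEMV)

print("FP8:", sustained_tflops(bw, 1.0, reuse), "TFLOPS")   # 16 TFLOPS
print("FP4:", sustained_tflops(bw, 0.5, reuse), "TFLOPS")   # 32 TFLOPS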
 
  • Like
Reactions: Tlh97 and MrMPFR

MrMPFR

Member
Aug 9, 2025
103
206
71
Hmmm... OoO will certainly increase IPC (very excited to see it implemented in a GPU), but I'm not sure if it's crucial for RT.
From my limited understanding, the BVH structure is "fixed" per frame, and once the ray is cast, calculating intersections is not a random task.
Changes to the BVH structure, along with better caches, should bring more benefits for RT, I suppose...
It has to be significant, otherwise Kepler wouldn't have mentioned it alongside branch prediction. But an explanation would be appreciated.

CMIIW, but aren't OoO execution and branch prediction a slippery slope towards MIMD and CPU-style mega-cores, and isn't the entire point of GPUs SIMD parallelism?
It seems like it would be preferable to prioritize data locality and other methods for boosting SIMD occupancy rather than brute-forcing the issue with complex branch prediction and OoO execution.

A future GPU design could accomplish this by implementing the following (non-exhaustive):
  1. SWC/Thread coherency sorting
  2. Other forms of coherency sorting, like ray coherency sorting as seen in the PowerVR Photon (rough sketch of the general idea below).
  3. Memory and scheduling changes prioritizing data locality and fine-grained control
No. 3 requires the GPU Work Graphs API to maximize performance and benefits.
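For reference, the general idea behind No. 2 (not PowerVR Photon's or any vendor's actual implementation): bin rays by a cheap key such as direction octant and hit material, so the rays packed into one wave branch the same way and touch similar data. A rough Python sketch:

# Sketch of generic ray coherency sorting; the key choice is illustrative only.
from collections import defaultdict

def sort_key(ray, material_id):
    # 3 sign bits of the direction (octant) plus the shader/material the hit invokes.
    octant = (int(ray["dir"][0] < 0) << 2) | (int(ray["dir"][1] < 0) << 1) | int(ray["dir"][2] < 0)
    return (material_id, octant)

def build_coherent_waves(rays, materials, wave_size=32):
    """Group rays into wave_size batches sharing a key, reducing branch
    divergence and scattered memory traffic within each wave."""
    bins = defaultdict(list)
    for ray, mat in zip(rays, materials):
        bins[sort_key(ray, mat)].append(ray)
    waves = []
    for bucket in bins.values():
        for i in range(0, len(bucket), wave_size):
            waves.append(bucket[i:i + wave_size])
    return waves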

But OoO scheduling on GPUs could still happen at some point. Not the kind CPUs do, but it seems there's a method that goes more than half of the way towards an idealized implementation with a tiny area overhead of 0.007%: 6.9% median speedup, up to 36%, no slowdowns, and 100x less area overhead than implementing MIMD on a GPU via FSMs (not the same thing, just for comparison). This obviously won't be AMD's or NVIDIA's exact implementation, but GhOST is the first method without the usual drawbacks like slowdowns and large area overhead, so their implementations could resemble it in some areas.
- GhOST OoO paper: https://ieeexplore.ieee.org/document/10609594
- MIMD execution on GPU patent (AMD) https://www.patents-review.com/a/20...s-enabling-mimd-like-execution-flow-simd.html
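A conceptual Python sketch of the limited-OoO idea (in the spirit of GhOST, not the paper's exact microarchitecture): when the oldest instruction in a warp's buffer is stalled, scan a small window for a younger instruction whose operands are already ready and issue that instead. Window size, field names and the barrier rule below are assumptions.

# Conceptual only; fields, window size and the barrier rule are assumptions.
def pick_issue_slot(ibuffer, busy_regs, window=4):
    """ibuffer: decoded instructions for one warp, oldest first.
    busy_regs: registers still waiting on outstanding results (scoreboard).
    Returns the index of the instruction to issue this cycle, or None."""
    for i, instr in enumerate(ibuffer[:window]):
        if instr.get("barrier") and i > 0:
            break      # a barrier must wait until it is the oldest instruction
        if not (set(instr["src_regs"]) & busy_regs):
            return i   # operands ready; may issue ahead of older, stalled ops
    return None        # everything in the window is stalled this cycle

The win would come from hiding short stalls without a CPU-style reorder buffer or register renaming, which is presumably why the quoted area cost stays so tiny.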
IIRC Kepler said the MIMD patent was for CDNA:

It's less about the BVH structure and more about how ray-traversal-related scheduling, execution and memory accesses are handled on a GPU. Ideally you would want data contained within the SM from start to finish, even with multiple bounces. That should lower power consumption, slash memory and scheduling latency and boost performance.
But BVH still needs fundamental changes and RTX Mega Geometry isn't enough.
 

gdansk

Diamond Member
Feb 8, 2011
4,568
7,681
136
How big is the cloud gaming market?

Vs. the inference market Nvidia is targeting. It seems to me cloud gaming is much smaller, by an order of magnitude or more.
Since they actually have a shot there, it is bigger than the 3% of the inference market they could win.
 

gdansk

Diamond Member
Feb 8, 2011
4,568
7,681
136
3% of inference may be more than all of cloud gaming. And then, there is the other 97% of inference.

You're looking at an entirely projected market.
Still, there is no scenario where AMD's FP4-light part seizes any portion of that market.
"Cloud" visualization has market projections too; I have posted them before. I find them all questionable, but still larger than the 3% of inference AMD could get with maximum effort and tailor-made inference parts. Mind you, they won't use AT0 for that market in any case. That's the other half of their graphics business.
 

ToTTenTranz

Senior member
Feb 4, 2021
686
1,147
136
It's a bigger market than PC 'handhelds' for sure.
Which is why it gets a chip, and them things, don't. Simple as.

Why would there be a need for a dedicated chip? Decent handheld chips can simply come from ultrabooks: same battery capacity, same thermal dissipation. Medusa + AT4 is going into handhelds for sure, and I bet some models will even use AT3.


And by the way, the PS6 handheld is a thing, which is pretty much PC hardware.
 

adroc_thurston

Diamond Member
Jul 2, 2023
7,074
9,825
106
Why would there be a need for a dedicated chip?
They're tablets dawg.
They need discrete 10W parts.
Medusa + AT4 is going into handhelds for sure, and I bet some models will even use AT3.
That's 30 and 85W.
Idk none of that works.
And by the way, the PS6 handheld is a thing, which is pretty much PC hardware.
They're wholly bespoke in their ecosystem.
And yes, that one will get custom silicon.
Volumes! Actual real volumes.
 

ToTTenTranz

Senior member
Feb 4, 2021
686
1,147
136
They're tablets dawg.
Tablets are 6mm thick with passive cooling.
The Xbox Ally X is 50mm thick with 2x fans.


They need discrete 10W parts.
They've used Phoenix, Strix Point and Rembrandt so far. They're not 10W.
Not even the Switch 2 is a 10W part. It pulls 30W at the wall in docked mode.

That's 30 and 85W.
That's Strix Point and Strix Halo. There are gaming handhelds with both.
Those 30 and 85W parts either need to be able to scale down in clocks and power, or they won't get a single design win.
 

marees

Golden Member
Apr 28, 2024
1,737
2,378
96
It's a bigger market than PC 'handhelds' for sure.
Which is why it gets a chip, and them things, don't. Simple as.
I prefer the Nvidia GeForce Now model, where you have prepaid/pay-per-play (and even free credits). Unfortunately Nvidia isn't available in my region despite a compliant Samsung TV.

Looks like Microsoft is gradually getting there with more affordable subscription plans.
 

Kaluan

Senior member
Jan 4, 2022
515
1,092
106
The MDS2 successor in 2029/2030 (with RDNA 5 or RDNA 6) would be the bee's knees
Sorry, my brain went on vacation there for a second, what is "MDS2"?

PS: If CU=WGP is indeed a thing with RDNA5/UDNA (all of the other substantial changes aside), I'm guessing that means a CU would also house 2x the "Ray Accelerator" units as before, right? Next to the 128SPs and whatever else they double or beef up per unit.
 

marees

Golden Member
Apr 28, 2024
1,737
2,378
96
Sorry, my brain went on vacation there for a second, what is "MDS2"?

PS: If CU=WGP is indeed a thing with RDNA5/UDNA (all of the other substantial changes aside), I'm guessing that means a CU would also house 2x the "Ray Accelerator" units as before, right? Next to the 128SPs and whatever else they double or beef up per unit.
MDS-1 = Medusa Point, aka the Strix Point successor, with 8 CUs & 50? TOPS

MDS-2 = Medusa Point "little", a new low-power variant, probably on a new TSMC low-power node, with 4? CUs & 50? TOPS

MDS-3 = Medusa Point "baby", aka Bumblebee, a future Mendocino replacement in theory, but on TSMC 3nm so it will continue to be priced a tier above Mendocino

Now imagine Zen 7 monolithic APUs in 2029-2030 with RDNA 5.5??

They will not need a 50 or 75 TOPS NPU, as RDNA 5.5 by itself has that capability. So let's assume the Zen 7 successors to MDS-1, MDS-2 & MDS-3 all have iGPUs that can do a minimum of 75 TOPS.

So an MDS-2 successor with Zen 7 & RDNA 5.5 would be ideal for a monolithic handheld to replace the Steam Deck & compete with the Switch 2 & PS6 handheld.
 
  • Like
Reactions: Kaluan