Discussion RDNA 5 / UDNA (CDNA Next) speculation

marees · Aug 4, 2025

Not sure if this was mentioned earlier.

Here comes Helios

Slated to be available in 2026, AMD’s Helios rack infrastructure is a unified architecture designed for both frontier model training and large-scale inference, delivering “leadership” across compute density, memory bandwidth, and scale-out interconnect.

The double-wide Helios AI rack is fully integrated with AMD’s Zen 6 Epyc CPUs, MI400 GPUs, and Vulcano NICs.

AMD launches Instinct MI350 GPUs, unveils double-wide Helios AI rack-scale system

Announcements made at chip designer’s Advancing AI conference

www.datacenterdynamics.com

soresu · Aug 4, 2025

dr1337 said:
should be pretty easy for AMD to squeeze out 3x with RDNA5

MLID is claiming this for raster tho, not RT/PT.

This is what makes me think it's pretty bad info.

soresu · Aug 4, 2025

ToTTenTranz said:
Due to Infinity Cache and higher clocks on the desktop RDNA2 cards, the PS5 and Series X perform close to the 6700 non-XT.

The consoles of that generation also have fancy game asset compression, compounded by direct SSD access by default, which can't be overstated as an advantage.

ToTTenTranz · Aug 4, 2025

soresu said:
The consoles of that generation also have fancy game asset compression, compounded by direct SSD access by default, which can't be overstated as an advantage.

Yes, and 5 years later we still don't have any decent PC replica, let alone an upgrade, of the PS5's I/O engine. With PCIe 5.0 SSDs doing 14GB/s reads we can now brute-force a similar throughput to the VRAM, but DirectStorage is virtually stillborn at the moment, so we're at that weird point where the $400, 5-year-old PS5 gets shorter loading times in most games than a month-old $4000 PC.

However I wouldn't say that counts for GPU performance unless we're talking about VRAM limits and traversal stutter.

Win2012R2 · Aug 4, 2025

ToTTenTranz said:
PS5's I/O engine.

The key part of it is hardware decompression (dedicated, not on the GPU), which is present in the Xbox too, but for some odd reason there is no support for it in DirectX on PC. GPU decompression is a crap idea; perhaps the next "Xbox" will finally have it added. Without that, DirectStorage is not solving much.

fastandfurious6 · Aug 6, 2025

ToTTenTranz said:
STX Halo reaches like 95% of its gaming performance at 55W, matching a PS5 or Series X. And that's with an extra CCD die that won't be of any use for games.

AMD could make a 3nm chip that has 3x PS5 raster performance at 160W today. It just wouldn't be cheap.

100% this

betting lots of dollaroos that ps6 can get three times ++ ps5 perf

marees · Aug 6, 2025

fastandfurious6 said:
100% this

betting lots of dollaroos that ps6 can get three times ++ ps5 perf

My guess:

PS6 = 5070 ti == cut down xx70 version of RDNA 5 === xx60 xt version of RDNA 6

soresu · Aug 6, 2025

fastandfurious6 said:
100% this

betting lots of dollaroos that ps6 can get three times ++ ps5 perf

Also bear in mind that on the RT perf side, early PS5 games had fewer software-technique advantages than early PS6 games will have. Some of those will depend on ML augmentation that wasn't really viable until the more ML/AI-focused hardware improvements of PS5 Pro/RDNA4, which will only be compounded with RDNA5.

soresu · Aug 6, 2025

Win2012R2 said:
The key part of it is hardware decompression (dedicated, not on the GPU), which is present in the Xbox too, but for some odd reason there is no support for it in DirectX on PC.

It was supposed to be in the DirectStorage spec, but ended up being pushed to a later version of the standard.

MS seem to be really dropping the ball on gaming over the last few years.

Khronos and/or Valve and other gaming industry leaders need to take the reins on this and get things moving.

The existence of Project Amethyst seems to show that Sony is taking a more active role in that equation, so maybe they might throw their hat in too.

adroc_thurston · Aug 6, 2025

soresu said:
Khronos and/or Valve and other gaming industry leaders need to take the reins on this and get things moving.

The existence of Project Amethyst seems to show that Sony is taking a more active role in that equation, so maybe they might throw their hat in too.

Khronos hasn't done anything useful in eons, and Sony stuff is just a relatively tiny collab with AMD.

MS will always be the driving force behind PC gaming. Just the way things are.

Win2012R2 · Aug 7, 2025

soresu said:
It was supposed to be in the DirectStorage spec, but ended up being pushed to a later version of the standard.

They pushed it to GPU decompression, which of course creates a problem: the GPU is busy drawing stuff, so there are frame drops. It's a crap solution really. Maybe if neural texture compression takes off, that is where the GPU will excel, but I still can't see any way around a dedicated hardware decompressor. It costs $1 max.

marees · Aug 7, 2025

Dense Geometry Format (DGF)

End result: The prefilter and DGF nodes allow for a smaller BVH footprint, a massively reduced load on the memory subsystem, and permit fast low precision parallel bulk processing of triangle intersection tests. As a result a sizeable speedup is achieved while area investment for ray tri intersect logic is reduced.

Multiple Ray Tracing Patents Filings

One about configurable convex polygon ray/edge testing which allows sharing of results from edges between polygons eliminating duplicative intersection tests. This has the following benefit:

"By efficiently sharing edge test results among polygons with shared edges, inside/outside testing for groups of polygons can be made more efficient."

It can be implemented via full or reduced precision and makes ray tracing more cost-effective
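
To make the shared-edge idea concrete, here is a rough sketch in Python. It is a simplification under my own assumptions, not the patent's method: the ray is reduced to a 2D point (as if the polygons were already projected into the ray's plane), and the names `edge_function` and `hit_polygons` are made up for illustration. The point is only that an edge shared by two polygons is evaluated once, and the neighbour reuses the result with the sign flipped.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def edge_function(a: Point, b: Point, p: Point) -> float:
    """Signed area test: positive when p lies to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def hit_polygons(
    vertices: Sequence[Point],
    polygons: Sequence[Sequence[int]],
    p: Point,
) -> Tuple[List[int], int]:
    """Return (ids of convex CCW polygons containing p, number of edge tests run)."""
    cache: Dict[Tuple[int, int], float] = {}
    evaluations = 0
    hits: List[int] = []
    for poly_id, poly in enumerate(polygons):
        inside = True
        for i in range(len(poly)):
            i0, i1 = poly[i], poly[(i + 1) % len(poly)]
            key = (min(i0, i1), max(i0, i1))  # canonical, direction-free edge id
            if key not in cache:
                cache[key] = edge_function(vertices[key[0]], vertices[key[1]], p)
                evaluations += 1
            # A neighbour walks the shared edge in the opposite direction,
            # so it reuses the cached value with the sign flipped.
            value = cache[key] if (i0, i1) == key else -cache[key]
            if value < 0:
                inside = False
        if inside:
            hits.append(poly_id)
    return hits, evaluations


# A unit quad split into two triangles that share the diagonal 0-2.
quad = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
tris = [(0, 1, 2), (0, 2, 3)]
print(hit_polygons(quad, tris, (0.75, 0.25)))
```

Two triangles would normally need six edge tests; with the shared diagonal cached, five are enough, and the saving grows with the number of shared edges in a mesh.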

Three other patent filings leverage displaced micro-meshes (DMMs) and an accelerator unit (AU) that creates them.
I cannot figure out how this DMM implementation differs from NVIDIA's now deprecated DMM implementation in Ada Lovelace, but it sounds very similar although some differences are probably to be expected.
IDK what benefits are to be expected here except perhaps lower BVH build cost and size.

Streaming Wave Coalescer (SWC)

The Streaming Wave Coalescer implements thread coherency sorting similar to Intel's TSU and NVIDIA's SER implementations. It does this by using sorting bins and hard keys to sort divergent threads across waves following the same instruction path, thereby coalescing the threads into new waves.

The spill-after programming model offers developers granular control over when and how thread state is spilled to memory when reordering executions to different lanes. This helps avoid excessive cache usage and memory access operations resulting in large increases in latency and costly front-end stalls when leveraging SWC.

Just like SER, the SWC would help boost path tracing performance, although the implementation looks different and appears to be enabled by default.
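
A toy model of the binning step, for anyone who wants to see the idea in code. This is only a sketch under my own assumptions: the wave size of 4, the string keys, and the function name `coalesce` are invented for readability, and nothing here reflects the actual hardware design or the spill-after model.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

WAVE_SIZE = 4  # assumption: tiny wave so the example stays readable


def coalesce(
    threads: Iterable[Tuple[int, Hashable]],
    wave_size: int = WAVE_SIZE,
) -> List[Tuple[Hashable, List[int]]]:
    """Stream (thread_id, sort_key) pairs into bins; emit a wave when a bin fills."""
    bins: Dict[Hashable, List[int]] = defaultdict(list)
    waves: List[Tuple[Hashable, List[int]]] = []
    for thread_id, key in threads:
        bins[key].append(thread_id)
        if len(bins[key]) == wave_size:
            # Bin is full: all of these threads follow the same path,
            # so they can be launched together as one coherent wave.
            waves.append((key, bins.pop(key)))
    # Flush whatever is left as partially filled waves.
    for key in sorted(bins):
        waves.append((key, bins[key]))
    return waves


# Two incoming waves whose threads diverge between shader paths "A" and "B".
incoming = [(0, "A"), (1, "B"), (2, "A"), (3, "B"),
            (4, "B"), (5, "A"), (6, "A"), (7, "B")]
for key, members in coalesce(incoming):
    print(key, members)
```

The two mixed incoming waves come out as one wave that is all "A" and one that is all "B", which is the whole benefit: every lane in a wave does useful work instead of being masked off.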

Local Launchers and Work Graph Scheduler

End result:
#1 Decentralized local scheduling: A decentralized GPU scheduling architecture that delegates scheduling to the lowest possible level in scheduling hierarchy while handing over almost complete scheduling autonomy to the Shader Engines (via WGS) and allowing WGPs to launch their own work. Improves scheduling latency and allows much more fine grained scheduling.
#2 Bottom-up scalable architecture: This is a bottom-up instead of top-down GPU scheduling paradigm. Everything operates on the assumption that local knows best, although brakes are built into the system: a higher-level scheduler takes control if a local scheduler is overloaded or can't feed its WGPs properly. Since each SE functions as its own GPU core, scaling is no longer dictated by the scheduling capabilities of the global processor but by how quickly it can prepare work and do load balancing across SEs.
#3 A boon for chiplet based GPUs: Preparing work in a global shared mailbox and doing some load balancing across SEs is far less demanding than micromanaging everything. As a result wider GPU designs should benefit the most and for chiplet based architectures the speedup could be even greater due to the latency mitigation and bottom up scheduling paradigm.
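
The three points above can be sketched as a tiny queue model. Everything in it is an assumption of mine for illustration (the class `ShaderEngine`, the functions `submit` and `rebalance`, and the fixed per-SE capacity); the filings describe hardware, not this code. It only shows the shape of the idea: work stays local until a local queue is full, and only the overflow goes through the shared mailbox.

```python
from collections import deque
from typing import Deque, List


class ShaderEngine:
    """One scheduling domain with its own local work queue (stand-in for a WGS)."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.queue: Deque[int] = deque()


def submit(se: ShaderEngine, item: int, mailbox: Deque[int]) -> None:
    """Keep work local when possible; spill to the global mailbox when overloaded."""
    if len(se.queue) < se.capacity:
        se.queue.append(item)
    else:
        mailbox.append(item)


def rebalance(engines: List[ShaderEngine], mailbox: Deque[int]) -> None:
    """Global step: hand spilled items to the least loaded engine that has room."""
    while mailbox:
        target = min(engines, key=lambda se: len(se.queue))
        if len(target.queue) >= target.capacity:
            break  # every engine is full, leave the rest in the mailbox
        target.queue.append(mailbox.popleft())


se0, se1 = ShaderEngine("SE0", capacity=2), ShaderEngine("SE1", capacity=2)
mailbox: Deque[int] = deque()
for item in range(4):
    submit(se0, item, mailbox)  # all work initially lands on SE0
rebalance([se0, se1], mailbox)
print(list(se0.queue), list(se1.queue), list(mailbox))
```

The global side never touches items that fit locally, which is why its cost should grow much more slowly with GPU width than a scheduler that places every item itself.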

A Few Important Patents Filings

The RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE patent filing allows shaders (general purpose) to emulate fixed function HW and take over when a fixed function bottleneck is happening.

Another patent filing, ACCELERATED DRAW INDIRECT FETCHING, leverages fixed function hardware (Accelerator) to speed up indirect fetching, resulting in lower computational latency, and allows "...different types of aligned or unaligned data structures are usable with equivalent or nearly equivalent performance."

https://www.reddit.com/r/hardware/comments/1mjiusp/amds_postrdna_4_patent_filings_signal_major

marees · Aug 7, 2025

TL;DR for patents

Dense Geometry Format/DGF: Block-based geometry compression. Reduces VRAM usage, BVH build size and reduces RT's load on memory and cache subsystems. Leveraging DGF nodes alongside prefiltering nodes to do bulk processing of triangles results in a significant speedup of triangle intersection testing at a lower area overhead than current method.
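
For a feel of why block-based compression saves memory, here is a minimal sketch. It is not the real DGF bit layout: the 16-bit offsets, the single shared scale, and the function names `compress_block` and `decompress_block` are assumptions picked to keep the example short. Each block stores one anchor plus small integer offsets instead of three full floats per vertex.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Offset = Tuple[int, int, int]


def compress_block(
    vertices: Sequence[Vec3], bits: int = 16
) -> Tuple[Vec3, float, List[Offset]]:
    """Quantize a block of vertices to integer offsets from the block's minimum corner."""
    anchor = tuple(min(v[a] for v in vertices) for a in range(3))
    extent = max(max(v[a] for v in vertices) - anchor[a] for a in range(3))
    levels = (1 << bits) - 1
    scale = extent / levels if extent > 0 else 1.0
    offsets = [
        tuple(round((v[a] - anchor[a]) / scale) for a in range(3)) for v in vertices
    ]
    return anchor, scale, offsets


def decompress_block(
    anchor: Vec3, scale: float, offsets: Sequence[Offset]
) -> List[Vec3]:
    """Rebuild approximate vertex positions from the anchor and integer offsets."""
    return [tuple(anchor[a] + o[a] * scale for a in range(3)) for o in offsets]


block = [(10.0, 20.0, 30.0), (10.5, 20.25, 30.125), (11.0, 21.0, 31.0)]
anchor, scale, offsets = compress_block(block)
restored = decompress_block(anchor, scale, offsets)
worst = max(abs(r[a] - v[a]) for r, v in zip(restored, block) for a in range(3))
print(offsets, worst)
```

With 16-bit offsets a vertex costs 6 bytes instead of 12 for three 32-bit floats, plus a small fixed header per block, and the worst-case position error is bounded by the quantization step.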

Configurable convex polygon ray/edge testing: Allows sharing of edge test results between polygons, eliminating duplicative inside/outside testing.

DMM patents: Three patents about AMD's Displaced Micro Meshes implementation. Looks related to NVIDIA's deprecated DMM implementation in Ada Lovelace. It uses bounding prisms on top of interpolated DMM (made from base triangle) to find ray intersections and alternatively subprisms within the prism.

Streaming Wave Coalescer (SWC) circuit: Does thread coherency sorting of divergent threads + spill-after programming model for devs to maximize benefits of SWC. Accomplishes same thing as NVIDIA's SER and Intel's TSU. Very important for path tracing.

Workgroup self-launch: Local launchers allow workgroup processors (WGPs) to launch work on their own, independent of the Shader Processor Input (SPI). They also maintain queues and resource management on their own but can lease resources from the SPI.

Work Graph Scheduler (WGS): Shader Engine (SE) level scheduler that operates independent from global scheduler except sharing mail box (global work item store) benefitting from load balancing across SEs. WGS results in finer grained scheduling and lower latency improving performance.

If a WGS is overloaded or WGPs within a SE are underutilized work items are migrated to other WGS/SEs via the global scheduler and global data share/mail box ensuring load balancing across SEs.
In addition one Asynchronous Dispatch controller operates under WGS within each SE. It builds waves and launches work for WGPs within each SE.

The WGS change allows the GPU to be far more scalable, as each scheduling domain is limited to a group of WGPs within an SE. The global scheduler's only job is to prepare work for the SEs and ensure even load balancing across Shader Engines. Scaling efficiency is no longer dictated by the front end's ability to schedule for the entire GPU but by how quickly it can load and prepare work items and do load balancing.

The scheduling changes proposed in patents are quite significant. They could be beneficial to branchy code such as work graphs and WGS would probably enable superior scaling for wider GPUs.

Fixed function emulation: Shaders could emulate fixed function HW and take over via a reconfigurable virtual graphics and compute processor pipelines when FF HW becomes overwhelmed.

Accelerated indirect draw fetching: Using accelerators would help speed up indirect draw fetching leading to lower computational latency and no longer making alignment of data structures important for performance.

https://www.reddit.com/r/hardware/comments/1mjiusp/amds_postrdna_4_patent_filings_signal_major

Keller_TT · Aug 7, 2025

marees said:
My guess:

PS6 = 5070 ti == cut down xx70 version of RDNA 5 === xx60 xt version of RDNA 6

GPU-wise, I think it'll be approx 9070, maybe with a bit better RT. The CPU's big bump from Zen 2 to 6: I'm thinking approx. 7800X3D considering clocks and SoC TDP.

A 5070 Super with 18GB and up to 10% perf boost, that'll probably be a close approximation, with 7800X3D. PS5 is ~3700X + 2070S.

Plenty good for a console, with all the PlayStation tooling. Let 'em bank great power efficiency and hit PS5 launch prices (with a disc please). Sony will likely get a better launch reception than the PS5 did.

eek2121 · Aug 8, 2025

gdansk said:
>90%.

It’s not that high lol.

Even if it gets killed, they’ll still pump out a pro SKU with similar specs for AI.

SolidQ · Aug 8, 2025

Keller_TT said:
GPU wise, I think it'll be approx 9070, may be with a bit better RT. The CPU's big bump from Zen 2 to 6: I'm thinking approx. 7800X3D considering clocks and SoC TDP.

Kepler already was saying
PS6 ~9070XT
XBOX ~ RTX 5080

jpiniero · Aug 8, 2025

SolidQ said:
Kepler already was saying
PS6 ~9070XT

That's pretty unrealistic, esp given the rumor of it also being 160 W for the system power draw.

Win2012R2 · Aug 8, 2025

Keller_TT said:
I think it'll be approx 9070, may be with a bit better RT.

There is no way RT will be "a bit better" - it will need to be x4 at least for PS6 to last another 7-8 years (that's till 2035, we should have flying cars on garbage by then)

Big question is what the memory will be like: 24 GB VRAM probably at best, and if we're lucky some extra slower LPDDR6. How else will they run even small AI models if they don't have the RAM? This thing is supposed to last for a while.

Saylick · Aug 8, 2025

jpiniero said:
That's pretty unrealistic, esp given the rumor of it also being 160 W for the system power draw.

Kepler also did say that the perf/W gains are overly optimistic, so if 9070 XT performance is achieved, it likely won't be as low as 160W.

adroc_thurston · Aug 8, 2025

Saylick said:
Kepler also did say that the perf/W gains are overly optimistic, so if 9070 XT performance is achieved, it likely won't be as low as 160W.

Yeah it'll be 180W.
Whatever.

Saylick · Aug 8, 2025

adroc_thurston said:
Yeah it'll be 180W.
Whatever.

That would imply if AMD created a GPU with twice the hardware as the PS6 GPU, they’d be able to hit 5090 levels of performance in a 350W TDP.

gdansk · Aug 8, 2025

Saylick said:
That would imply if AMD created a GPU with twice the hardware as the PS6 GPU, they’d be able to hit 5090 levels of performance in a 350W TDP.

It doesn't imply anything about scaling up.
It would have to be about 35-40% more efficient than their existing N48 chop.
If there is a die shrink and some new IP it isn't unheard of.

adroc_thurston · Aug 8, 2025

Saylick said:
they’d be able to hit 5090 levels of performance in a 350W TDP.

looks at AT0 gfx13
Yeah?

gdansk said:
It doesn't imply anything about scaling up

gfx13 does introduce a grabbag of tricks to make high WGP count blobs work a lot better.

Kepler_L2 · Aug 8, 2025

Saylick said:
Kepler also did say that the perf/W gains are overly optimistic, so if 9070 XT performance is achieved, it likely won't be as low as 160W.

Yeah, 9070XT was 250W in early 2024.

adroc_thurston · Aug 8, 2025

Kepler_L2 said:
Yeah, 9070XT was 250W in early 2024.

276W but I digress.
Mind you, product binning decisions also influence the final v/f quite a bit.
