Discussion RDNA 5 / UDNA (CDNA Next) speculation


marees

Golden Member
Apr 28, 2024
1,516
2,116
96
Not sure if this was mentioned earlier.

Here comes Helios

Slated to be available in 2026, AMD’s Helios rack infrastructure is a unified architecture designed for both frontier model training and large-scale inference, delivering “leadership” across compute density, memory bandwidth, and scale-out interconnect.

The double-wide Helios AI rack is fully integrated with AMD’s Zen 6 Epyc CPUs, MI400 GPUs, and Vulcano NICs.

 
Reactions: lightmanek

soresu

Diamond Member
Dec 19, 2014
3,995
3,446
136
Due to Infinity Cache and higher clocks on the desktop RDNA2 cards, the PS5 and Series X perform close to the 6700 non-XT.
The consoles of that generation also have fancy game asset compression, compounded by direct SSD access by default, which can't be overstated as an advantage.
 

ToTTenTranz

Senior member
Feb 4, 2021
547
980
136
The consoles of that generation also have fancy game asset compression, compounded by direct SSD access by default, which can't be overstated as an advantage.
Yes, and 5 years later we still don't have any decent PC replica, let alone upgrade, of the PS5's I/O engine. With PCIe 5.0 SSDs doing 14GB/s reads we can now brute-force a similar throughput to the VRAM, but DirectStorage is practically stillborn at the moment, so we're at that weird moment where the $400 5-year-old PS5 gets shorter loading times in most games than a month-old $4000 PC.
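As a rough back-of-envelope sketch of why raw SSD speed alone doesn't settle this: delivered asset throughput is capped by whichever is slower, the (compressed) read path or the decompressor. Only the 14GB/s figure comes from the post; the function name, the asset size, the compression ratio and the decompressor rate are all hypothetical placeholders.

```python
def load_time_seconds(asset_gb, read_gbps, compression_ratio=1.0,
                      decompress_gbps=float("inf")):
    """Seconds to deliver `asset_gb` of *uncompressed* assets.

    read_gbps         raw SSD read rate in GB/s (14 is the figure from the post)
    compression_ratio uncompressed/compressed size (assumed, not measured)
    decompress_gbps   output rate of the decompressor (assumed, not measured)
    """
    # Reading compressed data multiplies the effective rate, but the
    # decompressor's output rate puts a ceiling on what reaches memory.
    delivered_gbps = min(read_gbps * compression_ratio, decompress_gbps)
    return asset_gb / delivered_gbps


# Hypothetical 14 GB of assets on a 14 GB/s drive.
raw = load_time_seconds(14.0, 14.0)
compressed = load_time_seconds(14.0, 14.0, compression_ratio=2.0)
bottlenecked = load_time_seconds(14.0, 14.0, compression_ratio=2.0,
                                 decompress_gbps=7.0)
```

With these made-up numbers the drive itself needs about a second, so anything much longer than that is coming from the decompression and software path rather than from the SSD.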


However I wouldn't say that counts for GPU performance unless we're talking about VRAM limits and traversal stutter.
 

Win2012R2

Golden Member
Dec 5, 2024
1,178
1,206
96
PS5's I/O engine.
The key part in it is hardware decompression (dedicated, not on the GPU), which is present in Xbox too, but for some odd reason there is no support for it in DirectX on PC. GPU decompression is a crap idea. Perhaps the next "Xbox" will finally have it added; without that, DirectStorage is not solving much.
 

fastandfurious6

Senior member
Jun 1, 2024
689
869
96
STX Halo reaches like 95% of its gaming performance at 55W, matching a PS5 or a Series X. And that's with an extra CCD die that won't be of any use for games.

AMD could make a 3nm chip that has 3x PS5 raster performance at 160W today. It just wouldn't be cheap.


100% this

betting lots of dollaroos that ps6 can get three times ++ ps5 perf
 
Reactions: Tlh97 and marees

soresu

Diamond Member
Dec 19, 2014
3,995
3,446
136
100% this

betting lots of dollaroos that ps6 can get three times ++ ps5 perf
Also bear in mind that on the RT perf side, early PS5 games had fewer software technique advantages than early PS6 games will. Some of those will depend on ML augmentation that wasn't really viable until the more ML/AI-focused hw improvements of PS5 Pro/RDNA4, which will only be compounded with RDNA5.
 

soresu

Diamond Member
Dec 19, 2014
3,995
3,446
136
The key part in it is hardware decompression (dedicated, not on the GPU), which is present in Xbox too, but for some odd reason there is no support for it in DirectX on PC
It was supposed to be in the DirectStorage spec, but ended up being pushed to a later version of the standard.

MS seem to be really dropping the ball on gaming over the last few years.

Khronos and/or Valve and other gaming industry leaders need to take the reins on this and get things moving.

The existence of Project Amethyst seems to show that Sony is taking a more active role in that equation, so maybe they might throw their hat in too.
 

adroc_thurston

Diamond Member
Jul 2, 2023
6,393
9,011
106
Khronos and/or Valve and other gaming industry leaders need to take the reins on this and get things moving.

The existence of Project Amethyst seems to show that Sony is taking a more active role in that equation, so maybe they might throw their hat in too.
Khronos hasn't done anything useful in eons, and Sony stuff is just a relatively tiny collab with AMD.

MS will always be the driving force behind PC gaming. Just the way things are.
 

Win2012R2

Golden Member
Dec 5, 2024
1,178
1,206
96
It was supposed to be in the DirectStorage spec, but ended up being pushed to a later version of the standard.
They pushed it to GPU decompression, which of course creates a problem: the GPU is busy drawing stuff, so there are frame drops. Crap solution really. Maybe if neural texture compression takes off, that is where the GPU will excel, but I still can't see any way around a dedicated hardware decompressor, and it costs $1 max.
 
Reactions: marees

marees

Golden Member
Apr 28, 2024
1,516
2,116
96

Dense Geometry Format (DGF)

End result: The prefilter and DGF nodes allow for a smaller BVH footprint, a massively reduced load on the memory subsystem, and permit fast low precision parallel bulk processing of triangle intersection tests. As a result a sizeable speedup is achieved while area investment for ray tri intersect logic is reduced.
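To get a feel for where the footprint savings of block-based geometry compression come from, here is a minimal sketch. It is NOT the actual DGF bit layout: it only assumes the general idea of storing a block's vertices as small fixed-width offsets from a per-block anchor on a quantization grid. All names (`encode_block`, `decode_block`, `payload_bits`, `grid_exp`) are made up for illustration.

```python
def encode_block(vertices, grid_exp):
    """Quantize vertices to a grid of spacing 2**grid_exp and store them
    as unsigned offsets from a per-block anchor (assumed simplification)."""
    scale = 2.0 ** grid_exp
    quantized = [tuple(round(c / scale) for c in v) for v in vertices]
    # Anchor = per-axis minimum, so every offset is non-negative.
    anchor = tuple(min(q[a] for q in quantized) for a in range(3))
    offsets = [tuple(q[a] - anchor[a] for a in range(3)) for q in quantized]
    # Bits needed per axis to hold the largest offset in this block.
    bits = tuple(max(1, max(o[a] for o in offsets).bit_length())
                 for a in range(3))
    return {"grid_exp": grid_exp, "anchor": anchor,
            "bits": bits, "offsets": offsets}


def decode_block(block):
    """Rebuild floating-point positions from anchor + offsets."""
    scale = 2.0 ** block["grid_exp"]
    anchor = block["anchor"]
    return [tuple((anchor[a] + o[a]) * scale for a in range(3))
            for o in block["offsets"]]


def payload_bits(block):
    """Bits spent on vertex offsets only (anchor and header not counted)."""
    return sum(block["bits"]) * len(block["offsets"])


# Three vertices that already sit on a 0.25 grid, so the round trip is exact.
verts = [(1.0, 2.0, 3.0), (1.25, 2.5, 3.0), (1.5, 2.0, 3.75)]
block = encode_block(verts, grid_exp=-2)
restored = decode_block(block)
fp32_bits = len(verts) * 3 * 32
```

In this toy case the offsets need 2 bits per axis, so the vertex payload is 18 bits versus 288 bits for three fp32 positions. Real meshes won't compress anywhere near that well, and a real format also has to store connectivity, but the mechanism behind the smaller BVH footprint is the same.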

Multiple Ray Tracing Patent Filings

One covers configurable convex polygon ray/edge testing, which shares edge test results between polygons and so eliminates duplicate intersection tests. This has the following benefit:

"By efficiently sharing edge test results among polygons with shared edges, inside/outside testing for groups of polygons can be made more efficient."

It can be implemented at full or reduced precision and makes ray tracing more cost-effective.
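The edge-sharing idea can be sketched in a few lines. This is my own illustration, not the patent's circuit: it assumes the triangles have already been projected into a 2D ray space where the ray passes through the origin, and it only does the inside/outside part (no depth test). `edge_fn` and `hit_mask` are hypothetical names.

```python
def edge_fn(a, b):
    """Signed-area test of the origin against the directed 2D edge a->b."""
    return a[0] * b[1] - a[1] * b[0]


def hit_mask(points2d, triangles):
    """Inside/outside test for indexed triangles, evaluating each
    undirected edge once and reusing the result with a sign flip."""
    cache = {}
    evaluations = 0

    def signed_edge(i, j):
        nonlocal evaluations
        key = (min(i, j), max(i, j))
        if key not in cache:
            evaluations += 1
            cache[key] = edge_fn(points2d[key[0]], points2d[key[1]])
        value = cache[key]
        # Traversing the edge in the opposite direction negates the result.
        return value if (i, j) == key else -value

    hits = []
    for i, j, k in triangles:
        e0, e1, e2 = signed_edge(i, j), signed_edge(j, k), signed_edge(k, i)
        # The origin is inside if it lies on the same side of all three edges.
        inside = (e0 >= 0 and e1 >= 0 and e2 >= 0) or \
                 (e0 <= 0 and e1 <= 0 and e2 <= 0)
        hits.append(inside)
    return hits, evaluations


# A quad split into two triangles that share the diagonal 0-2.
points = [(-1, -1), (3, -1), (3, 2), (-1, 2)]
tris = [(0, 1, 2), (0, 2, 3)]
hits, evaluations = hit_mask(points, tris)
```

For this quad the shared diagonal is evaluated once, so five edge tests are done instead of six. On a dense mesh, where most edges are shared by two triangles, the saving approaches half of the edge tests.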

Three other patent filings leverage displaced micro-meshes (DMMs) and an accelerator unit (AU) that creates them.
I cannot figure out how this DMM implementation differs from NVIDIA's now deprecated DMM implementation in Ada Lovelace, but it sounds very similar although some differences are probably to be expected.
IDK what benefits are to be expected here except perhaps lower BVH build cost and size.

Streaming Wave Coalescer (SWC)

The Streaming Wave Coalescer implements thread coherency sorting similar to Intel's TSU and NVIDIA's SER implementations. It does this by using sorting bins and hard keys to sort divergent threads across waves following the same instruction path, thereby coalescing the threads into new waves.
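As a software analogy for the sorting described above (the real thing is a hardware circuit, and this is only my reading of it): threads from divergent waves are dropped into bins by a key standing in for "which instruction path comes next", and new waves are built from each bin. The wave size of 4 and all names are hypothetical.

```python
from collections import defaultdict

WAVE_SIZE = 4  # hypothetical; kept tiny so the example is readable


def coalesce(waves, wave_size=WAVE_SIZE):
    """Regroup threads so every new wave holds threads with the same key.

    waves: list of waves, each a list of (thread_id, key) pairs.
    Returns a list of (key, [thread_ids]) for the new waves.
    """
    bins = defaultdict(list)
    for wave in waves:
        for thread_id, key in wave:
            bins[key].append(thread_id)

    new_waves = []
    for key in sorted(bins):
        threads = bins[key]
        # Cut each bin into full-width waves (the last one may be partial).
        for start in range(0, len(threads), wave_size):
            new_waves.append((key, threads[start:start + wave_size]))
    return new_waves


def divergent_passes(waves):
    """A wave has to run once per distinct key among its threads."""
    return sum(len({key for _, key in wave}) for wave in waves)


# Two waves whose threads alternate between shader paths "A" and "B".
before = [
    [(0, "A"), (1, "B"), (2, "A"), (3, "B")],
    [(4, "B"), (5, "A"), (6, "B"), (7, "A")],
]
after = coalesce(before)
passes_before = divergent_passes(before)
passes_after = len(after)  # each coalesced wave follows a single path
```

In the toy input each original wave has to run both paths, four passes in total, while the two coalesced waves each run one. The price is that threads move to different lanes, which is where the state spilling mentioned next comes in.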

The spill-after programming model offers developers granular control over when and how thread state is spilled to memory when execution is reordered to different lanes. This helps avoid the excessive cache usage and memory accesses that would otherwise cause large latency increases and costly front-end stalls when leveraging SWC.

Just like SER, the SWC would help boost path tracing performance, although the implementation looks different and appears to be enabled by default.

Local Launchers and Work Graph Scheduler

End result:
#1 Decentralized local scheduling: A decentralized GPU scheduling architecture that delegates scheduling to the lowest possible level in the scheduling hierarchy, handing almost complete scheduling autonomy to the Shader Engines (via WGS) and allowing WGPs to launch their own work. This improves scheduling latency and allows much more fine-grained scheduling.
#2 Bottom-up scalable architecture: This is a bottom-up instead of top-down GPU scheduling paradigm. Everything operates on the assumption that local knows best, although brakes are built into the system: a higher-level scheduler takes control if a local scheduler is overloaded or can't feed its WGPs properly. Since each SE functions as its own GPU core, scaling is no longer dictated by the scheduling capabilities of the global processor but by how quickly it can prepare work and do load balancing across SEs.
#3 A boon for chiplet-based GPUs: Preparing work in a global shared mailbox and doing some load balancing across SEs is far less demanding than micromanaging everything. As a result, wider GPU designs should benefit the most, and for chiplet-based architectures the speedup could be even greater due to the latency mitigation and bottom-up scheduling paradigm.
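The three points above boil down to "local queues first, shared mailbox only for balancing". A toy model of one balancing step, with the caveat that the class, the `capacity` threshold and the spill/pull policy are my own assumptions and not taken from the patents:

```python
from collections import deque


class ShaderEngine:
    """Stand-in for one SE with its own local scheduler queue."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # hypothetical overload threshold
        self.queue = deque()


def balance(engines, mailbox):
    """One load-balancing step through the shared mailbox."""
    # Overloaded local schedulers hand their excess work to the mailbox.
    for se in engines:
        while len(se.queue) > se.capacity:
            mailbox.append(se.queue.pop())
    # The least-loaded engines pull from the mailbox first.
    for se in sorted(engines, key=lambda e: len(e.queue)):
        while mailbox and len(se.queue) < se.capacity:
            se.queue.append(mailbox.popleft())


# SE0 is flooded with six work items, SE1 is idle.
se0 = ShaderEngine("SE0", capacity=4)
se1 = ShaderEngine("SE1", capacity=4)
se0.queue.extend(range(6))
mailbox = deque()
balance([se0, se1], mailbox)
```

After the step SE0 keeps four items and SE1 has picked up the other two. The global side never touches individual work items here, it only moves overflow around, which is the reason the scheme should scale better on wide or chiplet-based designs.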

A Few Important Patent Filings

The RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE patent filing lets general-purpose shaders emulate fixed-function HW and take over when a fixed-function bottleneck occurs.

Another patent filing, ACCELERATED DRAW INDIRECT FETCHING, leverages fixed-function hardware (an accelerator) to speed up indirect fetching, lowering computational latency so that "...different types of aligned or unaligned data structures are usable with equivalent or nearly equivalent performance."

 

marees

Golden Member
Apr 28, 2024
1,516
2,116
96

TL;DR for patents

Dense Geometry Format/DGF:
Block-based geometry compression. Reduces VRAM usage and BVH size, and reduces RT's load on the memory and cache subsystems. Leveraging DGF nodes alongside prefiltering nodes to bulk-process triangles results in a significant speedup of triangle intersection testing at a lower area overhead than the current method.

Configurable convex polygon ray/edge testing: Allows sharing of edge test results between polygons, eliminating duplicate inside/outside testing.

DMM patents: Three patents about AMD's Displaced Micro Meshes implementation. Looks related to NVIDIA's deprecated DMM implementation in Ada Lovelace. It uses bounding prisms on top of interpolated DMM (made from base triangle) to find ray intersections and alternatively subprisms within the prism.

Streaming Wave Coalescer (SWC) circuit: Does thread coherency sorting of divergent threads, plus a spill-after programming model for devs to maximize the benefits of SWC. Accomplishes the same thing as NVIDIA's SER and Intel's TSU. Very important for path tracing.

Workgroup self-launch: Local launchers allow workgroup processors (WGPs) to launch work on their own, independent of the Shader Programming Interface (SPI). They also maintain queues and handle resource management on their own but can lease resources from the SPI.

Work Graph Scheduler (WGS): Shader Engine (SE) level scheduler that operates independently of the global scheduler except for a shared mailbox (global work item store), which enables load balancing across SEs. WGS results in finer-grained scheduling and lower latency, improving performance.

If a WGS is overloaded or the WGPs within an SE are underutilized, work items are migrated to other WGS/SEs via the global scheduler and the global data share/mailbox, ensuring load balancing across SEs.
In addition, one Asynchronous Dispatch controller operates under the WGS within each SE. It builds waves and launches work for the WGPs within that SE.

The WGS change allows the GPU to be far more scalable, as each scheduling domain is limited to the group of WGPs within an SE. The global scheduler's only job is to prepare work for the SEs and ensure even load balancing across Shader Engines. Scaling efficiency is no longer dictated by the frontend's ability to schedule for the entire GPU but by how quickly it can load and prepare work items and do load balancing.

The scheduling changes proposed in patents are quite significant. They could be beneficial to branchy code such as work graphs and WGS would probably enable superior scaling for wider GPUs.

Fixed function emulation: Shaders could emulate fixed-function HW and take over via a reconfigurable virtual graphics and compute processor pipeline when FF HW becomes overwhelmed.

Accelerated indirect draw fetching: Using accelerators would help speed up indirect draw fetching, lowering computational latency and making data structure alignment no longer important for performance.

 

Keller_TT

Member
Jun 2, 2024
146
170
76
My guess:

PS6 = 5070 ti == cut down xx70 version of RDNA 5 === xx60 xt version of RDNA 6
GPU wise, I think it'll be approx 9070, maybe with a bit better RT. The CPU gets a big bump from Zen 2 to Zen 6: I'm thinking approx. 7800X3D considering clocks and SoC TDP.

A 5070 Super with 18GB and up to 10% perf boost would probably be a close approximation, paired with a 7800X3D. PS5 is ~3700X + 2070S.

Plenty good for a console, with all the PlayStation tooling. Let 'em bank great power efficiency and hit PS5 launch prices (with a disc please). Sony will likely get a better launch reception than with the PS5.
 
Reactions: Tlh97

Win2012R2

Golden Member
Dec 5, 2024
1,178
1,206
96
I think it'll be approx 9070, maybe with a bit better RT.
There is no way RT will be "a bit better" - it will need to be x4 at least for PS6 to last another 7-8 years (that's till 2035, we should have flying cars running on garbage by then)

Big question is what the memory will be like - 24 GB VRAM probably at best, and if we're lucky some extra slower LPDDR6. How else will they run even small AI models if they don't have the RAM? This thing is supposed to last for a while.
 
Reactions: Tlh97 and Kryohi

Saylick

Diamond Member
Sep 10, 2012
3,975
9,310
136
That's pretty unrealistic, esp given the rumor of it also being 160 W for the system power draw.
Kepler also did say that the perf/W gains are overly optimistic, so if 9070 XT performance is achieved, it likely won't be as low as 160W.
 
Reactions: Tlh97

gdansk

Diamond Member
Feb 8, 2011
4,434
7,471
136
That would imply if AMD created a GPU with twice the hardware as the PS6 GPU, they’d be able to hit 5090 levels of performance in a 350W TDP.
It doesn't imply anything about scaling up.
It would have to be about 35-40% more efficient than their existing N48 chop.
If there is a die shrink and some new IP it isn't unheard of.