Discussion RDNA 5 / UDNA (CDNA Next) speculation


adroc_thurston

Diamond Member
Jul 2, 2023
That is the exact reason I didn't believe the cache rumours for RDNA2. It seemed like going with a wider bus would cost less die area to achieve similar performance goals, so if AMD were willing to spend 500mm² of die area, then going with a wider bus and adding more shader cores would be better balanced overall than a huge chunk of MALL.
Memory speed scaling looked pretty bad when MALL was introduced.

Basically we ran out of shrinks to spam shader cores with before we ran out of bandwidth.
Oh the irony.
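
A rough way to see why a big MALL can stand in for bus width: the cache acts as a bandwidth amplifier, since DRAM only has to service the misses. A minimal sketch in Python, with purely illustrative GDDR6 speeds and hit rates (not AMD's actual figures):

```python
# Toy model: a last-level cache (MALL) amplifies effective bandwidth because
# DRAM traffic only has to cover cache misses. All numbers are illustrative.

def gddr6_bandwidth(bus_bits: int, gbps_per_pin: float = 16.0) -> float:
    """Raw GDDR6 bandwidth in GB/s for a given bus width."""
    return bus_bits * gbps_per_pin / 8

def effective_bandwidth(dram_gbs: float, hit_rate: float) -> float:
    """Effective bandwidth if a fraction `hit_rate` of requests never touch DRAM
    (assumes the cache itself is never the bottleneck)."""
    return dram_gbs / (1.0 - hit_rate)

wide_bus   = gddr6_bandwidth(256)                             # ~512 GB/s, no MALL
narrow_bus = gddr6_bandwidth(192)                             # ~384 GB/s
with_mall  = effective_bandwidth(narrow_bus, hit_rate=0.50)   # ~768 GB/s effective

print(f"256-bit, no cache:        {wide_bus:.0f} GB/s")
print(f"192-bit + MALL @50% hit:  {with_mall:.0f} GB/s effective")
```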
 

Timorous

Golden Member
Oct 27, 2008
Memory speed scaling looked pretty bad when MALL was introduced.

Basically we ran out of shrinks to spam shader cores with before we ran out of bandwidth.
Oh the irony.

Just compare N22 to N10. Same 40 CU shader count, but a 192-bit bus plus 96MB of MALL vs a 256-bit bus. The ~35% die size increase led to a roughly 40% performance bump, but I can't help but feel that sticking with a 256-bit bus and just adding more shaders within the same ~335mm² that N22 used would have given at least as much performance as going with the big L3 cache.

N23 too: it performs like the 5700 XT with half the bus thanks to the MALL, but the MALL takes up about the same area that another 128 bits' worth of GDDR6 PHYs would. Would a 256-bit N23 without it have performed any worse?
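
Putting the thread's own rough figures into a quick back-of-the-envelope (251mm² for N10, 335mm² for N22, and the ~40% uplift cited above, all treated as approximations):

```python
# Back-of-the-envelope for the N10 -> N22 comparison, using the rough figures
# quoted in this thread (areas in mm^2, performance normalised to N10 = 1.0).

n10_area, n22_area = 251.0, 335.0
n10_perf, n22_perf = 1.0, 1.40          # "roughly 40% performance bump"

area_increase = n22_area / n10_area - 1.0                      # ~ +33%
perf_per_area = (n22_perf / n22_area) / (n10_perf / n10_area)  # ~ 1.05x

print(f"Die area increase:     {area_increase:+.0%}")
print(f"Perf per mm^2 vs N10:  {perf_per_area:.2f}x")
```

On those rough numbers, perf per mm² moves only a few percent, which is essentially what the question above is probing.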
 

ToTTenTranz

Senior member
Feb 4, 2021
The ~35% die size increase led to a roughly 40% performance bump, but I can't help but feel that sticking with a 256-bit bus and just adding more shaders within the same ~335mm² that N22 used would have given at least as much performance as going with the big L3 cache.

The problem with a 256-bit GDDR6 bus is that it would either be lacking, with only 8GB of VRAM in a >$400 card, or you'd have to pay all the way up to 16GB in a clamshell config.
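
The capacity options follow directly from the bus width: GDDR6 uses one 32-bit channel per device, and clamshell puts two devices on each channel. A small sketch, assuming the standard 1GB and 2GB GDDR6 densities of that era:

```python
# VRAM capacity options for a GDDR6 bus: one 32-bit device per channel,
# and clamshell mode doubles the device count (and therefore capacity).

def vram_options(bus_bits: int, densities_gb=(1, 2)) -> dict:
    devices = bus_bits // 32
    options = {}
    for density in densities_gb:
        options[f"{density}GB chips"] = devices * density
        options[f"{density}GB chips, clamshell"] = devices * density * 2
    return options

for bus in (128, 192, 256):
    print(f"{bus}-bit:", vram_options(bus))
# 192-bit lands on 6 or 12GB, while 256-bit only offers 8 or 16GB (32GB clamshell).
```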
 


ToTTenTranz

Senior member
Feb 4, 2021
One thing I don't really get.

When implemented in a SoC/APU configuration, what do AT3 and AT4 get paired with, exactly? Is it that Medusa Point IOD that already has 4xZen6 + 4xZen6c in it?
If so, doesn't that IOD already have 128bit LP5X/LP6 memory controllers and PHYs?

What happens when you pair it with, e.g., an AT4 that also has 64-bit LP5X/LP6? Is the AT4 treated as a dGPU with only 64-bit LP5X (which is probably too narrow without lots of cache)?

Or do the memory controllers now work in parallel in a UMA fashion, so that both the CPU and the AT4 GPU get access to 192-bit?
Is it like Apple's A5X where the GPU gets access to all the channels but the CPU can't? In that case, would the CPU only be able to access the 128bit in the IOD while the AT4 accesses its own 64bit + the IOD's 128bit?

Also, can the Medusa Point IOD pair with an AT3/4 and a 12 core Zen6 CCD at the same time?



If AT3 can be a client of the IOD's memory controllers as well as its own, it could result in a weird case where the dGPU version has less memory bandwidth than the APU version, possibly resulting in lower comparable performance for the dGPU.
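
To put rough numbers on the scenarios in this question, here is a sketch assuming LPDDR5X-8533 on both the IOD's 128-bit and the AT4's 64-bit interfaces; the speed, and whether the controllers can aggregate at all, are pure assumptions:

```python
# Rough bandwidth for the (speculative) pairings discussed above.
# LPDDR bandwidth in GB/s = bus width in bits / 8 * data rate in GT/s.

def lpddr_bw(bus_bits: int, gts: float = 8.533) -> float:
    return bus_bits / 8 * gts

at4_alone = lpddr_bw(64)        # AT4 as a dGPU on its own 64-bit: ~68 GB/s
iod_alone = lpddr_bw(128)       # Medusa Point IOD's 128-bit:      ~137 GB/s
uma_total = lpddr_bw(64 + 128)  # if everything aggregates (UMA):  ~205 GB/s

print(f"AT4 on its own 64-bit:   {at4_alone:.0f} GB/s")
print(f"IOD's 128-bit alone:     {iod_alone:.0f} GB/s")
print(f"192-bit UMA aggregate:   {uma_total:.0f} GB/s")
```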
 

Timorous

Golden Member
Oct 27, 2008
The problem with a 256-bit GDDR6 bus is that it would either be lacking, with only 8GB of VRAM in a >$400 card, or you'd have to pay all the way up to 16GB in a clamshell config.

That is for a 128 bit bus. 256bit with 2GB chips supports 16GB as standard or 32GB with clamshell.

Oh nyo it's smaller.
N33 even moreso.

251mm² for N10, 237mm² for N23. It is not that much of a saving.
 

Josh128

Golden Member
Oct 14, 2022
Lots of SoCs. Interesting that Medusa Halo and Medusa Premium have separate dies for what are not exactly high-volume products. It probably only makes sense (from a volume perspective) if these are shared with consoles.

Medusa Halo should also have an optional CCD, which would mean a 3-chip CPU.

MLID said all of the CCDs are V-Cache compatible, which would leave Medusa Premium as the one SKU without V-Cache capability.
It would make a lot of sense if the different GPU dies can also stand alone as discrete low-end graphics cards... it's possible that they somehow might. UDNA = Unified DNA, but weren't we just told Medusa Point was going to use RDNA 3.5 and not UDNA? Now it's all UDNA?
 

branch_suggestion

Senior member
Aug 4, 2023
When implemented in a SoC/APU configuration, what do AT3 and AT4 get paired with, exactly? Is it that Medusa Point IOD that already has 4xZen6 + 4xZen6c in it?
If so, doesn't that IOD already have 128bit LP5X/LP6 memory controllers and PHYs?
Presumably Medusa Premium/Halo have bespoke SoC dies. Medusa Point is not compatible with AT4 for the reasons listed.
Remember that the ATx nomenclature includes a new term, GMD: Graphics Memory Die. It hosts the GPU and the overall memory for the SoC.
When used as a dGPU, it is paired with an MID, Multimedia I/O Die.
When used as an APU, it is paired with a SoC die such as Magnus with AT2.
Also, can the Medusa Point IOD pair with an AT3/4 and a 12 core Zen6 CCD at the same time?
No. Wouldn't fit in FP10 for starters.
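
None of this is confirmed, but the pairing rules as described in this post can be captured in a few lines; the die names below are just the rumoured ones already mentioned in the thread:

```python
# Toy model of the rumoured ATx packaging scheme described above (speculative):
# a GMD (Graphics Memory Die) carries the GPU plus the memory interface, and the
# product type depends on what it is paired with.

from dataclasses import dataclass

@dataclass
class Package:
    gmd: str        # e.g. "AT2", "AT3", "AT4"
    companion: str  # "MID" (Multimedia I/O Die) or an SoC die such as "Magnus"

    @property
    def product_type(self) -> str:
        return "dGPU" if self.companion == "MID" else "APU"

print(Package("AT2", "MID").product_type)      # dGPU
print(Package("AT2", "Magnus").product_type)   # APU, per the Magnus + AT2 example
```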
 

ToTTenTranz

Senior member
Feb 4, 2021
That is for a 128 bit bus. 256bit with 2GB chips supports 16GB as standard or 32GB with clamshell.

Nvidia always put only 8GB on all their 256-bit GDDR6 (non-X) GeForce models:

TU106 - RTX 2060 Super, RTX 2070
TU104 - RTX 2070 Super, RTX 2080, RTX 2080 Super
GA104 - RTX 3060 Ti, RTX 3070


In this particular case, you were talking about Navi 22, which is contemporary with the RTX 3060 Ti and RTX 3070 8GB, both of which competed directly with the RX 6700XT 12GB.

My point was that one of the arguments in favor of AMD going with 192-bit GDDR6 + Infinity Cache instead of 256-bit GDDR6 was getting an adequate amount of VRAM without having to go all the way up to 16GB.
A 256-bit Navi 22 / 6700XT like you suggested would get either 8 or 16GB of GDDR6. 8GB is too little, 16GB is too much, and 12GB ended up being adequate in the long term.


Krackan Point can easily be cranked down to sub 10W if they chose to.
I wish some OEM would make a Krackan Point handheld. Disable 2 of the big Zen5 cores (ending up with 2x Zen5 + 4x Zen5c), pair it with 128-bit LPDDR5X-8000, bring its power down to 12W, and it should shine compared to the Steam Deck.
This example shows a Krackan Point at 18W beating the Rembrandt Z2 Go at 40W, despite using slower memory:
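
The linked example itself isn't reproduced here, but the memory side of that pitch is easy to sanity-check. A quick sketch, assuming 128-bit LPDDR5X-8000 for the hypothetical Krackan handheld and the Steam Deck's commonly quoted 128-bit LPDDR5-5500:

```python
# Memory bandwidth for the hypothetical handheld above vs the Steam Deck.
# LPDDR bandwidth in GB/s = bus width in bits / 8 * data rate in GT/s.

def lpddr_bw(bus_bits: int, gts: float) -> float:
    return bus_bits / 8 * gts

krackan_handheld = lpddr_bw(128, 8.0)  # 128-bit LPDDR5X-8000 -> 128 GB/s
steam_deck       = lpddr_bw(128, 5.5)  # 128-bit LPDDR5-5500  ->  88 GB/s

print(f"Krackan handheld: {krackan_handheld:.0f} GB/s "
      f"(~{krackan_handheld / steam_deck:.2f}x the Deck)")
print(f"Steam Deck:       {steam_deck:.0f} GB/s")
```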