Discussion RDNA 5 / UDNA (CDNA Next) speculation

Page 41

adroc_thurston

Diamond Member
Jul 2, 2023
That is the exact reason I didn't believe the cache rumours for RDNA2. Going with a wider bus seemed like it would cost less die area to hit similar performance goals, so if AMD were willing to spend 500mm² of die area, a wider bus plus more shader cores looked better balanced overall than a huge chunk of MALL.
Memory speed scaling looked pretty bad when MALL was introduced.

Basically we ran out of shrinks to spam shader cores with before we ran out of bandwidth.
Oh the irony.
 
  • Haha
Reactions: GodisanAtheist

Timorous

Golden Member
Oct 27, 2008
Memory speed scaling looked pretty bad when MALL was introduced.

Basically we ran out of shrinks to spam shader cores with before we ran out of bandwidth.
Oh the irony.

Just compare N22 to N10: the same 40CU shader count, but a 192-bit bus plus 96MB of MALL vs a 256-bit bus. The ~35% die size increase led to a roughly 40% performance bump, but I can't help but feel that sticking with a 256-bit bus and just adding more shaders at the same ~335mm² die area N22 used would not have given more performance than going with the big L3 cache.

N23, too: it performs like the 5700 XT with half the bus thanks to the MALL, but the MALL takes up about the same area as 128 bits' worth of GDDR6 PHYs would. Would a 256-bit N23 without the MALL have performed worse?
 

ToTTenTranz

Senior member
Feb 4, 2021
The ~35% die size increase led to a roughly 40% performance bump, but I can't help but feel that sticking with a 256-bit bus and just adding more shaders at the same ~335mm² die area N22 used would not have given more performance than going with the big L3 cache.

The problem with a 256-bit GDDR6 bus is that it would either be lacking, with only 8GB of VRAM in a >$400 card, or you would have to pay all the way up to 16GB in a clamshell config.
 

poke01

Diamond Member
Mar 8, 2022
  • Like
Reactions: Saylick and marees

ToTTenTranz

Senior member
Feb 4, 2021
One thing I don't really get.

When implemented in a SoC/APU configuration, what does an AT3 and AT4 get paired with exactly? Is it that Medusa Point IOD that already has 4xZen6 + 4xZen6c in it?
If so, doesn't that IOD already have 128bit LP5X/LP6 memory controllers and PHYs?

What happens when you pair e.g. an AT4 that also has 64bit LP5X/LP6? Is the AT4 treated as a dGPU with only 64bit LP5X (which is probably too narrow without lots of cache)?

Or do the memory controllers now work in parallel in an UMA fashion, so that both the CPU and the AT4 GPU get access to 192bit?
Is it like Apple's A5X where the GPU gets access to all the channels but the CPU can't? In that case, would the CPU only be able to access the 128bit in the IOD while the AT4 accesses its own 64bit + the IOD's 128bit?

Also, can the Medusa Point IOD pair with an AT3/4 and a 12 core Zen6 CCD at the same time?



If AT3 can be a client of the IOD's memory controllers as well as its own, it could result in a weird case where the dGPU version has less memory bandwidth than the APU version, possibly leaving the dGPU with comparatively lower performance.
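As a back-of-the-envelope sketch of the UMA question above (assuming an LPDDR5X-8533 speed grade, which is purely illustrative; actual Medusa memory speeds aren't known):

```python
# Peak DRAM bandwidth in GB/s: bus width in bits / 8 bits per byte,
# times the transfer rate in MT/s, divided by 1000.
def peak_gbps(bus_width_bits: int, mts: int) -> float:
    return bus_width_bits / 8 * mts / 1000

iod_only = peak_gbps(128, 8533)   # the IOD's 128-bit alone: ~136.5 GB/s
at4_only = peak_gbps(64, 8533)    # AT4's own 64-bit alone: ~68.3 GB/s
uma_total = peak_gbps(192, 8533)  # aggregated 192-bit UMA: ~204.8 GB/s
```

Whether AT4-as-dGPU is stuck on the 64-bit figure or gets the aggregate is exactly the open question.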
 

Timorous

Golden Member
Oct 27, 2008
The problem with a 256-bit GDDR6 bus is that it would either be lacking, with only 8GB of VRAM in a >$400 card, or you would have to pay all the way up to 16GB in a clamshell config.

That is for a 128-bit bus. 256-bit with 2GB chips supports 16GB as standard, or 32GB with clamshell.
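The capacity options follow directly from GDDR6's 32-bit-per-chip interface; a quick sketch of the arithmetic (chip densities are the 1GB/2GB parts of the RDNA2 era):

```python
# GDDR6 chips each have a 32-bit interface, so chip count = bus width / 32.
# Clamshell mode puts two chips on each 32-bit channel, doubling capacity.
def vram_gb(bus_width_bits: int, chip_gb: int, clamshell: bool = False) -> int:
    chips = bus_width_bits // 32
    if clamshell:
        chips *= 2
    return chips * chip_gb

vram_gb(256, 2)                  # 16 GB standard on 256-bit
vram_gb(256, 2, clamshell=True)  # 32 GB clamshell
vram_gb(192, 2)                  # 12 GB, the N22 / 6700 XT config
vram_gb(128, 2)                  # 8 GB, or 16 GB clamshell on 128-bit
```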

Oh nyo it's smaller.
N33 even moreso.

251mm² for N10, 237mm² for N23. It is not that much of a saving.
 

Josh128

Golden Member
Oct 14, 2022
Lots of SoCs. Interesting that Medusa Halo and Medusa Premium have separate dies - for not exactly high volume products. Probably only makes sense (from volume perspective) if these are shared with consoles.

Medusa Halo should also have optional CCD, which would mean 3 chip CPU.

MLID said all of the CCDs are V-Cache compatible. Which would leave Medusa Premium one SKU without V-Cache capability.
It would make a lot of sense if the different GPU dies could also stand alone as discrete low-end graphics cards... it's possible that they somehow might. UDNA = Unified DNA, but weren't we just told Medusa Point was going to use RDNA 3.5 and not UDNA? Now it's all UDNA?
 
Last edited:

branch_suggestion

Senior member
Aug 4, 2023
When implemented in a SoC/APU configuration, what does an AT3 and AT4 get paired with exactly? Is it that Medusa Point IOD that already has 4xZen6 + 4xZen6c in it?
If so, doesn't that IOD already have 128bit LP5X/LP6 memory controllers and PHYs?
Presumably Medusa Premium/Halo have bespoke SoC dies. Medusa Point is not compatible with AT4 for the reasons listed.
Remember ATx nomenclature includes a new term, GMD.
Graphics Memory Die, it hosts the GPU and overall memory for the SoC.
When used as a dGPU, it is paired with an MID, Multimedia I/O Die.
When used as an APU, it is paired with a SoC die such as Magnus with AT2.
Also, can the Medusa Point IOD pair with an AT3/4 and a 12 core Zen6 CCD at the same time?
No. Wouldn't fit in FP10 for starters.
 
  • Like
Reactions: Joe NYC and marees

ToTTenTranz

Senior member
Feb 4, 2021
That is for a 128 bit bus. 256bit with 2GB chips supports 16GB as standard or 32GB with clamshell.

Nvidia always put only 8GB on all their 256bit GDDR6 (non-X) Geforce models:

TU106 - RTX 2060 Super, RTX 2070
TU104 - RTX 2070, RTX 2070 Super, RTX 2080, RTX 2080 Super
GA104 - RTX 3060 Ti, RTX 3070


In this particular case, you were talking about Navi 22 which is contemporary to the RTX 3060 Ti and RTX 3070 8GB, both of which were competing directly with the RX 6700XT 12GB.

My point was that one of the arguments in favor of AMD going with 192-bit GDDR6 + Infinity Cache instead of 256-bit GDDR6 was getting an adequate amount of VRAM without having to go all the way up to 16GB.
A 256-bit Navi 22 / 6700 XT like you suggested would get either 8 or 16GB of GDDR6. 8 is too little, 16 is too much, and 12GB ended up being adequate in the long term.


Krackan Point can easily be cranked down to sub 10W if they chose to.
I wish some OEM would make a Krackan Point handheld. Disable 2 of the big Zen5 cores (ending up with 2x Zen5 + 4x Zen5c), pair it with 128-bit LPDDR5X-8000, bring its power down to 12W, and it should shine compared to the Steam Deck.
This example shows a Krackan Point at 18W beating the Rembrandt Z2 Go at 40W, despite using slower memory:

 

Magras00

Member
Aug 9, 2025
Just compare N22 to N10: the same 40CU shader count, but a 192-bit bus plus 96MB of MALL vs a 256-bit bus. The ~35% die size increase led to a roughly 40% performance bump, but I can't help but feel that sticking with a 256-bit bus and just adding more shaders at the same ~335mm² die area N22 used would not have given more performance than going with the big L3 cache.

N23, too: it performs like the 5700 XT with half the bus thanks to the MALL, but the MALL takes up about the same area as 128 bits' worth of GDDR6 PHYs would. Would a 256-bit N23 without the MALL have performed worse?

MALL is all about perf/W; AMD sacrificed perf/area for perf/W with RDNA 2. Without the MALL, a 6900 XT with 16Gbps GDDR6 would have required a 512-bit bus, and the 7900 XTX would have needed 512-bit as well despite its 20Gbps memory.
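The bus-width equivalence works because traffic that hits the MALL never reaches DRAM. A toy model (the ~50% hit rate here is purely illustrative, not an AMD figure):

```python
def effective_bw(dram_gbps: float, hit_rate: float) -> float:
    # A hit fraction of `hit_rate` means DRAM only serves the misses,
    # so the same bus sustains 1 / (1 - hit_rate) times the traffic.
    return dram_gbps / (1.0 - hit_rate)

bw_256bit = 256 * 16 / 8   # 512.0 GB/s: the 6900 XT's 256-bit @ 16 Gbps
bw_512bit = 512 * 16 / 8   # 1024.0 GB/s: the hypothetical 512-bit card
# At an illustrative 50% MALL hit rate the 256-bit card matches it:
assert effective_bw(bw_256bit, 0.5) == bw_512bit
```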

This, plus clever power optimization (notice how some games in DF's launch tests use far fewer watts), in addition to 4N, is why Ada Lovelace was such a huge increase in perf/W vs Ampere.

But the MALL is not a very elegant solution for a consumer GPU and uses too much area. NVIDIA did it better from the start with the 40 series, and AMD will do the same next gen.
 
  • Like
Reactions: marees

marees

Golden Member
Apr 28, 2024
Commenting on NVIDIA's RTX Hair technology on X, tech-savvy user LeviathanGamer posted a list of tech they think would vastly improve ray tracing performance, including
fast Matrix Math, for which RDNA4 architecture laid a lot of groundwork,
  1. 2x Intersection Testing,
  2. unified LDS/L0 Cache,
  3. Dedicated Stack Management and Traversal HW,
  4. Coherency Sorting HW, and
  5. 3-coordinate decompression Geometry HW.

Commenting on this list, well-known AMD leaker Kepler L2 said on the NeoGAF forums that the next AMD GPU architecture that will power the PlayStation 6 and next-generation Xbox will have all this tech, and a lot more.

 

dangerman1337

Senior member
Sep 16, 2010
Hmmmm, so RDNA5 CUs have 2x the shader count vs RDNA4 CUs? I mean, those 96 CUs with a 512-bit bus and 36Gbps 3GB GDDR7 modules gotta be really stronk CUs. Though the 3rd diagram having more UMCs than the 2nd one does mean 256-bit LPDDR5X for AT3 & 4?

That said, no AT1 in between 0 & 2 is peculiar; maybe AMD is working on that with 320-bit and 120/60 CUs?
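Sanity-checking the rumored top config above (512-bit, 36Gbps, 3GB GDDR7 modules; these are the post's numbers, not confirmed specs), given GDDR7's 32-bit per-module interface:

```python
modules = 512 // 32             # 16 modules fill a 512-bit bus
capacity_gb = modules * 3       # 48 GB with 3GB modules
bandwidth_gbps = 512 * 36 / 8   # 2304.0 GB/s peak
```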
 
  • Like
Reactions: Joe NYC and marees

Saylick

Diamond Member
Sep 10, 2012
Commenting on NVIDIA's RTX Hair technology on X, tech-savvy user LeviathanGamer posted a list of tech they think would vastly improve ray tracing performance, including
fast Matrix Math, for which RDNA4 architecture laid a lot of groundwork,
  1. 2x Intersection Testing,
  2. unified LDS/L0 Cache,
  3. Dedicated Stack Management and Traversal HW,
  4. Coherency Sorting HW, and
  5. 3-coordinate decompression Geometry HW.

Commenting on this list, well-known AMD leaker Kepler L2 said on the NeoGAF forums that the next AMD GPU architecture that will power the PlayStation 6 and next-generation Xbox will have all this tech, and a lot more.

Wccftech literally either scrapes the various forums, or at least searches them using the names of known leakers, or there's someone among us who goes out of their way to tip them off. First it was sourcing Kepler with the RDNA 5 block diagrams here on AT Forums, and now they've sourced Kepler at NeoGAF. Who is the turncoat here... *rubs chin* (jk)

Hassan, if you ever read this, you and your website can go F yourself.
 

basix

Member
Oct 4, 2024
CDNA4 is 32KB L0 + 160KB LDS, CDNA5 is 448KB Shared L0/LDS
Is RDNA5 taking this even further like Apple's M3? https://developer.apple.com/videos/play/tech-talks/111375/ (from 11:37)
[...]
M3 and A17 Pro even has thread occupancy management to avoid cache spillovers to higher level caches. Maybe this is also part of how RDNA5 shrinks L2.

Interesting stuff. If I sum up the SRAM caches of 2x CDNA4 CUs (L1, LDS, instruction cache), I land at 512kB (if I am counting right). Because instructions can probably be shared, 448kB would be the number.

RDNA4 already incorporates dynamic / out-of-order register allocation (as M3 does). M3 then goes further and unifies its local caches into one big one, which we might now see on CDNA5 and RDNA5. But it seems that the register files do not get merged with the LDS and L0?
M3 also adds parallel FP16/FP32/INT execution. Not sure if RDNA would benefit that much from it, but thinking of work graphs and dynamic execution, such overlapping operation might make sense (if it is not already possible on RDNA).

To me it seems very reasonable that shared caches enhance the utilization rate. But the physical HW implementation might be more difficult and/or some latencies might degrade. N3P to the rescue, I assume ;)

Edit:
Maybe it is a 512kB SRAM macro: 448kB as the L1/LDS replacement and 64kB as a dedicated instruction cache?
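The sum above can be written out explicitly. The 32KB L0 and 160KB LDS are the CDNA4 figures quoted earlier; the 64KB instruction cache is an assumption that makes the totals line up:

```python
# Assumed per-CU SRAM sizes in KB. L0 and LDS come from the CDNA4 quote;
# the 64 KB instruction cache is an assumption, not a confirmed figure.
L0, LDS, ICACHE = 32, 160, 64

two_cu_private = 2 * (L0 + LDS + ICACHE)         # 512 KB, nothing shared
two_cu_shared_icache = 2 * (L0 + LDS) + ICACHE   # 448 KB, I-cache shared
```

This reproduces both the 512kB sum and the 448kB "instructions shared" figure.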
 
Last edited:

ToTTenTranz

Senior member
Feb 4, 2021
Presumably Medusa Premium/Halo have bespoke SoC dies. Medusa Point is not compatible with AT4 for the reasons listed.
What reasons?


Remember ATx nomenclature includes a new term, GMD.
Graphics Memory Die, it hosts the GPU and overall memory for the SoC.
When used as a dGPU, it is paired with an MID, Multimedia I/O Die.
When used as an APU, it is paired with a SoC die such as Magnus with AT2.


But the SoC die in an APU setup doesn't have a memory controller or PHYs? A SoC die paired with AT4 only gets access to 64-bit LPDDR5X?
 

Josh128

Golden Member
Oct 14, 2022
Wccftech literally either scrapes the various forums, or at least searches them using the names of known leakers, or there's someone among us who goes out of their way to tip them off. First it was sourcing Kepler with the RDNA 5 block diagrams here on AT Forums, and now they've sourced Kepler at NeoGAF. Who is the turncoat here... *rubs chin* (jk)

Hassan, if you ever read this, you and your website can go F yourself.
I don't know how fast they copypasta'd that info, but I know for a fact there are some forum trolls who lurk here without posting anything, and instead post things over there. Same for Twitter, Reddit, and NeoGAF.
 
  • Like
Reactions: Saylick and marees