Discussion RDNA 5 / UDNA (CDNA Next) speculation


branch_suggestion

Senior member
Aug 4, 2023
826
1,805
106
 

marees

Golden Member
Apr 28, 2024
1,752
2,381
96
Too much connecting of (unrelated) dots based on just a LinkedIn profile
 

Bigos

Senior member
Jun 2, 2019
204
519
136
I am not buying the "new CU is the old WGP renamed" narrative. It would imply huge dies without enough memory bandwidth. For example, AT0 compared to N48 would have 3x the SIMDs with just 2x the bus width. This goes against the narrative that RDNA5 CUs are "stronger".

What I am thinking right now is that RDNA5 makes better use of VOPD and thus can dual-issue operations more often, which improves utilization of the SIMDs. So someone thought "great, that means there are 2x CUs now, as one CU can do 2x the work, right?", and this is how the 2x CU number came about. It's the return of the RDNA3 rumors where people just doubled everything because of VOPD.

If that is right, AT2 will have 40 CUs and AT0 will have 96 CUs, not WGPs. These CUs will have the same theoretical FLOPS per clock as RDNA4. However, with better VOPD utilization, AT2 could land around N48 performance, and AT0 would be 2.4x faster (though probably with lower clocks, and scaling is never perfect, so it will be slower than that FPS-wise), which is 5090-and-beyond territory.

Note that I am 100% speculating right here and can be proven wrong. I also might not have read all of the posts here carefully enough and this might already be common knowledge.
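A rough back-of-the-envelope sketch of the scaling argument above (purely illustrative; the CU counts are the rumored ones, while the clocks and dual-issue utilization factors are made-up assumptions, not leaked figures):

```python
# Toy model of the "same CUs, better VOPD utilization" argument.
# All inputs are assumptions for illustration, not leaked specs.

def relative_throughput(cus, dual_issue_util, clock_ghz):
    """Relative FP32 throughput: CUs x (1 + fraction of issue slots that dual-issue) x clock."""
    return cus * (1.0 + dual_issue_util) * clock_ghz

n48 = relative_throughput(cus=64, dual_issue_util=0.2, clock_ghz=3.0)  # RDNA4 baseline, VOPD rarely hits
at2 = relative_throughput(cus=40, dual_issue_util=0.8, clock_ghz=3.0)  # hypothetical RDNA5 mid die
at0 = relative_throughput(cus=96, dual_issue_util=0.8, clock_ghz=2.8)  # hypothetical big die, lower clocks

print(f"AT2 vs N48: {at2 / n48:.2f}x")  # lands near 1x with these made-up utilization numbers
print(f"AT0 vs AT2: {at0 / at2:.2f}x")  # ~2.4x from the CU count, minus the clock deficit
```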
 
  • Like
Reactions: Tlh97

basix

Senior member
Oct 4, 2024
249
504
96
You are missing faster GDDR7 in the SIMD vs. bus width consideration. You get 3x the SIMD width together with 3.2x the memory bandwidth compared to N48 (2x bus width, 20 Gbps GDDR6 vs. 32 Gbps GDDR7). I do not see a conflict there.

RDNA5 should get more bandwidth-efficient as well (bigger and unified cache at the CU level, enhanced out-of-order execution capabilities, VOPD and other enhancements). This creates room for "stronger CUs" (i.e. more performance per CU) without even higher memory bandwidth.

Another topic regarding AT2:
First it was speculated to feature 36 CU. Now many people assume 40 CU (due to @Kepler_L2's latest drawings). I somehow find that intriguing, because AT1 was cancelled.
- Bump AT2 from 36 to 40 CU -> create more space between AT2 and AT3 (similar to GB205 vs. GB203 -> 50 SM vs. 84 SM) // narrow the performance gap to what AT1 could have been (96 CU) and remove one chip (AT1) from the lineup
- Bump GDDR7 speed from 32 Gbps to 36 Gbps for the top-end AT2 SKU to equalize FLOPS vs. memory bandwidth (quick math sketched below)
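The bus width and pin-rate arithmetic behind that, as a quick sketch (the bus widths and pin rates are the rumored/assumed figures from this thread, not confirmed specs):

```python
# Memory bandwidth = bus width (bits) / 8 * pin rate (Gbps), in GB/s.
def bandwidth_gbs(bus_bits: int, gbps: float) -> float:
    return bus_bits / 8 * gbps

n48 = bandwidth_gbs(256, 20.0)  # N48 / 9070 XT: 256-bit GDDR6 @ 20 Gbps -> 640 GB/s
at0 = bandwidth_gbs(512, 32.0)  # rumored AT0: 512-bit GDDR7 @ 32 Gbps -> 2048 GB/s

print(f"AT0 vs N48 bandwidth: {at0 / n48:.1f}x")       # 3.2x, versus ~3x the SIMDs
print(f"32 -> 36 Gbps pin-rate bump: {36 / 32:.3f}x")  # +12.5% at the same bus width
```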
 
Last edited:

dangerman1337

Senior member
Sep 16, 2010
384
45
91
That means it's either going to be a lackluster generational increase for the 9070 XT upgrade, or a very expensive 1090 XT unless they have actually finally gone for chiplets.
Seems to be a "Graphics Memory Die" (aka that is AT0, 2, 3, 4 that's paired with an associated die from a cheap dGPU media die to a elaborate Console SoC allowing economies of scale but not going right now into risky.

Honestly, if AMD is doing 512-bit GMDs that are 700mm2 and over, then they're going for the performance crown vs. the GB202 successor. And it'd be embarrassing for AMD to pitch RDNA5 as "here's a big and revolutionary leap from us... but nah, we ain't going for the high end, unlike what we did with RDNA 2".
You are missing GDDR7 from the SIMD vs. bus width consideration. So you get 3x SIMD-Width together with 3.2x memory bandwidth compared to N48. I do not see a conflict there.

RDNA5 should get more bandwidth-efficient as well (bigger and unified cache at the CU level, enhanced out-of-order execution capabilities, VOPD etc. enhancements)
Yeah, if it's 36 Gbps GDDR7 with RDNA 5, that indicates they're going for high performance on the GMDs that have it. Why have GDDR7 on a 512-bit bus if you're going to do a measly 45-55% performance jump over the 9070 XT? A 512-bit bus + 36 Gbps GDDR7 is 3.6x the bandwidth of the 9070 XT if AMD decides to release a full-fat gaming variant of AT0 with 48GB of VRAM. That doesn't sound like a "normal" apples-to-apples node-shrink generational leap for the fastest card.
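For reference, the 3.6x figure just falls out of bus width times pin rate (the AT0 configuration here is the rumored one, not a confirmed spec):

```python
# 9070 XT: 256-bit GDDR6 @ 20 Gbps vs. a hypothetical full-fat AT0: 512-bit GDDR7 @ 36 Gbps
n48_gbs = 256 / 8 * 20  # 640 GB/s
at0_gbs = 512 / 8 * 36  # 2304 GB/s
print(f"{at0_gbs / n48_gbs:.1f}x")  # 3.6x
```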
 

MrMPFR

Member
Aug 9, 2025
103
207
71
The Chiphell guy pretty much concurs with everything said so far, 4 dies, 2027 launch, very specifically no AT1.
Oh and also the doubling of CU size.
Can you share links to the specific comments? Couldn't find them.

Edit: Disputed info about cache hierarchy characteristics of NVIDIA and AMD. Don't take this as fact. Skip to #1201 for the quote from the Volta tuning guide.

The L0+LDS merger is long overdue imo. NVIDIA merged the texture and shared caches into L1 with Turing; Maxwell and Pascal had a unified texture (really a dcache)/L1 cache plus separate shared memory. It seems the local SRAM stores define what a core is more than the actual processing blocks do. The NVIDIA SM has been partitioned into four processing blocks, each with its own warp scheduler, since Maxwell in 2014.

The RDNA4 CU is also split into two partitions, so within a WGP that's four in total. Merging the L0s into one would be enough to qualify a WGP as a CU, but AMD obviously wants a shared LDS+L0 like NVIDIA. This is simplified and might be misleading: it's not about merging separate caches, it's likely about permitting an underlying combined data cache to be more flexible. Slim chance of a flexible cache like Apple M3, but even then the VGPR would still be partitioned for each SIMD32, though the VGPR size could be variable.
Unless AMD shrinks the new CU partitions in half (aligning with Turing and later) I just can't see how MLID's CU numbers would be possible. For reference, using Ampere's bogus math AMD has 256 ROCm cores per WGP right now; NVIDIA has 128 CUDA cores per SM.
IDK how RDNA 4's CU partitions work, but I would think they can do either Wave64 or Wave32 x 2 based on the specs (this might be incorrect). If RDNA 5 shrinks the partitions in half, then maybe AMD might do something insane like Wave32 or Wave16 x 2 to brute-force PT and branchy code. Or alternatively, if they don't shrink the CUs, maybe a Wave32 + Wave16 x 2 mode, or perhaps a Wave24 x 2 + Wave16 mode. They could get really creative here with a sophisticated warp scheduler coupled with local-launcher functionality like in the Kepler patent, except even better: Wave32 x 2 / Wave16 x 4 in the big design, or Wave32 / Wave16 x 2 in the small design. The big-partition design (WGP = CU) has more flexibility here.

This is far fetched speculation so please don't take it too seriously.

What are workgraphs inherently good at? Branchy code, so why not align the dispatch mode with this (fewer threads per warp)? If AMD fixed the scheduling (which would require significant area investment) then this would be extremely beneficial to path tracing and branchy code in general. Perhaps workgraphs are good enough on their own, and then in addition to SWC AMD could implement clever methods like ray coherency sorting, similar to Imagination Technologies' Packet Coherency Gatherer, which would address the incoherence at its source instead of relying on a bandaid (thread coherency sorting).
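A toy illustration of why narrower waves help branchy code (purely a sketch; the divergence probability and wave widths are made-up parameters, not anything AMD has said):

```python
import random

def avg_lane_utilization(wave_width: int, p_taken: float, trials: int = 100_000) -> float:
    """Lock-step SIMD model: a divergent wave must issue both sides of the branch,
    so only half of the issued lane-slots do useful work; a coherent wave wastes nothing."""
    useful = 0.0
    for _ in range(trials):
        taken = sum(random.random() < p_taken for _ in range(wave_width))
        useful += 1.0 if taken in (0, wave_width) else 0.5
    return useful / trials

for width in (64, 32, 16):
    print(f"wave{width}: {avg_lane_utilization(width, p_taken=0.1):.2f} average lane utilization")
```

With a 10% per-lane branch probability the narrower wave keeps more of its lanes busy, which is the whole intuition behind "fewer threads per warp for branchy code". Real hardware (re-convergence, independent thread scheduling, SWC) is far more nuanced than this.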

What about the split registers and instruction caches?
- Ignore this, most likely misleading or false. I couldn't find any accurate info to suggest otherwise.

Still wonder what will happen with the instruction and scalar caches. NVIDIA seems to have unified both, since they only mention an L0-i cache; same thing for the VRF. I did see mentions of NVIDIA going scalar since Tesla G80 and superscalar with Turing. GCN was a scalar and vector hybrid, and RDNA4 doesn't change anything here. Essentially NVIDIA is SIMT and AMD is scalar + SIMD. That explains why AMD has to maintain separate VGPR + SGPR and scalar cache + instruction cache while NVIDIA can just have an L0-i cache + VRF.

As you can clearly see I don't understand the implications of any of this, so it's well above my level of comprehension, but I still think it deserves to at least be discussed.
RDNA5 is the biggest architectural change since GCN, so is there any chance they could theoretically go the NVIDIA route and go scalar instead of the vector+scalar hybrid? That would mean no more split VGPR+SGPR and no split instruction and scalar cache, right?
But wouldn't it also ruin BWC for consoles and be a huge pain for the AMD SW team? + CDNA 4 LLVM code still mentions SIMD for GFX12.5. Can someone please say no to ^ :tearsofjoy: Please don't, it's futile
 
Last edited:
  • Like
Reactions: Saylick

MrMPFR

Member
Aug 9, 2025
103
207
71
Correct me if I'm wrong, but aren't AIDs already used in CDNA3-4?

Looks like the patent is about avoiding signal routing congestion in central dies, load-balancing between AIDs, and runtime optimization based on variable silicon characteristics: non-fatal defects that impact the V/F curve (pJ/bit). They want everything to be consistent across all AIDs. Sounds like AMD is preparing a mega-MCM AI accelerator to counter Rubin Ultra.
Still can't see how this would benefit the gaming and professional markets. No way non-DC goes beyond 1 AID: AID + SEDs + everything else is already well above 1000mm^2 with the AID below the reticle limit.
 
  • Like
Reactions: basix

branch_suggestion

Senior member
Aug 4, 2023
826
1,805
106
Can you share links to the specific comments? Couldn't find them.
 

MrMPFR

Member
Aug 9, 2025
103
207
71

#4: "The structure has changed, and a CU is 128SP, which is not much different from the previous 192CU."
Caveat - Used Google translate

Minimum mode for LDS = 0kB (unlikely) according to a Kepler tweet, but assuming a conservative mode with 64KB of LDS, at least 128kB would be available and shared between two ray accelerators: +4x total, +2x per ray accelerator. If they match NVIDIA Blackwell, then the L0 per RA is now tripled, or a 6x larger data store shared between two RAs 🤯🚀
This is misleading since NVIDIA has had a unified texture/L1 cache since Maxwell. Not comparable to AMD's design, as I highly doubt LDS = shared memory. But larger L0 allocation modes should still benefit RT significantly.
 
Last edited:

MrMPFR

Member
Aug 9, 2025
103
207
71
20 vs 36Gbps pin rates on memory.
Come tf on.

they're going further than that.

Says who? Last time I checked Kepler isn't even 100% sure on L0+LDS change, although this seems like the only logical outcome.


Also keep running into people with the most absurd takes everywhere. They don't think any of the patents will happen, despite Kepler having already listed many of them, having said RDNA5 was Blackwell and then some, and that GFX13 is the largest overhaul since GCN and that AMD is changing everything.
One even thought RT BVH traversal in HW prob won't be coming to RDNA5.
Haven't seen Kepler specify exactly that a patented implementation = actual HW, except for one patent in my Google doc. Think it was #15 under general. The patent links were a response to a post about what could come beyond Blackwell, so not a leak confirming this is how AMD will do the stuff for sure.

So can I kindly ask for some clarity here @Kepler_L2? As of today, which of these are you absolutely certain (no speculation) will be in RDNA5? Don't answer that, it's too early to be certain. But let's keep the list here for posterity. Wonder how many of these will happen.
  1. Dedicated circuitry for BVH traversal processing
  2. Thread coherency sorting (SWC)
  3. Local launchers that launch work within each CU and can also deviate from the SE scheduling
  4. Distributed hierarchical SE level scheduling (WGS + ADC) and load balancing via "work stealing"
  5. Distributed geometry processing matching #4
  6. OMM support
  7. DGF support in HW - already confirmed by AMD
  8. 3-point geometry HW decompression (DGF + prefiltering decompression engine)
  9. Displaced Micro Meshes
  10. Quantized OBBs using platonic solids
  11. Low precision parallel INT ray/tri intersectors acting as a prefiltering step before full precision FP intersectors
  12. Fundamental HW optimizations for GPU workgraphs (#2-4 + maybe more?)
  13. Linear swept spheres
  14. Ray/edge shared testing
  15. RTX mega geometry like functionality built into HW
  16. Unified L0+LDS
  17. WGP deprecated and doubled CU (SIMD64 x 4)
Anything to add I might have missed? Kepler seems unwilling to spill any more beans for now so let's just wait till more LLVM hints and maybe patents that are certain appear.
 
Last edited:

adroc_thurston

Diamond Member
Jul 2, 2023
7,099
9,853
106
Says who? Last time I checked Kepler isn't even 100% sure on L0+LDS change, although this seems like the only logical outcome.


Also keep running into people with the most absurd takes everywhere. They don't think any of the patents will happen despite Kepler already listed many of them, said RDNA5 was Blackwell and then some, and that GFX13 is largest overhaul since GCN and that AMD is changing everything.
One even thought RT BVH traversal in HW prob won't be coming to RDNA5.

So can I kindly ask for some clarity here @Kepler_L2? As of today which of these are you absolutely certain (no speculation) will be in RDNA5?
  1. Dedicated circuitry for BVH traversal processing
  2. Thread coherency sorting (SWC)
  3. Local launchers that launch work within each CU and can also deviate from the SE scheduling
  4. Distributed hierarchical SE level scheduling (WGS + ADC) and load balancing via "work stealing"
  5. Distributed geometry processing matching #4
  6. OMM support
  7. DGF support in HW - already confirmed by AMD
  8. 3-point geometry HW decompression (DGF + prefiltering decompression engine)
  9. Displaced Micro Meshes
  10. Quantized OBBs using platonic solids
  11. Low precision parallel INT ray/tri intersectors acting as a prefiltering step before full precision FP intersectors
  12. Fundamental HW optimizations for GPU workgraphs (#2-4 + maybe more?)
  13. Linear swept spheres
  14. Ray/edge shared testing
  15. RTX mega geometry like functionality built into HW
  16. Unified L0+LDS
  17. WGP deprecated and doubled CU (SIMD64 x 4)
Anything to add I might have missed?
Man that's a lot of words and none of them mean anything.
All you need to know is that cachemem setup is wayyyy different (again).
That's all.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,811
1,544
136
Would be really nice if the cache hierarchies of AMD and NV finally standardized enough that, when writing a compute shader, you could assume that if it's performant on one it will also be performant on the other.
 

soresu

Diamond Member
Dec 19, 2014
4,105
3,566
136
Would be really nice if the cache hierarchy between AMD and NV finally standardized enough that, when writing a compute shader, you could assume that if it's performant on one that it will also be on the other.
Isn't that what intermediate representations and compilers are for?
 

RnR_au

Platinum Member
Jun 6, 2021
2,675
6,123
136
For examples of GPUs using mobile RAM...

Huawei Atlas 300V Pro 48GB
https://e.huawei.com/cn/products/computing/ascend/atlas-300v-pro
48GB LPDDR4x at 204.8GB/s
140 TOPS INT8, 70 TFLOPS FP16

Huawei Atlas 300i Duo 96GB
https://e.huawei.com/cn/products/computing/ascend/atlas-300i-duo
96GB or 48GB LPDDR4X at 408GB/s, supports ECC
280 TOPS INT8, 140 TFLOPS FP16

Both are single slot @ 150W. From an AI inference perspective, the Duo seems to be about half the speed of a 3090 in token generation, with slightly faster prompt processing - assuming the drivers are up to it (rough bandwidth math sketched below).

The Duo 96GB is selling today on Alibaba for ~US$1,500, lower with a sale.

Source
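The "about half the speed of a 3090 in token gen" observation lines up with a simple bandwidth-bound estimate (the model size and efficiency factor are arbitrary illustrative assumptions; real throughput depends on quantization, batching and drivers):

```python
# Single-batch LLM decode is usually memory-bandwidth bound:
# tokens/s ~= usable memory bandwidth / bytes read per token (~model size).
def tokens_per_second(bandwidth_gbs: float, model_gb: float, efficiency: float = 0.7) -> float:
    return bandwidth_gbs * efficiency / model_gb

model_gb = 20.0  # e.g. a ~30B-parameter model at ~5 bits/weight (illustrative)
rtx3090 = tokens_per_second(936.0, model_gb)  # 3090: 384-bit GDDR6X @ 19.5 Gbps
atlas   = tokens_per_second(408.0, model_gb)  # Atlas 300i Duo, per the spec above

print(f"3090: ~{rtx3090:.0f} tok/s")
print(f"Duo:  ~{atlas:.0f} tok/s ({atlas / rtx3090:.0%} of the 3090)")
```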
 

basix

Senior member
Oct 4, 2024
249
504
96
That is exactly what I am expecting to be one of the use cases for AT3. For AT4 potentially as well, but that chip is really tiny. Still, the 140 TOPS figure and 200 GByte/s (or even more with LPDDR6) of the Atlas 300V should be well within range.

A shared MID with dual video encoders/decoders and JPEG/MJPEG codec acceleration would also be neat here, as those features are often used for ML/AI use cases.
AT4 in monolithic fashion would for sure only feature one encoder/decoder block. Getting two blocks via MID enhances its ML/AI feature set.
 
Last edited:
  • Like
Reactions: marees

adroc_thurston

Diamond Member
Jul 2, 2023
7,099
9,853
106
That is exactly what I am expecting to be one of the use cases for AT3. For AT4 potentially as well but the chip really tiny. But the 140 TOPS figure and 200 GByte/s (or even more with LPDDR6) of the Atlas 300V should be well in range.

A shared MID with dual video encoders/decoders and JPEG/MJPEG codec acceleration would also be neat here, as those features are often used for ML/AI use cases.
AT4 in monolithic fashion would for sure only feature one encoder/decoder block. Getting two blocks via MID enhances its ML/AI feature set.
The bubble will be long dead by the time this launches.
They're gaming parts. For video games.
You know, drawing triangles and stuff.
 
  • Haha
Reactions: yottabit and marees

basix

Senior member
Oct 4, 2024
249
504
96
Primarily for drawing triangles, for sure. I am not denying that ;)

But if AMD sees an opportunity to use them in different ways, they could. And because of the LPDDRx memory rather easily with huge amounts of memory.

That is the main thing:
They could. Nobody says they will. This all depends on market conditions and customer requests.
In uncertain market conditions and trends it is wise to have an ace up the sleeve when required. Especially if the ace does not mean any additional effort (because the chip gets used for both Medusa parts and dGPUs anyway, supports huge amounts of memory and features decent ML/AI acceleration). If the ace is not required, leave it up the sleeve.

And I am not sure about the bubble bursting, though. There are many ML/AI use cases for professional workstations, which are getting deployed already. I am specifically not talking about LLMs but about other DNN applications in video / image / CAD processing.
Medusa and professional workstation cards with lots of memory, ML/AI acceleration, and the ability to still draw graphics on your screen are a very neat combination.
Using the very same chip or even card (you already have a workstation dGPU) in servers is simply no additional effort for AMD. Today, some CAD stuff already runs in the cloud, which is a perfect use case for such cards. There, memory capacity is often a bigger constraint than compute power.

You should recalibrate your "gaming GPUs are only for gaming" and "ML/AI = LLM only" points of view. The world is much bigger than that ;)
These additional use cases won't be the main sales volume of those chips, but additional cash for AMD with little effort.

And who knows, ML/AI research is very fast moving. Maybe those small and cheap to deploy LPDDRx cards could be ideal for an upcoming killer application we might not know yet. And not many companies will have a comparable product in their portfolio at that time.
We will see. And AMD has the luxury to wait and see, because the chips are designed and manufactured anyway.
 
Last edited:

gdansk

Diamond Member
Feb 8, 2011
4,570
7,682
136
Even if, in the worst case, the CUs aren't WGP-equivalent, the purported top-end part should be beyond the 4090.

So that's all looking good if true.
 
  • Like
Reactions: Tlh97

MrMPFR

Member
Aug 9, 2025
103
207
71
Man that's a lot of words and none of them mean anything.
All you need to know is that cachemem setup is wayyyy different (again).
That's all.

LDS+L0 is boring; NVIDIA introduced that with Volta in 2017. Misleading: NVIDIA's and AMD's implementations aren't 1:1, but the overall goal is probably the same as with Volta (see #1,201 and the Volta tuning guide). Even if it goes beyond that, it's still nothing new (M3 in 2023).

Yes they do. Read the patents; they go beyond SOTA. Very novel, and we might finally see AMD pioneer tech for once. Haven't seen that since GCN + Mantle.


Would be really nice if the cache hierarchy between AMD and NV finally standardized enough that, when writing a compute shader, you could assume that if it's performant on one that it will also be on the other.

Not just that, but also to make AMD's optimisation work easier. This is prob another TeraScale -> GCN moment in terms of making devs' jobs easier. Workgraphs also tie directly into that, so AMD will prob push them really hard next gen.

Excerpt from Turing whitepaper:

"Figure 6 shows how the new combined L1 data cache and shared memory subsystem of the Turing SM significantly improves performance while also simplifying programming and reducing the tuning required to attain at or near-peak application performance. Combining the L1 data cache with the shared memory reduces latency and provides higher bandwidth than the L1 cache implementation used previously in Pascal GPUs."
 
Last edited:

Makaveli

Diamond Member
Feb 8, 2002
4,975
1,571
136
It will need to be considerably faster than 5090 or it won't sell, it might have special edge in UE5 games which should be flooding the market by then
I'm not sure that is true.

A 5090 costs like $2,800-$3,500 CAD.

If they produce something 50% faster than a 7900 XTX, which would put it between a 4090 and a 5090 performance-wise, I would buy it in the $1,500 range.

Not everyone is interested in paying 3k CAD or 2k+ USD for a GPU.

Price is very important.
 
  • Like
Reactions: Tlh97