Discussion RDNA 5 / UDNA (CDNA Next) speculation


marees

Platinum Member
Apr 28, 2024
2,221
2,865
96
Yes, that’s correct. LLVM has a roughly 6-month release cycle. If AMD misses contributing or integrating for one release, the next official release opportunity would be about 6 months later. So missing a deadline can introduce a half-year delay for official inclusion.
So the question is what they hoped to achieve in these 6 months of H1 2026.

Some testing of hardware ?? why does that need LLVM updates ???
 

Win2012R2

Golden Member
Dec 5, 2024
1,323
1,363
96
Some testing of hardware ?? why does that need LLVM updates ???
Perfecting drivers? Only I don't think they committed anywhere near enough for that; maybe it's just somebody collecting a bonus for shipping LLVM enablement on time and on budget.
 

marees

Platinum Member
Apr 28, 2024
2,221
2,865
96

MrMPFR

Senior member
Aug 9, 2025
203
400
96
Missed the interesting TBIMR discussion in the Zen6 thread, so I'll move some of it here:

Well gfx13+ are TBIMR
Yup patents (see next post) confirm.

As is everything Nvidia since Maxwell.
You still have no real tile storage (or a real tiler) to distribute screenspace across.
Seems like AMD's implementation is more advanced.

It's entirely possible AMD's tiling/binning still isn't on Nvidia's level and that AMD won't have true TBIMR until gfx13.
IDK if it's true TBIMR but looks improved.

If gfx13 makes good improvement towards a (better) TBIMR we could expect decent efficiency gains (energy and memory bandwidth).
It does look promising, but I can only repeat what the patents say (see next post), not what it means and what the implications are.
 

MrMPFR

Senior member
Aug 9, 2025
203
400
96
TBIMR patents:
Found these patents a while ago but wasn't sure what to do with them. Now that Kepler has confirmed, I'll drop them here. Seems like one of Chris Brennan's last RDNA5 efforts before he left AMD and joined Meta:

#1 TBIMR:
- PPC buffers, per-tile queues managed by FF HW, reduced resource use and processing time
https://patentscope.wipo.int/search/en/detail.jsf?docId=US464435295

#2 TBIMR + pixel circuitry balancing:
- Improved load balancing
https://patentscope.wipo.int/search/en/detail.jsf?docId=US464435313

#3 TBIMR + per-tile depth pre-passes
- No need to repeat assembly and shading of primitives
https://patentscope.wipo.int/search/en/detail.jsf?docId=US464435298

#4 TBIMR + SE localized geometry + deferred attribute shading
- "reducing unnecessary computations and memory bandwidth usage"
https://patentscope.wipo.int/search/en/detail.jsf?docId=US464101837
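
For readers unfamiliar with binning, here is a minimal sketch of the general idea behind per-tile queues: each primitive's screen-space bounding box is mapped to the tiles it overlaps, and each tile keeps its own list in submission order. This is a generic toy illustration, not the mechanism described in the patents above. The 32-pixel tile size, the `bin_primitives` helper, and the sample primitives are all made up for the example.

```python
from collections import defaultdict

TILE = 32  # hypothetical tile size in pixels, chosen only for this example


def bin_primitives(primitives, tile=TILE):
    """Map each primitive's bounding box to the screen tiles it overlaps.

    primitives: list of (prim_id, (x0, y0, x1, y1)) with inclusive pixel bounds.
    Returns {(tile_x, tile_y): [prim_id, ...]} preserving submission order.
    """
    queues = defaultdict(list)
    for prim_id, (x0, y0, x1, y1) in primitives:
        for ty in range(y0 // tile, y1 // tile + 1):
            for tx in range(x0 // tile, x1 // tile + 1):
                queues[(tx, ty)].append(prim_id)
    return dict(queues)


# Made-up primitives: id plus screen-space bounding box.
prims = [
    ("A", (0, 0, 20, 20)),    # fits inside tile (0, 0)
    ("B", (10, 10, 40, 20)),  # straddles tiles (0, 0) and (1, 0)
    ("C", (70, 70, 90, 90)),  # fits inside tile (2, 2)
]
queues = bin_primitives(prims)
for tile_xy in sorted(queues):
    print(tile_xy, queues[tile_xy])
```

The dictionary only stands in for whatever per-tile storage real hardware would use. The point is that once primitives are grouped per tile, each tile's work touches a small, contiguous piece of the framebuffer, which is where the cache and bandwidth savings are expected to come from.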

TLPBB a thing?
There's also another rendering pipeline efficiency effort called TLPBB (Two-Level Primitive Batch Binning) that has five public patents, but I won't list these here unless Kepler confirms.
 

basix

Senior member
Oct 4, 2024
310
607
96

MrMPFR

Senior member
Aug 9, 2025
203
400
96
Isn't that the old news regarding chiplet based GPUs?

As far as I understand TLPBB and TBIMR could be combined.
Seems like the new TLPBB patents expand on the underlying idea significantly and have been modified to enhance the TBIMR pipeline. But as a standalone thing it seems to achieve the same goal: reduce unnecessary computations and memory bandwidth usage.
Here's one of them: https://patentscope.wipo.int/search/en/detail.jsf?docId=US425302144

In the meantime here's an interesting post from Imagination comparing their TBDR vs old school IMR (pre-Maxwell):
It seems like the proposed design from the TBIMR patents is a lot closer to TBDR, albeit still without the strict requirements and characteristics, and does look quite different from the simple TBIMR designs we've seen so far.

That is my understanding as well. As I said, it'll enhance the TBIMR pipeline by feeding it better inputs while also making the TLPBB pipeline compatible with deferred attribute shading and chiplet-friendly designs (local SE geo + pixel pipelines). The TBIMR pipeline already looks like a significant step up in efficiency, but as I said the five TLPBB patents enhance this further.
It will be interesting to see just how large the perf/watt and perf/BW impact of these changes is, but if it's anywhere close to TBDR characteristics then that's a massive win.

Just to be certain, I'll reiterate that TLPBB hasn't been confirmed, unlike TBIMR. There are also many more related patents that could improve further upon the design, but I won't flood the thread with them.

I have to note that this is based on my limited surface-level understanding. Maybe @basix or someone else can do a better job of explaining what those patents achieve?
 

basix

Senior member
Oct 4, 2024
310
607
96
All market players agree that memory bandwidth and data movement (energy efficiency) are the major concern in making chips faster (besides the slowdown of Moore's Law). So developing technologies, or at least enhancing existing ones, with that focus makes a lot of sense (TBIMR, TBDR, TLPBB, shared L0/L1 caches, universal compression, ...).
 

MrMPFR

Senior member
Aug 9, 2025
203
400
96
All market players agree that memory bandwidth and data movement (energy efficiency) are the major concern in making chips faster (besides the slowdown of Moore's Law). So developing technologies, or at least enhancing existing ones, with that focus makes a lot of sense (TBIMR, TBDR, TLPBB, shared L0/L1 caches, universal compression, ...).
More like the death of Moore's Law for caches and memory. PHY/analog scaling died multiple gens ago and cache scaling is moving at a snail's pace rn.

I've rewritten #2,209 after taking a closer look at the TBIMR + TLPBB patent derived implementation. In terms of efficiency it's likely closer to a TBDR implementation than partial TBIMR (Maxwell and later). This is very impressive and not what I expected at all.

In some of the related patents there's also a common theme: dedicated logic to get rid of unused data, deallocations without writing discarded data back to memory, and a method for overwriting the same physical page multiple times to save on cache and bandwidth use. The last one seems like a great fit for a tile-based renderer: the same physical page can be reused many times by overwriting it instead of doing expensive cache line invalidations. This is just smarter.
Hope that the design implements these changes as well, especially the last one. We'll see but GFX13 looks very impressive.
 

marees

Platinum Member
Apr 28, 2024
2,221
2,865
96
What will be the price of a Medusa Premium NUC? (It is effectively the specs of a 2027 Steam Machine.)

posted on pantherlake thread with more detail,

you can get Halo 392 full laptop 'ASUS TUF' for around 1500 bucks now

pantherlake machines START from 1500, up to 2500. DOA
 

Tachyonism

Junior Member
Jan 24, 2026
8
15
36
Whatever RTX6050 laptops will be targeting -$100.
They're discrete compete parts, 45/85W respectively.
If so, then won't people just buy the 6050 laptop, since AT4 is expected to land only between 3060 and 4060? According to you, the Medusa Premium (AT4) would come with premium pricing, whereas the 6050 is just a bottom-tier chip.
 

basix

Senior member
Oct 4, 2024
310
607
96
Depending on clock rates, IPC uplift and bandwidth efficiency of RDNA5, I expect the upper performance ceiling of AT4 to be at around RX 9060 / RTX 5060 and maybe even 9060 XT level. 24 vs. 28/32 CU. With the additional benefit that AT4 is not an 8GB GPU ;)
I am talking about desktop performance levels here (and AT4 in that form factor as well). AT4 in mobile form factor will be slower, so 4060 desktop might be a good guess.
 

marees

Platinum Member
Apr 28, 2024
2,221
2,865
96
Depending on clock rates, IPC uplift and bandwidth efficiency of RDNA5, I expect the upper performance ceiling of AT4 to be at around RX 9060 / RTX 5060 and maybe even 9060 XT level. 24 vs. 28/32 CU. With the additional benefit that AT4 is not an 8GB GPU ;)
I am talking about desktop performance levels here (and AT4 in that form factor as well). AT4 in mobile form factor will be slower, so 4060 desktop might be a good guess.
AMD's plan could well be to run ROCm on Medusa Premium and hawk it for $1500, until OpenAI goes bankrupt.
 

Kepler_L2

Golden Member
Sep 6, 2020
1,079
4,646
136
Depending on clock rates, IPC uplift and bandwidth efficiency of RDNA5, I expect the upper performance ceiling of AT4 to be at around RX 9060 / RTX 5060 and maybe even 9060 XT level. 24 vs. 28/32 CU. With the additional benefit that AT4 is not an 8GB GPU ;)
I am talking about desktop performance levels here (and AT4 in that form factor as well). AT4 in mobile form factor will be slower, so 4060 desktop might be a good guess.
Yeah 9060 XT perf at desktop power levels should be possible. For Medusa Premium it would be lower due to power limits and also sharing mem BW with CPU.
 

MrMPFR

Senior member
Aug 9, 2025
203
400
96
All market players agree that memory bandwidth and data movement (energy efficiency) are the major concern in making chips faster (besides the slowdown of Moore's Law). So developing technologies, or at least enhancing existing ones, with that focus makes a lot of sense (TBIMR, TBDR, TLPBB, shared L0/L1 caches, universal compression, ...).
Unless RDNA4 has TBB and TLPBB, it looks like Xclipse 960 may exceed the baseline and could have some significant customizations. From LinkedIn:
• Worked on the PBB (Primitive Batch Binning) module
Increases cache hit rate by using the spatial locality across primitives in a given screen space
• Worked on Two-Level PBB (TLPBB) – a mobile GPU-friendly feature
Samsung also filed patents for TBB and TLPBB.
GFX13 def wants to add both + the additional improvements from various TLPBB patent filings.
 

Tachyonism

Junior Member
Jan 24, 2026
8
15
36
who said that.
MLID, that's the only source I have. If you have a better source, show it.
6050 starts at above-kilobuck laptops usually.
$1000 is still mid-range pricing (and even low-tier pricing for gaming laptops). If it's "premium pricing", I imagine it to be $1500, and in that price bracket you should be able to get an xx60 laptop too.
 

MrMPFR

Senior member
Aug 9, 2025
203
400
96
Yeah 9060 XT perf at desktop power levels should be possible. For Medusa Premium it would be lower due to power limits and also sharing mem BW with CPU.
How is that possible with only one SE (AT4) vs two SEs (Navi 44)? More ROPs and bigger Rasterizer?

If true then Navi 44 -> 48 extrapolation suggests AT3 dGPU = ~9070 - 9070XT.

Guess AT2 dGPU ~4090 is also possible?
 

adroc_thurston

Diamond Member
Jul 2, 2023
8,457
11,186
106

Kepler_L2

Golden Member
Sep 6, 2020
1,079
4,646
136
How is that possible with only one SE (AT4) vs two SEs (Navi 44)? More ROPs and bigger Rasterizer?

If true then Navi 44 -> 48 extrapolation suggests AT3 dGPU = ~9070 - 9070XT.

Guess AT2 dGPU ~4090 is also possible?
Modern games are rarely ROP-limited
 

MrMPFR

Senior member
Aug 9, 2025
203
400
96
Modern games are rarely ROP-limited
Assumed it had an impact due to poor 9070 -> 9070XT perf scaling in raster games: ~12% at 4K according to TPU. That's only half of the ~25% compute gain (based on TPU avg. game clock).
RT games and Blender show bigger increases, but still only ~15% avg and ~18% respectively.
Guess the issue is somewhere else.
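
To make the scaling numbers above easier to compare, here is a small sketch that expresses each observed gain as a fraction of the ~25% compute gain. The figures are the approximate ones quoted in this post; the `scaling_efficiency` helper and the dictionary labels are just names picked for the example.

```python
def scaling_efficiency(perf_gain_pct, compute_gain_pct):
    """Fraction of the extra compute that shows up as measured performance."""
    return perf_gain_pct / compute_gain_pct


compute_gain = 25.0  # approx. 9070 -> 9070 XT compute gain in percent, as quoted above

# Approximate observed gains in percent, as quoted above.
observed = {"raster 4K": 12.0, "RT avg": 15.0, "Blender": 18.0}

efficiency = {
    name: scaling_efficiency(gain, compute_gain) for name, gain in observed.items()
}
for name, eff in efficiency.items():
    print(f"{name}: {eff:.0%} of the compute gain shows up as performance")
```

With these inputs raster lands at roughly half, RT at about 60% and Blender at about 72%, so even the most compute-heavy case here leaves a good chunk of the added compute unused.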

Now AT4 has -8 CU (-25%), a +50% larger SE, halved RB and rasterizer (-50%), no MALL, L2 ?<32MB, and a weak LPDDR5X/?LPDDR6 memory interface.
1 / (0.75 WGP ratio x 1.15 freq ratio) = ~1.16, so about +16% IPC is needed when compute bound. It would need to be higher in raster games to offset the -50% RB + rasterizer (unless that HW is bigger).
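
The back-of-envelope IPC estimate above can be written out as a short calculation. This is only a sketch of the same arithmetic: it assumes performance scales linearly with unit count, clock and IPC when compute bound, and the +15% clock figure is the assumption used in this post, not a known spec. The `required_ipc_uplift` name is made up for the example.

```python
def required_ipc_uplift(unit_ratio, clock_ratio):
    """IPC ratio needed to match a reference GPU at equal performance.

    Assumes perf ~ units * clock * IPC, so IPC ratio = 1 / (unit_ratio * clock_ratio).
    """
    return 1.0 / (unit_ratio * clock_ratio)


cu_ratio = 24 / 32   # 24 CU vs 32 CU, i.e. -25%
clock_ratio = 1.15   # assumed +15% clock, taken from the estimate above

uplift = required_ipc_uplift(cu_ratio, clock_ratio)
print(f"required IPC uplift when compute bound: {uplift - 1:+.1%}")
```

Changing `clock_ratio` shows how sensitive the result is: with no clock gain at all the same formula asks for one third more IPC, so the +15% clock assumption is doing a lot of the work.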

So again impressive if AT4 dGPU achieves 9060XT raster perf.