Discussion RDNA 5 / UDNA (CDNA Next) speculation

Page 91 - AnandTech Forums

basix

Senior member
Oct 4, 2024
LP6-10667 offers higher bandwidth per channel (equal to a hypothetical 16 GT/s LP5X), probably needs less voltage, and will be produced by all three memory manufacturers, so it will probably be cheaper per GB.
So even cutting the AT3 interface to 75% width and going with 24 GB of LP6-10667 @ 288-bit might still be a better overall solution than 16 GB of this "Ultra-Pro" (probably also ultra-expensive) Samsung-only LP5X-12700.

LP6-10667 @ 288-bit would still give 480 GB/s, and with 24 GB there is less risk of the PCIe interface ever becoming a bottleneck.
A full-config AT3 will likely perform around the 9070 and has only an x8 PCIe link, so putting only 16 GB on it may actually be risky.
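To put the spill-over worry in rough numbers, here is a quick sketch. The PCIe 5.0 assumption and the per-lane rate are mine; the 480 GB/s VRAM figure is the one quoted above:

```python
# Rough comparison of local VRAM bandwidth vs. what an x8 PCIe link can
# feed the GPU once data spills into system RAM.
# Assumption: PCIe 5.0, ~3.94 GB/s usable per lane after encoding overhead.

vram_bw_gbs = 480.0      # LP6-10667 @ 288-bit figure claimed above
pcie_lane_gbs = 3.94     # PCIe 5.0 per-lane, post-encoding
pcie_x8_gbs = 8 * pcie_lane_gbs

print(f"PCIe 5.0 x8: {pcie_x8_gbs:.1f} GB/s")
print(f"VRAM is ~{vram_bw_gbs / pcie_x8_gbs:.0f}x faster than the spill path")
```

Any working set that overflows VRAM drops to roughly 1/15th of local bandwidth, which is why capacity headroom matters more on an x8 part.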
Good point. With LPDDR6 you could use a narrower than full-spec memory bus width.

Another option could be 12.8 Gbps at 240-bit: same bandwidth, but only 20 GB of VRAM.
Are we sure it works that way?
We've had DeltaColorCompression and internal compression on GPUs, with ongoing improvements, for over a decade, but it never noticeably reduced VRAM capacity requirements, only improved bandwidth efficiency.
The only way it could reduce capacity needs would be if data is stored compressed even in VRAM.
If you do not reduce the asset size in DRAM, you do not save DRAM bandwidth.
That's what AMD implies will happen - decompression is a lot quicker than (good) compression, so doing compression once when placing asset in memory makes sense.
You also need a HW-accelerated compressor: if you modify data and want to write it back to higher-level caches or DRAM, you need to compress it again.
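The distinction being argued here can be put in a toy model: link-only compression (classic DCC) keeps the full-size allocation in VRAM and only shrinks transfers, while compressing the asset at rest shrinks both. The function name and 2:1 ratio are illustrative, not AMD's actual scheme:

```python
# Toy model: link-only compression vs. compressed-at-rest storage.
# Ratio and numbers are illustrative only.

def footprint_and_traffic(asset_gb, reads, ratio, compressed_at_rest):
    """Return (VRAM footprint in GB, total read traffic in GB)."""
    traffic = asset_gb * reads / ratio  # link compression helps either way
    footprint = asset_gb / ratio if compressed_at_rest else asset_gb
    return footprint, traffic

dcc = footprint_and_traffic(4.0, 10, 2.0, compressed_at_rest=False)
at_rest = footprint_and_traffic(4.0, 10, 2.0, compressed_at_rest=True)
print(dcc)      # (4.0, 20.0) -> full footprint, halved traffic
print(at_rest)  # (2.0, 20.0) -> halved footprint AND halved traffic
```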

I think that's at 14.4 and not 10.7. (12 channel LPDDR6)
10.7 Gbps is correct. LPDDR6 is 1.5x wider than LPDDR5(X). With 14.4 Gbps at 384-bit (256-bit with LPDDR5) you would get 864 GB/s gross (net is less due to encoding overhead).
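The "1.5x wider" claim follows from the JEDEC channel widths: an LPDDR6 channel is 24 bits (two 12-bit sub-channels) versus 16 bits for LPDDR5(X). A quick check of the bus-width equivalence used above:

```python
# LPDDR6 channels are 24 bits wide (2 x 12-bit sub-channels) vs. 16 bits
# for LPDDR5(X), so the same channel count yields a 1.5x wider bus.

LPDDR5_CH_BITS = 16
LPDDR6_CH_BITS = 24

channels = 16
print(LPDDR6_CH_BITS / LPDDR5_CH_BITS)  # 1.5
print(channels * LPDDR6_CH_BITS)        # 384-bit LPDDR6 bus
print(channels * LPDDR5_CH_BITS)        # 256-bit LPDDR5(X) equivalent
```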
 

jpiniero

Lifer
Oct 1, 2010
10.7 Gbps are correct. LPDDR6 is 1.5x wider than LPDDR5(X).


This says 10.7 is 28 GB/sec effective per channel. So 12 channel would be 342 GB/sec.
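That 342 GB/s total works out to ~28.5 GB/s per channel (the 28 above is presumably rounded), versus a raw 32.1 GB/s per 24-bit channel at 10.7 GT/s before encoding overhead:

```python
# 12-channel LPDDR6 at the effective per-channel rate implied above.
# Raw per-channel would be 10.7 GT/s * 24 bits / 8 = 32.1 GB/s;
# encoding overhead eats the difference down to ~28.5 GB/s effective.

per_channel_eff_gbs = 28.5
channels = 12
print(channels * per_channel_eff_gbs)  # 342.0 GB/s
print(10.7 * 24 / 8)                   # ~32.1 GB/s raw per channel
```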

Regardless, it's clear that if any of these LPDDR6 dGPUs end up shipping, the marketing is going to be entirely around VRAM capacity, which could blow up in AMD's face if it ends up being more expensive than anticipated. Maybe AMD would just cancel those parts if that happens and ship the GDDR7 parts only.

Could still work out. I am not expecting much from the Rubin lower-end cards, and those will have less VRAM for sure.
 

branch_suggestion

Senior member
Aug 4, 2023
We don't know yet if desktop AT2 will get more than the 64 active CUs the leaked slide from MLID suggested; in that case it'd only be 33% more CUs.
AT2 is 70CU/35WGP(old).
4SE/8SA configured in 4/5, 4/5, 4/5, 4/4 (USR PHY is in the way presumably).
Config triggers me very much but BOM savings are vital in high volume parts.
 

basix

Senior member
Oct 4, 2024
The comboPHY is very much fixed-width.
Sure you can. Not by reducing the width of a channel, but simply by not using all channels. The same thing has been done on salvaged GPUs for ages ;)

Edit:
AT2 is 70CU/35WGP(old).
4SE/8SA configured in 4/5, 4/5, 4/5, 4/4 (USR PHY is in the way presumably).
Config triggers me very much but BOM savings are vital in high volume parts.
72 CU makes more sense to me. 24 per SE. Or more accurately 36 / 12 CU in RDNA5 terms.
  • AT0 = 8 SE, 96 CU, 512bit GDDR7 @ 32 Gbps, ~2048 GB/s, 32 GByte or more
  • AT2 = 3 SE, 36 CU, 192bit GDDR7 @ 32 Gbps, ~768 GB/s, 18...24 GByte
  • AT3 = 2 SE, 24 CU, 288bit LPDDR6 @ 10.7 Gbps // 240bit LPDDR6 @ 12.8 Gbps, ~480 GB/s, 20...24 GByte
    • Mobile: 256bit LPDDR5X @ 8.5...9.6 Gbps, ~340...384 GB/s
  • AT4 = 1 SE, 12 CU, 128bit LPDDR5X @ 9.6...10.7 Gbps, ~192...210 GB/s, 12...16 GByte
    • Mobile: 128bit LPDDR5X @ 8.5...9.6 Gbps, ~170...192 GB/s
AT3 and AT4 are difficult to pin down regarding the DRAM interface. Fewer DRAM packages thanks to LPDDR6 would be nice (smaller PCB footprint), but you could partly achieve that anyway by using x64 instead of x32 packages.
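As a sanity check on the table above, peak bandwidth is just pin rate times bus width divided by eight; only the GDDR7 entries are checked here, since the LPDDR lines quote effective rather than raw rates:

```python
# Peak bandwidth = data rate (Gbps per pin) * bus width (bits) / 8.

def peak_gbs(rate_gbps, bus_bits):
    return rate_gbps * bus_bits / 8

print(peak_gbs(32, 512))  # 2048.0 GB/s -> AT0 (512-bit GDDR7 @ 32 Gbps)
print(peak_gbs(32, 192))  # 768.0 GB/s  -> AT2 (192-bit GDDR7 @ 32 Gbps)
```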
 

reaperrr3

Member
May 31, 2024
Assumed it had an impact due to the poor 9070 -> 9070XT perf scaling in raster games: ~12% at 4K according to TPU. That's only about half of the ~25% compute gain (based on TPU avg. game clock).
RT games and Blender show bigger increases, but still only ~15% avg. and ~18% respectively.
Guess the issue is somewhere else.
Can be all kinds of things:
- slight CPU / driver overhead limitations
- slight command processor limitations
- Primitive / geometry throughput (tied to SE count)
- L1/L2/L3/mem bw/capacity holding back the XT a bit more, the latter might've needed GDDR7 to fully stretch its legs
- TPU's game selection is a bit meh

Keep in mind that some things like driver/CPU overhead, cache and memory bandwidth are soft limits. They don't hard-cap your FPS; rather, some frames are just computed slightly slower than they would be under ideal circumstances.

Perf scaling from adding CUs/SMs has been at only ~60-75% since forever.

Also, according to computerbase.de, the 9070XT is actually 14% faster than the 9070 at 4K (even 16% in RT), with only ~14.3% more CUs.
For comparison, the 5080 is only 15% faster than the 5070Ti in the same test (only 12% in RT), despite 20% more SMs, 33% more L2 and slightly faster VRAM than the latter.
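These comparisons all reduce to perf gain divided by resource gain. A sketch with the figures quoted in this post (TPU and computerbase.de numbers as stated above):

```python
# Scaling efficiency = relative perf gain / relative unit-count (or compute) gain.

def scaling_eff(perf_gain, unit_gain):
    return perf_gain / unit_gain

print(scaling_eff(0.14, 0.143))  # 9070 -> 9070XT raster (computerbase), ~0.98
print(scaling_eff(0.15, 0.20))   # 5070Ti -> 5080 raster, ~0.75
print(scaling_eff(0.12, 0.25))   # TPU 4K raster gain vs. compute gain, ~0.48
```

On these numbers RDNA4 scales almost perfectly per CU in the computerbase test, while the TPU compute-normalized figure lands near the bottom of the historical ~60-75% range.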
 

MrMPFR

Senior member
Aug 9, 2025
Can be all kinds of things:
- slight CPU / driver overhead limitations
- slight command processor limitations
- Primitive / geometry throughput (tied to SE count)
- L1/L2/L3/mem bw/capacity holding back the XT a bit more, the latter might've needed GDDR7 to fully stretch its legs
- TPU's game selection is a bit meh
- 5090 scaled far higher
- Possible, but 9060XT -> 9070XT showed better perf scaling. Maybe the issue arises at 14 -> 16 CUs plus higher clocks.
- IDK
- Possible.
- If you know someone else who compiles an entire stack of results for averages, let me know.

Perf scaling of adding CUs/SMs has been at only ~60-75% since forever.
Clock-adjusted, 9060XT -> 9070XT falls short by only ~5%: ~90% TFLOPS-to-FPS scaling at 4K.

Localizing basically everything to SEs in RDNA5 should solve that.

Also, according to computerbase.de, the 9070XT is actually 14% faster than the 9070 at 4K (even 16% in RT), with only ~14.3% more CUs.
Factor in clock speed as well. Still nowhere near 25%. TPU had higher averages for RT as well.

For comparison, the 5080 is only 15% faster than the 5070Ti in the same test (only 12% in RT), despite 20% more SM, 33% more L2 and slightly faster VRAM vs the latter.
Yeah, NVIDIA has serious problems with core scaling.
 