Discussion: RDNA 5 / UDNA (CDNA Next) speculation


basix

Member
Oct 4, 2024
163
322
96
I cannot tell you exactly how matrix-heavy the math in game engines is, but it is certain that matrices get used everywhere in games (you can google that if you want). Today that usually means doing N vector * vector operations instead of one matrix * vector operation. To some extent this split into multiple vectors is useful, because it allows for easy parallelization on the wide SIMD units of a GPU. But you can do that with matrices as well, because you have millions of pixels anyway.
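To make that concrete, here is a minimal C++ sketch (plain scalar code just to show the shape of the computation; vec4/mat4/dot are helper types made up for this post, not any engine's API): a mat4 * vec4 is exactly four row-against-vector dot products, so the same transform can be written either way.

```cpp
#include <array>
#include <cstdio>

using vec4 = std::array<float, 4>;
using mat4 = std::array<vec4, 4>; // row-major: m[row][col]

float dot(const vec4& a, const vec4& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

// "N-times vector * vector": one dot product per output component,
// which is how shader code usually spells a transform today.
vec4 transform(const mat4& m, const vec4& v) {
    return { dot(m[0], v), dot(m[1], v), dot(m[2], v), dot(m[3], v) };
}

int main() {
    mat4 scale{{ {2,0,0,0}, {0,2,0,0}, {0,0,2,0}, {0,0,0,1} }};
    vec4 p = transform(scale, {1, 2, 3, 1});
    std::printf("%g %g %g %g\n", p[0], p[1], p[2], p[3]); // 2 4 6 1
}
```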

I would suspect that the cooperative vector API greatly reduces the transition overhead between the matrix cores and the other parts of a CU. In the end, everything uses the same registers and caches of a CU or SM, so the main thing you need to care about is data alignment (vectors vs. matrices), so that you do not need to shuffle your data around when switching between vectors and matrices. How big the actual benefits will be remains to be seen. I hope we see some talks and presentations about cooperative vectors from AMD, Nvidia and game developers.
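A hedged illustration of the shuffling I mean (plain C++; both layouts are made up for the example): if the vector code keeps data per pixel but a matrix path wants column-packed tiles, every transition pays a repack like the one below, and co-designing the layouts makes it disappear.

```cpp
#include <array>
#include <vector>

// Hypothetical layouts for the example: per-pixel features (AoS) vs. a
// column-packed 4x8 tile that a matrix path would consume.
struct PixelFeatures { std::array<float, 4> f; };

// Repacks 8 pixels' features into a 4x8 column-major tile. This copy is
// exactly the transition overhead you want the data layout to avoid.
std::array<float, 32> pack_tile(const std::vector<PixelFeatures>& px, size_t base) {
    std::array<float, 32> tile{};
    for (size_t col = 0; col < 8; ++col)
        for (size_t row = 0; row < 4; ++row)
            tile[col * 4 + row] = px[base + col].f[row];
    return tile;
}

int main() {
    std::vector<PixelFeatures> px(8, PixelFeatures{ {1, 2, 3, 4} });
    auto tile = pack_tile(px, 0);
    return tile[0] == 1.0f ? 0 : 1;
}
```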

For me, there is another reason for matrices:
Optimizing performance in general. Game developers have used vector math for ages. Now they could reshape their algorithms to do direct matrix math. Will it be faster? It depends on the use case. But I could very well imagine that it would allow developers to push further. As you said, with matrix operations you optimize for bandwidth and power, and both are scarce on GPUs if you want optimal performance. Will that take some effort and time? Sure.

I do not see it happening too soon; the earliest would be the next console cycle, because those consoles will support WMMA acceleration. If you want to squeeze the maximum out of a console, you optimize your code and increase the utilization of the available hardware units. Most image filter kernels (e.g. Lanczos, as used in game postprocessing) are matrices, vector * matrix products handle orientation and transformation, dot products cover the geometric tests, and so on. If the data is aligned right, you can put it into vectors or matrices; the result is the same.
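To make the filter-kernel point concrete, a minimal sketch (C++, with made-up 4-tap weights standing in for a real Lanczos kernel): each output pixel is a dot product of a weight vector with a window of samples, and with the right layout the same windows can be fed to vector ALUs or batched into a matrix multiply.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Placeholder 4-tap weights, NOT the exact Lanczos kernel.
    const float w[4] = { -0.05f, 0.55f, 0.55f, -0.05f };
    std::vector<float> src = { 0.1f, 0.4f, 0.9f, 0.7f, 0.3f, 0.2f }; // one scanline
    std::vector<float> dst(src.size() - 3);

    for (size_t i = 0; i < dst.size(); ++i)
        // One output pixel = dot(weights, 4-sample window).
        dst[i] = w[0]*src[i] + w[1]*src[i+1] + w[2]*src[i+2] + w[3]*src[i+3];

    for (float v : dst) std::printf("%g ", v);
    std::printf("\n");
}
```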
 
  • Haha
Reactions: Bigos

Bigos

Member
Jun 2, 2019
199
515
136
basix said:
I cannot tell you exactly how matrix-heavy the math in game engines is, but it is certain that matrices get used everywhere in games (you can google that if you want). Today that usually means doing N vector * vector operations instead of one matrix * vector operation. To some extent this split into multiple vectors is useful, because it allows for easy parallelization on the wide SIMD units of a GPU. But you can do that with matrices as well, because you have millions of pixels anyway.

The matrices used in games are 4x4 at most, and those are not the target of the tensor units. And matrix x vector computations have not been parallelized within a lane since the TeraScale era: on scalar SIMT hardware they compile to a sequence of scalar multiply-adds, with the parallelism coming from running many pixels at once.

What are you talking about?
 
  • Like
Reactions: marees

511

Diamond Member
Jul 12, 2024
3,220
3,160
106
basix said:
I cannot tell you exactly how matrix-heavy the math in game engines is, but it is certain that matrices get used everywhere in games (you can google that if you want). Today that usually means doing N vector * vector operations instead of one matrix * vector operation. To some extent this split into multiple vectors is useful, because it allows for easy parallelization on the wide SIMD units of a GPU. But you can do that with matrices as well, because you have millions of pixels anyway.

I would suspect that the cooperative vector API greatly reduces the transition overhead between the matrix cores and the other parts of a CU. In the end, everything uses the same registers and caches of a CU or SM, so the main thing you need to care about is data alignment (vectors vs. matrices), so that you do not need to shuffle your data around when switching between vectors and matrices. How big the actual benefits will be remains to be seen. I hope we see some talks and presentations about cooperative vectors from AMD, Nvidia and game developers.

For me, there is another reason for matrices:
Optimizing performance in general. Game developers have used vector math for ages. Now they could reshape their algorithms to do direct matrix math. Will it be faster? It depends on the use case. But I could very well imagine that it would allow developers to push further. As you said, with matrix operations you optimize for bandwidth and power, and both are scarce on GPUs if you want optimal performance. Will that take some effort and time? Sure.

I do not see it happening too soon; the earliest would be the next console cycle, because those consoles will support WMMA acceleration. If you want to squeeze the maximum out of a console, you optimize your code and increase the utilization of the available hardware units. Most image filter kernels (e.g. Lanczos, as used in game postprocessing) are matrices, vector * matrix products handle orientation and transformation, dot products cover the geometric tests, and so on. If the data is aligned right, you can put it into vectors or matrices; the result is the same.
Second this. Even a basic triangle, the primitive everything is rendered with, needs matrix operations: something as basic as testing whether a point is inside a triangle requires the determinant of a matrix.
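For illustration, a minimal C++ sketch of that test in the usual edge-function form, where each edge check is the determinant of a 2x2 matrix built from an edge vector and the vector to the point:

```cpp
#include <cstdio>

struct P { float x, y; };

// Determinant of the 2x2 matrix [b-a | p-a]: positive on one side of
// the edge a->b, negative on the other, zero exactly on it.
float edge(P a, P b, P p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Point is inside the triangle iff all three edge determinants agree in sign
// (either winding order accepted).
bool inside(P a, P b, P c, P p) {
    float d0 = edge(a, b, p), d1 = edge(b, c, p), d2 = edge(c, a, p);
    return (d0 >= 0 && d1 >= 0 && d2 >= 0) || (d0 <= 0 && d1 <= 0 && d2 <= 0);
}

int main() {
    P a{0,0}, b{4,0}, c{0,4};
    std::printf("%d\n", inside(a, b, c, P{1,1})); // 1: inside
    std::printf("%d\n", inside(a, b, c, P{5,5})); // 0: outside
}
```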
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,651
2,485
136
511 said:
Second this. Even a basic triangle, the primitive everything is rendered with, needs matrix operations: something as basic as testing whether a point is inside a triangle requires the determinant of a matrix.

What Bigos said: yes, everything is matrix math, but the matrices are all 4x4 at most. The tensor units are optimized for much, much larger matrices.
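To make the mismatch concrete (a hedged sketch in plain scalar C++, not real WMMA code, and real hardware tile shapes differ): the only way lots of small 4x4 matrices feed a big matrix unit efficiently is by batching, turning one mat4 applied to many vertices into a single 4xN product.

```cpp
#include <array>
#include <vector>

using vec4 = std::array<float, 4>;

// One 4x4 transform over a whole batch of vertices is effectively a
// (4 x 4) * (4 x N) matrix product -- a shape a matrix unit can use,
// unlike a lone mat4 * vec4 that only touches 16 weights.
std::vector<vec4> transform_batch(const std::array<vec4, 4>& M,
                                  const std::vector<vec4>& verts) {
    std::vector<vec4> out(verts.size());
    for (size_t n = 0; n < verts.size(); ++n)   // columns of the 4xN operand
        for (int r = 0; r < 4; ++r) {           // rows of M
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += M[r][k] * verts[n][k];
            out[n][r] = acc;
        }
    return out;
}

int main() {
    std::array<vec4, 4> I{{ {1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1} }};
    auto out = transform_batch(I, { {1,2,3,1}, {4,5,6,1} });
    return out.size() == 2 ? 0 : 1;
}
```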
 
  • Like
Reactions: marees

marees

Golden Member
Apr 28, 2024
1,301
1,865
96

AMD to split flagship AI GPUs into specialized lineups for AI and HPC, add UALink — Instinct MI400-series models take a different path


Starting from its next-generation Instinct MI400-series, AMD will offer distinct processors for AI and supercomputers in a bid to maximize performance for each workload, according to SemiAnalysis. However, there might be a problem with the scalability of these compute GPUs.

AMD plans to offer the Instinct MI450X for AI and the Instinct MI430X for HPC sometime in the second half of 2026. Both processors will rely on subsets of the CDNA Next architecture, but will be tailored for low-precision AI compute (FP4, FP8, BF16) or high-precision HPC compute (FP32, FP64). Such bifurcation of positioning will enable AMD to remove FP32 and FP64 logic from MI450X as well as FP4, FP8, and BF16 logic from MI430X, therefore maximizing die space for respective logic.



In addition to workload optimizations, AMD's Instinct MI400-series accelerators will also feature not only Infinity Fabric but also UALink interconnections, which will make them some of the first AI and HPC GPUs to feature UALink, a technology designed to challenge NVLink. But there is a major problem with UALink.

Support for UALink will be limited in 2026 due to the absence of switching silicon from external vendors, including Astera Labs, Auradine, Enfabrica, and XConn. As a result, the Instinct MI430X will only be usable in small configurations in topologies like mesh or torus, as there will be no UALink switches next year. AMD does not develop its own UALink switches and therefore relies entirely on partners, which may not be ready in the second half of next year.

Progress in UALink development has been slow due to coordination delays in the standards body. According to SemiAnalysis, chipmakers like Broadcom view the market for such switches as too small and are not assigning enough engineering resources to accelerate timelines. By contrast, networking initiatives under the Ultra Ethernet Consortium are advancing more quickly and already have compatible hardware available commercially.


 
  • Like
Reactions: RnR_au and Tlh97

moinmoin

Diamond Member
Jun 1, 2017
5,235
8,443
136
I wonder whether the differentiation between MI450X and MI430X is done purely at the SKU level, or whether dedicated dies are involved.
 
  • Like
Reactions: Tlh97 and marees

moinmoin

Diamond Member
Jun 1, 2017
5,235
8,443
136
It says as much in the article.
SemiAnalysis (the source) only writes they "believe" there will be different SKUs:

Tom's Hardware's "article" rambles on that and claims the design would be "bifurcated" and "maximize die space for respective logic":
Such bifurcation of positioning will enable AMD to remove FP32 and FP64 logic from MI450X as well as FP4, FP8, and BF16 logic from MI430X, therefore maximizing die space for respective logic.

So you take the talk about SKUs at face value and think AMD uses the same die? Or do you take Tom's Hardware "article's" take and claim yourself it's different dies?
 
  • Like
Reactions: Mopetar and marees

basix

Member
Oct 4, 2024
163
322
96
Maybe something like this?
- MI450X = 1x FP64, 1x Low Precision
- MI430X = 2x FP64, 0.5x Low Precision

AMD really should merge two separate L0 caches per WGP in UDNA.
They did (if gfx12.5 is any indication).
Very nice! Any link to that info? Or did they just double L0$ capacity in general?

As an additional note:
There was a paper from 2020 (some university together with AMD) which proposed sharing L1 caches (L0 on RDNA) across the GPU. I would propose narrowing the shared scope down to one Shader Engine, but in any case, some ML/AI workloads in particular benefited heavily. I could imagine this translating to raytracing as well. In effect this would be a very nice win-win for ML/AI on HPC GPUs (CDNA5) and RT/ML/AI on gaming GPUs (RDNA5), and also one of the pivots towards UDNA.
 

Kepler_L2

Senior member
Sep 6, 2020
922
3,771
136
moinmoin said:
I wonder whether the differentiation between MI450X and MI430X is done purely at the SKU level, or whether dedicated dies are involved.
Different XCDs
moinmoin said:
SemiAnalysis (the source) only writes they "believe" there will be different SKUs. Tom's Hardware's "article" rambles on that and claims the design would be "bifurcated" and would "maximize die space for respective logic". So you take the talk about SKUs at face value and think AMD uses the same die? Or do you take the Tom's Hardware "article's" take and claim yourself it's different dies?
Nothing is removed, as that would break binary compatibility; there are just different FLOPS ratios.
basix said:
Maybe something like this?
- MI450X = 1x FP64, 1x Low Precision
- MI430X = 2x FP64, 0.5x Low Precision

Very nice! Any link to that info? Or did they just double L0$ capacity in general?

As an additional note:
There was a paper from 2020 (some university together with AMD) which proposed sharing L1 caches (L0 on RDNA) across the GPU. I would propose narrowing the shared scope down to one Shader Engine, but in any case, some ML/AI workloads in particular benefited heavily. I could imagine this translating to raytracing as well. In effect this would be a very nice win-win for ML/AI on HPC GPUs (CDNA5) and RT/ML/AI on gaming GPUs (RDNA5), and also one of the pivots towards UDNA.
LLVM patches; WGP mode is now the standard, with CU mode only existing as a legacy fallback.
 

basix

Member
Oct 4, 2024
163
322
96
I mean, AMD's chiplet architecture is perfectly suited for different XCDs. Much better than one big 800mm² die, as the XCDs will probably stay between ~130-170mm².

Hmm, looking at that XCD die size: it could be well in the range of the 32C Zen 6 chiplet. So we might see an 8x32C = 256C MI400C variant.
 

basix

Member
Oct 4, 2024
163
322
96
But it could. Zen 4 showed us that the area overhead is minimal (MI300C).

But anyway: taking 1x 32C or 3x 12C does not make a huge difference in core count. But the 12C chiplets could be clocked higher.
 

Bryo4321

Member
Dec 5, 2024
63
124
66
AI push. Matrix Cores?
Just sounds like FSR4. He basically said as much:

“"The algorithm they came up with could be implemented on current-generation hardware," said Cerny. "So the co-developed algorithm has already been released by AMD as part of FSR 4 on PC. And we're in the process of implementing it on PS5 and it will release next year on PS5 Pro."
 

basix

Member
Oct 4, 2024
163
322
96
Cerny mentioned that they want to integrate a more general-purpose CNN/DNN hardware architecture in the PS6. Those are essentially matrix cores as you can already find in RDNA4, indeed.
 

dr1337

Senior member
May 25, 2020
483
773
136
If they do have two dies for UDNA, maybe the AI die is also usable as a flagship gaming GPU and Radeon Pro. FP64 is already cut down in Radeon cards as is, and it would make them more cost-effective since the dies get more re-use.
 

Kepler_L2

Senior member
Sep 6, 2020
922
3,771
136
dr1337 said:
If they do have two dies for UDNA, maybe the AI die is also usable as a flagship gaming GPU and Radeon Pro. FP64 is already cut down in Radeon cards as is, and it would make them more cost-effective since the dies get more re-use.
MI450X doesn't have any of the texture/geometry/RT engine stuff; it also doesn't have any of the gfx13 goodies like SWC/WGS/DGF.
 

ToTTenTranz

Senior member
Feb 4, 2021
486
880
136
basix said:
Cerny mentioned that they want to integrate a more general-purpose CNN/DNN hardware architecture in the PS6.

They'll probably want to transition to transformer models like Nvidia did, as those train better with AI-generated data.