Discussion: RDNA 5 / UDNA (CDNA Next) speculation


basix

Member
Oct 4, 2024
163
322
96
I cannot tell you exactly how matrix-heavy the math in game engines is, but it is certain that matrices get used everywhere in games (you can google that if you want). Today that usually means doing N vector * vector operations instead of one matrix * vector operation. To some extent this split into multiple vectors is useful, because it allows for easy parallelization on the wide SIMD units of a GPU. But you can do that with matrices as well, because you have millions of pixels anyway.
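To make that concrete, here is a minimal C++ sketch (plain scalar code just to show the shape of the computation; vec4/mat4/dot are helper types made up for this post, not any engine's API): a mat4 * vec4 is exactly four row-against-vector dot products, so the same transform can be written either way.

```cpp
#include <array>
#include <cstdio>

using vec4 = std::array<float, 4>;
using mat4 = std::array<vec4, 4>; // row-major: m[row][col]

float dot(const vec4& a, const vec4& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

// "N-times vector * vector": one dot product per output component,
// which is how shader code usually spells a transform today.
vec4 transform(const mat4& m, const vec4& v) {
    return { dot(m[0], v), dot(m[1], v), dot(m[2], v), dot(m[3], v) };
}

int main() {
    mat4 scale{{ {2,0,0,0}, {0,2,0,0}, {0,0,2,0}, {0,0,0,1} }};
    vec4 p = transform(scale, {1, 2, 3, 1});
    std::printf("%g %g %g %g\n", p[0], p[1], p[2], p[3]); // 2 4 6 1
}
```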

I would suspect that the cooperative vector API greatly reduces the transition overhead between the matrix cores and the other parts of a CU. In the end, everything uses the same registers and caches of a CU or SM, so the main thing you need to care about is data alignment (vectors vs. matrices), so that you do not need to shuffle your data around when switching between vectors and matrices. How big the actual benefits will be remains to be seen. I hope we see some talks and presentations about cooperative vectors from AMD, Nvidia and game developers.
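A hedged illustration of the shuffling I mean (plain C++; both layouts are made up for the example): if the vector code keeps data per pixel but a matrix path wants column-packed tiles, every transition pays a repack like the one below, and co-designing the layouts makes it disappear.

```cpp
#include <array>
#include <vector>

// Hypothetical layouts for the example: per-pixel features (AoS) vs. a
// column-packed 4x8 tile that a matrix path would consume.
struct PixelFeatures { std::array<float, 4> f; };

// Repacks 8 pixels' features into a 4x8 column-major tile. This copy is
// exactly the transition overhead you want the data layout to avoid.
std::array<float, 32> pack_tile(const std::vector<PixelFeatures>& px, size_t base) {
    std::array<float, 32> tile{};
    for (size_t col = 0; col < 8; ++col)
        for (size_t row = 0; row < 4; ++row)
            tile[col * 4 + row] = px[base + col].f[row];
    return tile;
}

int main() {
    std::vector<PixelFeatures> px(8, PixelFeatures{ {1, 2, 3, 4} });
    auto tile = pack_tile(px, 0);
    return tile[0] == 1.0f ? 0 : 1;
}
```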

For me, there is another reason for matrices:
Optimizing performance in general. Game developers have used vector math for ages. Now they could reshape their algorithms to do direct matrix math. Will it be faster? It depends on the use case. But I could very well imagine that it would allow developers to push further. As you said, with matrix operations you optimize for bandwidth and power, and both are scarce on GPUs if you want optimal performance. Will that take some effort and time? Sure.

I do not see it happening too soon; the earliest would be the next console cycle, because those consoles will support WMMA acceleration. If you want to squeeze the maximum out of a console, you optimize your code and increase the utilization of the available hardware units. Most image filter kernels (e.g. Lanczos, as used in game postprocessing) are matrices, vector * matrix products handle orientation and transformation, dot products cover the geometric tests, and so on. If the data is aligned right, you can put it into vectors or matrices; the result is the same.
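To make the filter-kernel point concrete, a minimal sketch (C++, with made-up 4-tap weights standing in for a real Lanczos kernel): each output pixel is a dot product of a weight vector with a window of samples, and with the right layout the same windows can be fed to vector ALUs or batched into a matrix multiply.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Placeholder 4-tap weights, NOT the exact Lanczos kernel.
    const float w[4] = { -0.05f, 0.55f, 0.55f, -0.05f };
    std::vector<float> src = { 0.1f, 0.4f, 0.9f, 0.7f, 0.3f, 0.2f }; // one scanline
    std::vector<float> dst(src.size() - 3);

    for (size_t i = 0; i < dst.size(); ++i)
        // One output pixel = dot(weights, 4-sample window).
        dst[i] = w[0]*src[i] + w[1]*src[i+1] + w[2]*src[i+2] + w[3]*src[i+3];

    for (float v : dst) std::printf("%g ", v);
    std::printf("\n");
}
```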
 
  • Haha
Reactions: Bigos

Bigos

Member
Jun 2, 2019
199
515
136
basix said:
I cannot tell you exactly how matrix-heavy the math in game engines is, but it is certain that matrices get used everywhere in games (you can google that if you want). Today that usually means doing N vector * vector operations instead of one matrix * vector operation. To some extent this split into multiple vectors is useful, because it allows for easy parallelization on the wide SIMD units of a GPU. But you can do that with matrices as well, because you have millions of pixels anyway.

The matrices used in games are 4x4 at most, and those are not the target of the tensor units. And matrix x vector computations have not been parallelized within a lane since the TeraScale era: on scalar SIMT hardware they compile to a sequence of scalar multiply-adds, with the parallelism coming from running many pixels at once.

What are you talking about?
 
  • Like
Reactions: marees

511

Diamond Member
Jul 12, 2024
3,220
3,160
106
basix said:
I cannot tell you exactly how matrix-heavy the math in game engines is, but it is certain that matrices get used everywhere in games (you can google that if you want). Today that usually means doing N vector * vector operations instead of one matrix * vector operation. To some extent this split into multiple vectors is useful, because it allows for easy parallelization on the wide SIMD units of a GPU. But you can do that with matrices as well, because you have millions of pixels anyway.

I would suspect that the cooperative vector API greatly reduces the transition overhead between the matrix cores and the other parts of a CU. In the end, everything uses the same registers and caches of a CU or SM, so the main thing you need to care about is data alignment (vectors vs. matrices), so that you do not need to shuffle your data around when switching between vectors and matrices. How big the actual benefits will be remains to be seen. I hope we see some talks and presentations about cooperative vectors from AMD, Nvidia and game developers.

For me, there is another reason for matrices:
Optimizing performance in general. Game developers have used vector math for ages. Now they could reshape their algorithms to do direct matrix math. Will it be faster? It depends on the use case. But I could very well imagine that it would allow developers to push further. As you said, with matrix operations you optimize for bandwidth and power, and both are scarce on GPUs if you want optimal performance. Will that take some effort and time? Sure.

I do not see it happening too soon; the earliest would be the next console cycle, because those consoles will support WMMA acceleration. If you want to squeeze the maximum out of a console, you optimize your code and increase the utilization of the available hardware units. Most image filter kernels (e.g. Lanczos, as used in game postprocessing) are matrices, vector * matrix products handle orientation and transformation, dot products cover the geometric tests, and so on. If the data is aligned right, you can put it into vectors or matrices; the result is the same.
Second this. Even a basic triangle, the primitive everything is rendered with, needs matrix operations: something as basic as testing whether a point is inside a triangle requires the determinant of a matrix.
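For illustration, a minimal C++ sketch of that test in the usual edge-function form, where each edge check is the determinant of a 2x2 matrix built from an edge vector and the vector to the point:

```cpp
#include <cstdio>

struct P { float x, y; };

// Determinant of the 2x2 matrix [b-a | p-a]: positive on one side of
// the edge a->b, negative on the other, zero exactly on it.
float edge(P a, P b, P p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Point is inside the triangle iff all three edge determinants agree in sign
// (either winding order accepted).
bool inside(P a, P b, P c, P p) {
    float d0 = edge(a, b, p), d1 = edge(b, c, p), d2 = edge(c, a, p);
    return (d0 >= 0 && d1 >= 0 && d2 >= 0) || (d0 <= 0 && d1 <= 0 && d2 <= 0);
}

int main() {
    P a{0,0}, b{4,0}, c{0,4};
    std::printf("%d\n", inside(a, b, c, P{1,1})); // 1: inside
    std::printf("%d\n", inside(a, b, c, P{5,5})); // 0: outside
}
```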
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,651
2,485
136
511 said:
Second this. Even a basic triangle, the primitive everything is rendered with, needs matrix operations: something as basic as testing whether a point is inside a triangle requires the determinant of a matrix.

What Bigos said: yes, everything is matrix math, but the matrices are all 4x4 at most. The tensor units are optimized for much, much larger matrices.
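To make the mismatch concrete (a hedged sketch in plain scalar C++, not real WMMA code, and real hardware tile shapes differ): the only way lots of small 4x4 matrices feed a big matrix unit efficiently is by batching, turning one mat4 applied to many vertices into a single 4xN product.

```cpp
#include <array>
#include <vector>

using vec4 = std::array<float, 4>;

// One 4x4 transform over a whole batch of vertices is effectively a
// (4 x 4) * (4 x N) matrix product -- a shape a matrix unit can use,
// unlike a lone mat4 * vec4 that only touches 16 weights.
std::vector<vec4> transform_batch(const std::array<vec4, 4>& M,
                                  const std::vector<vec4>& verts) {
    std::vector<vec4> out(verts.size());
    for (size_t n = 0; n < verts.size(); ++n)   // columns of the 4xN operand
        for (int r = 0; r < 4; ++r) {           // rows of M
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += M[r][k] * verts[n][k];
            out[n][r] = acc;
        }
    return out;
}

int main() {
    std::array<vec4, 4> I{{ {1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1} }};
    auto out = transform_batch(I, { {1,2,3,1}, {4,5,6,1} });
    return out.size() == 2 ? 0 : 1;
}
```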
 
  • Like
Reactions: marees

marees

Golden Member
Apr 28, 2024
1,301
1,865
96

AMD to split flagship AI GPUs into specialized lineups for AI and HPC, add UALink — Instinct MI400-series models take a different path


Starting from its next-generation Instinct MI400-series, AMD will offer distinct processors for AI and supercomputers in a bid to maximize performance for each workload, according to SemiAnalysis. However, there might be a problem with the scalability of these compute GPUs.

AMD plans to offer the Instinct MI450X for AI and the Instinct MI430X for HPC sometime in the second half of 2026. Both processors will rely on subsets of the CDNA Next architecture, but will be tailored for low-precision AI compute (FP4, FP8, BF16) or high-precision HPC compute (FP32, FP64). Such bifurcation of positioning will enable AMD to remove FP32 and FP64 logic from MI450X as well as FP4, FP8, and BF16 logic from MI430X, therefore maximizing die space for respective logic.



In addition to workload optimizations, AMD's Instinct MI400-series accelerators will also feature not only Infinity Fabric but also UALink interconnections, which will make them some of the first AI and HPC GPUs to feature UALink, a technology designed to challenge NVLink. But there is a major problem with UALink.

Support for UALink will be limited in 2026 due to the absence of switching silicon from external vendors, including Astera Labs, Auradine, Enfabrica, and XConn. As a result, the Instinct MI430X will only be usable in small configurations in topologies like mesh or torus, as there will be no UALink switches next year. AMD does not develop its own UALink switches and therefore relies entirely on partners, which may not be ready in the second half of next year.

Progress in UALink development has been slow due to coordination delays in the standards body. According to SemiAnalysis, chipmakers like Broadcom view the market for such switches as too small and are not assigning enough engineering resources to accelerate timelines. By contrast, networking initiatives under the Ultra Ethernet Consortium are advancing more quickly and already have compatible hardware available commercially.


 
  • Like
Reactions: RnR_au and Tlh97

moinmoin

Diamond Member
Jun 1, 2017
5,235
8,443
136
I wonder whether the differentiation between MI450X and MI430X is done purely at the SKU level, or whether dedicated dies are involved.
 
  • Like
Reactions: Tlh97 and marees

moinmoin

Diamond Member
Jun 1, 2017
5,235
8,443
136
It says as much in the article.
SemiAnalysis (the source) only writes they "believe" there will be different SKUs:

Tom's Hardware's "article" rambles on that and claims the design would be "bifurcated" and "maximize die space for respective logic":
Such bifurcation of positioning will enable AMD to remove FP32 and FP64 logic from MI450X as well as FP4, FP8, and BF16 logic from MI430X, therefore maximizing die space for respective logic.

So you take the talk about SKUs at face value and think AMD uses the same die? Or do you take Tom's Hardware "article's" take and claim yourself it's different dies?
 
  • Like
Reactions: Mopetar and marees

basix

Member
Oct 4, 2024
163
322
96
Maybe something like this?
- MI450X = 1x FP64, 1x Low Precision
- MI430X = 2x FP64, 0.5x Low Precision

AMD really should merge two separate L0 caches per WGP in UDNA.
They did (if gfx12.5 is any indication).
Very nice! Any link to that info? Or did they just double L0$ capacity in general?

As an additional note:
There was a paper from 2020 (some university together with AMD) which proposed sharing L1 caches (L0 on RDNA) across the GPU. I would propose narrowing the shared scope down to one Shader Engine, but in any case, some ML/AI workloads in particular benefited heavily. I could imagine this translating to raytracing as well. In effect this would be a very nice win-win for ML/AI on HPC GPUs (CDNA5) and RT/ML/AI on gaming GPUs (RDNA5), and also one of the pivots towards UDNA.
 

Kepler_L2

Senior member
Sep 6, 2020
922
3,771
136
moinmoin said:
I wonder whether the differentiation between MI450X and MI430X is done purely at the SKU level, or whether dedicated dies are involved.
Different XCDs
moinmoin said:
SemiAnalysis (the source) only writes they "believe" there will be different SKUs. Tom's Hardware's "article" rambles on that and claims the design would be "bifurcated" and would "maximize die space for respective logic". So you take the talk about SKUs at face value and think AMD uses the same die? Or do you take the Tom's Hardware "article's" take and claim yourself it's different dies?
Nothing is removed, as that would break binary compatibility; there are just different FLOPS ratios.
basix said:
Maybe something like this?
- MI450X = 1x FP64, 1x Low Precision
- MI430X = 2x FP64, 0.5x Low Precision

Very nice! Any link to that info? Or did they just double L0$ capacity in general?

As an additional note:
There was a paper from 2020 (some university together with AMD) which proposed sharing L1 caches (L0 on RDNA) across the GPU. I would propose narrowing the shared scope down to one Shader Engine, but in any case, some ML/AI workloads in particular benefited heavily. I could imagine this translating to raytracing as well. In effect this would be a very nice win-win for ML/AI on HPC GPUs (CDNA5) and RT/ML/AI on gaming GPUs (RDNA5), and also one of the pivots towards UDNA.
LLVM patches; WGP mode is now the standard, with CU mode only existing as a legacy fallback.
 

basix

Member
Oct 4, 2024
163
322
96
I mean, AMD's chiplet architecture is perfectly suited for different XCDs. Much better than one big 800mm² die, as the XCDs will probably stay between ~130-170mm².

Hmm, looking at that XCD die size: it could be well in the range of the 32C Zen 6 chiplet. So we might see an 8x32C = 256C MI400C variant.
 

basix

Member
Oct 4, 2024
163
322
96
But it could. Zen 4 showed us that the area overhead is minimal (MI300C).

But anyway: taking 1x 32C or 3x 12C does not make a huge difference in core count. But the 12C chiplets could be clocked higher.
 

Bryo4321

Member
Dec 5, 2024
63
124
66
AI push. Matrix Cores?
Just sounds like FSR4. He basically said as much:

“"The algorithm they came up with could be implemented on current-generation hardware," said Cerny. "So the co-developed algorithm has already been released by AMD as part of FSR 4 on PC. And we're in the process of implementing it on PS5 and it will release next year on PS5 Pro."
 

basix

Member
Oct 4, 2024
163
322
96
Cerny mentioned that they want to integrate a more general-purpose CNN/DNN hardware architecture in the PS6. Those are essentially matrix cores as you can already find in RDNA4, indeed.
 

dr1337

Senior member
May 25, 2020
483
773
136
If they do have two dies for UDNA, maybe the AI die is also usable as a flagship gaming GPU and Radeon Pro. FP64 is already cut down in Radeon cards as is, and it would make them more cost-effective since the dies get more re-use.
 

Kepler_L2

Senior member
Sep 6, 2020
922
3,771
136
dr1337 said:
If they do have two dies for UDNA, maybe the AI die is also usable as a flagship gaming GPU and Radeon Pro. FP64 is already cut down in Radeon cards as is, and it would make them more cost-effective since the dies get more re-use.
MI450X doesn't have any of the texture/geometry/RT engine stuff; it also doesn't have any of the gfx13 goodies like SWC/WGS/DGF.
 

ToTTenTranz

Senior member
Feb 4, 2021
486
880
136
basix said:
Cerny mentioned that they want to integrate a more general-purpose CNN/DNN hardware architecture in the PS6.

They'll probably want to transition to transformer models like Nvidia did, as those train better with AI-generated data.