Question Speculation: RDNA3 + CDNA2 Architectures Thread

Page 42 - AnandTech Forums

uzzi38

Platinum Member
Oct 16, 2019
2,622
5,880
146

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
UNIFORM CACHE SYSTEM FOR FAST DATA ACCESS

Someone on Twitter found this patent some time ago; only now, looking again at @Kepler_L2's diagram of RDNA3, I think I understand it...
Each SIMD will have its own L0 cache; on a miss, it will look at the other L0s inside the WGP. Since the SIMDs work in pairs, there are 6 L0s to search
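To make the lookup order concrete, here is a rough Python sketch of how I read the patent: each SIMD checks its own L0 first, then probes the peer L0s inside the same WGP, and only then falls through to the next cache level. All the names and numbers here are illustrative guesses, not confirmed RDNA3 details.

```python
# Hypothetical model of the patent's lookup order: local L0 first,
# then peer L0s in the WGP, then miss to the next cache level.
# SIMD count and addresses are made up for illustration.

class SIMD:
    def __init__(self, name):
        self.name = name
        self.l0 = {}  # address -> data

def wgp_lookup(simds, requester, addr):
    """Return (data, where_found) for a load issued by `requester`."""
    if addr in requester.l0:                 # local L0 hit
        return requester.l0[addr], requester.name
    for peer in simds:                       # probe the peer L0s in the WGP
        if peer is not requester and addr in peer.l0:
            return peer.l0[addr], peer.name
    return None, "miss -> next level"        # all L0s missed

simds = [SIMD(f"SIMD{i}") for i in range(6)]
simds[4].l0[0x100] = "texel"
print(wgp_lookup(simds, simds[0], 0x100))    # found in a peer L0
print(wgp_lookup(simds, simds[0], 0x200))    # miss, go to GL1/L2
```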
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,125
6,296
136

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I find it very interesting that both AMD and Nvidia are converging on similar architectural techniques.
Shouldn't be too surprising. Both players move within the same ecosystem (OS, graphics abstraction layers, games), for which they continuously have to optimize their hardware and drivers. The biggest divergence naturally is in new tech, e.g. ray tracing. Though the convergence does seem to be speeding up.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136

I guess once the gains from a new process diminish and new processes take longer to become viable, the architecture becomes much more important for both performance and efficiency. It makes sense that there really is a best design and they both are converging toward it.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
6,783
7,115
136
I doubt this is converging to "the" best design imaginable, but within the already existing ecosystem specific approaches are clearly more favorable than others.

- Naturally there are going to be limitations with regard to licensed IPs, patents, existing architecture etc.

But engines have matured (and reduced in number), rendering philosophies have standardized, consoles are basically x86 computers. Makes sense that there are only so many ways to skin a cat and one of those ways is going to be the most efficient regardless of the knife you're using.
 

Saylick

Diamond Member
Sep 10, 2012
3,125
6,296
136
If AMD and Nvidia continue converging, it seems like the differentiator(s) will be:
1) Software, along with any dedicated hardware required to either make that software work or to accelerate it enough such that it is viable, e.g. using tensor cores for DLSS.
2) Advanced packaging, which potentially brings cost savings to the consumer.

I can see Nvidia making a BIG divergence if we cross over the tipping point where there are more AAA games that use real time RT as the primary shading algorithm (over pre-baked GI) than not. Then we might see GPUs with more RT performance than traditional raster performance.

Similarly, AMD has a lead on the advanced packaging side of things. RDNA 3 will be their first foray using it in the consumer GPU space, and RDNA 4 appears to be a refined RDNA 3 but with more packaging techniques to fit even more silicon under the hood.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I can see Nvidia making a BIG divergence if we cross over the tipping point where there are more AAA games that use real time RT as the primary shading algorithm (over pre-baked GI) than not. Then we might see GPUs with more RT performance than traditional raster performance.
I honestly have a hard time imagining this happening anytime soon. AAA games will keep targeting consoles. The current console gen is capable of RT, but not in a way developers can rely on solely. The big question is how much effort those developers put into PC-only, RT-focused optimizations. With engines like Unreal Engine 5 (Lumen) going the software-optimized route instead, I'm expecting hardware RT game development to happen mostly with PC-only/first developers.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
UNIFORM CACHE SYSTEM FOR FAST DATA ACCESS

Someone on Twitter found this patent some time ago; only now, looking again at @Kepler_L2's diagram of RDNA3, I think I understand it...
Each SIMD will have its own L0 cache; on a miss, it will look at the other L0s inside the WGP. Since the SIMDs work in pairs, there are 6 L0s to search
It looks like this patent did not originate from the core graphics architecture team in US/CA.
I think it is a bit tough to implement because sharing the L0 across so many CUs needs many crossbars for read ports; I don't think the RDNA architecture is fully NoC at the CU level.
Very unlikely to happen due to area and power costs, and having too many clients on the L0 hurts latency.
AMD went with a 4-level cache (L0, L1, L2, LLC) which works out well, and I think they will stick to it.
I think RDNA2 is more effective with a slightly smaller SA than the top config, due to fewer clients at the GL1.
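A back-of-the-envelope way to see the crossbar concern: a full crossbar wiring every requester to every L0 bank needs roughly clients x banks crosspoints, so sharing grows the wiring multiplicatively. The SIMD and bank counts below are made up for illustration, not real RDNA figures.

```python
# Illustrative crosspoint count for private vs. WGP-shared L0 caches.
# Numbers are hypothetical, chosen only to show the scaling.

def crosspoints(clients, banks):
    """A full crossbar needs one crosspoint per client/bank pair."""
    return clients * banks

# Private L0: each of 6 SIMDs talks only to its own 4 banks.
private = sum(crosspoints(1, 4) for _ in range(6))
# Shared L0: every SIMD can probe every bank in the WGP.
shared = crosspoints(6, 6 * 4)

print(private, shared)  # 24 vs 144 crosspoints
```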

So they finally implemented this, eh?

View attachment 63688
RDNA1 implemented part of the above concept with a per-SA GL1, which helps bandwidth pressure greatly.
Of course the L0 is per-CU only, but the CUs have a unified LDS which they can both use.
There are some quirks, but check out this video from Lou Kramer

 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
I think AMD intends to leave that area to their/Xilinx's AI Engine.
This WMMA is different: it is built into the SIMD, not offloaded to a sub-unit like a tensor core.
It means you can run normal logic using these new ops as part of the same shader wavefront, but with massive throughput.
I think they might have something planned for this. An FSR 2.0 update launching with RDNA3? Especially with that BF16/FP16 number format.

Update:
The upcoming ROCm 5.2 release describes the above statement well. They did say "specialized GPU matrix core", but in the RDNA patch it is just another SIMD V_XXX op (e.g. v_wmma_f16_16x16x16_f16, using the same VGPR banks), meaning it is running (or at least transparently appears to be running) in the SIMD as a wavefront.

New rocWMMA for Matrix Multiplication and Accumulation Operations Acceleration
This release introduces a new ROCm C++ library for accelerating mixed precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
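For reference, the math a single 16x16x16 WMMA op computes is just a tiled multiply-accumulate, D = A @ B + C, on 16x16 fragments. The pure-Python sketch below models only that math, not the actual VGPR layout or wavefront distribution, which the ISA docs would have to confirm.

```python
# Reference model of a 16x16x16 matrix multiply-accumulate,
# D[i][j] = sum_k A[i][k] * B[k][j] + C[i][j], on 16x16 tiles.
# This models only the arithmetic of an op like v_wmma_f16_16x16x16_f16,
# not its register layout.

def wmma_16x16x16(A, B, C):
    n = 16
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

# Identity times identity plus zeros gives back the identity.
I = [[1.0 if i == j else 0.0 for j in range(16)] for i in range(16)]
Z = [[0.0] * 16 for _ in range(16)]
D = wmma_16x16x16(I, I, Z)
print(D[0][0], D[0][1])  # 1.0 0.0
```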
 
Last edited:

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
It looks like this patent did not originate from the core graphics architecture team in US/CA.
I think it is a bit tough to implement because sharing the L0 across so many CUs needs many crossbars for read ports; I don't think the RDNA architecture is fully NoC at the CU level.
Very unlikely to happen due to area and power costs, and having too many clients on the L0 hurts latency.
Keep in mind, from the other patents: dual vector and super-SIMD both rely on auxiliary caches near the register files.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
It looks like this patent did not originate from the core graphics architecture team in US/CA.
I think it is a bit tough to implement because sharing the L0 across so many CUs needs many crossbars for read ports; I don't think the RDNA architecture is fully NoC at the CU level.
Very unlikely to happen due to area and power costs, and having too many clients on the L0 hurts latency.
AMD went with a 4-level cache (L0, L1, L2, LLC) which works out well, and I think they will stick to it.
I think RDNA2 is more effective with a slightly smaller SA than the top config, due to fewer clients at the GL1.


RDNA1 implemented part of the above concept with a per-SA GL1, which helps bandwidth pressure greatly.
View attachment 63700
Of course the L0 is per-CU only, but the CUs have a unified LDS which they can both use.
There are some quirks, but check out this video from Lou Kramer

It does sound like IBM's shared cache system, which they apparently got working, but if I'm remembering right that's just for L2s? Something like that, and on a CPU. IBM did mention latency penalties but apparently it was still worth implementing, though for such a low level GPU cache you might be right.

I haven't a clue what you mean.

Sorry, two of those compute dies on the same package/GPU. 3 GPUs limited to one compute die each isn't enough to cover your entire price range. If you can optionally use 1 or 2 compute dies per GPU then suddenly you can cover a lot more price points.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
Sorry, two of those compute dies on the same package/GPU. 3 GPUs limited to one compute die each isn't enough to cover your entire price range. If you can optionally use 1 or 2 compute dies per GPU then suddenly you can cover a lot more price points.
The mainstream thinking now is 2 compute die designs: the 1st for full & cut N31, then a 2nd smaller compute die for N32, also giving 2 more SKUs. The cache, memory controllers and IO sit on multiple other dies & are used for both N31 & N32. Monolithic N33 follows.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
The mainstream thinking now is 2 compute die designs. 1st for full & cut N31, then a 2nd smaller compute die for N32, also giving 2 more SKUs. The cache, memory controllers and IO on multiple other die & used for both N31 & N32. Monolithic N33 follows.

Again, that's not nearly enough to cover your price points. The cut dies for the smaller version will be almost nonexistent as yields will be too high, and cuts are traditionally about 20% lower performance, which isn't much of a price difference. You can't do a $1,600 and a $1,200 GPU, then drop all the way down to a $500 and a $400 GPU, then drop all the way to $200 again, for example. You're just leaving a $700 gap and a $200 gap (that $200 gap is double the price! the $700 one is more than double) where you're not even trying to compete with your competition. Potential money is just being given away because you didn't make enough SKUs.

Now if you can pair up compute dies, it looks a lot different. You can do $200, $300, $450, $600, $850, $1,200, and $1,600, and you've got options across the range without paying much extra in design costs. To me this alone heavily favors the idea of being able to choose one or two compute dies per GPU.
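The gap argument above can be checked with a couple of lines of arithmetic. Both lineups below use the hypothetical prices from the post, nothing more.

```python
# Largest price gap in a single-die lineup vs. a 1-or-2-die lineup.
# All prices are the hypothetical examples from the discussion.

one_die_lineup = [1600, 1200, 500, 400, 200]
two_die_lineup = [200, 300, 450, 600, 850, 1200, 1600]

def largest_gap(prices):
    """Biggest dollar jump between adjacent price points."""
    p = sorted(prices)
    return max(b - a for a, b in zip(p, p[1:]))

print(largest_gap(one_die_lineup))  # 700
print(largest_gap(two_die_lineup))  # 400
```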
 

jpiniero

Lifer
Oct 1, 2010
14,584
5,206
136
Again, that's not nearly enough to cover your price points. The cut dies for the smaller version will be almost nonexistent as yields will be too high, and cuts are traditionally about 20% lower performance, which isn't much of a price difference. You can't do a $1,600 and a $1,200 GPU, then drop all the way down to a $500 and a $400 GPU

N33 is probably $600-$700, assuming it's somewhere between the 4070 and 4080 in performance.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
You really are in the bubble. By the time these release, our economies are going to look a lot different. What these companies want & what they'll get are going to be very different.
Nvidia will try to keep prices high since it wants to be seen as a premium brand, and nothing says premium like outrageous prices. If that succeeds AMD will happily play along.
 

DiogoDX

Senior member
Oct 11, 2012
746
277
136
N33 should come to replace N23 (6600 XT) and have a max $400 MSRP, since it's rumored to use the cheaper 6nm process. Maybe if AMD launches first they can get away with $500 and market it as a "6900 XT for half the price".
 
Last edited: