Question Speculation: RDNA3 + CDNA2 Architectures Thread

Page 42 - AnandTech Forums

uzzi38

Platinum Member
Oct 16, 2019
2,622
5,880
146

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
UNIFORM CACHE SYSTEM FOR FAST DATA ACCESS

Someone on Twitter found this patent some time ago; only now, looking again at @Kepler_L2's diagram of RDNA3, I think I understand it...
Each SIMD will have its own L0 cache; on a miss, it will look at the other L0s inside the WGP. Since the SIMDs work in pairs, there are 6 L0s to search
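To make the lookup order concrete, here is a rough Python sketch of how I read the patent: each SIMD checks its own L0 first, then probes the peer L0s inside the same WGP, and only then falls through to the next cache level. All the names and numbers here are illustrative guesses, not confirmed RDNA3 details.

```python
# Hypothetical model of the patent's lookup order: local L0 first,
# then peer L0s in the WGP, then miss to the next cache level.
# SIMD count and addresses are made up for illustration.

class SIMD:
    def __init__(self, name):
        self.name = name
        self.l0 = {}  # address -> data

def wgp_lookup(simds, requester, addr):
    """Return (data, where_found) for a load issued by `requester`."""
    if addr in requester.l0:                 # local L0 hit
        return requester.l0[addr], requester.name
    for peer in simds:                       # probe the peer L0s in the WGP
        if peer is not requester and addr in peer.l0:
            return peer.l0[addr], peer.name
    return None, "miss -> next level"        # all L0s missed

simds = [SIMD(f"SIMD{i}") for i in range(6)]
simds[4].l0[0x100] = "texel"
print(wgp_lookup(simds, simds[0], 0x100))    # found in a peer L0
print(wgp_lookup(simds, simds[0], 0x200))    # miss, go to GL1/L2
```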
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,125
6,296
136

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I find it very interesting that both AMD and Nvidia are converging on similar architectural techniques.
Shouldn't be too surprising. Both players move within the same ecosystem (OS, graphics abstraction layers, games), for which they continuously have to optimize their hardware and drivers. The biggest divergence naturally is in new tech, e.g. ray tracing. Though the convergence does seem to be speeding up.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136

I guess once the gains from a new process diminish and new processes take longer to become viable, the architecture becomes much more important for both performance and efficiency. It makes sense that there really is a best design and they both are converging toward it.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
6,783
7,115
136
I doubt this is converging to "the" best design imaginable, but within the already existing ecosystem specific approaches are clearly more favorable than others.

- Naturally there are going to be limitations with regard to licensed IPs, patents, existing architecture etc.

But engines have matured (and reduced in number), rendering philosophies have standardized, consoles are basically x86 computers. Makes sense that there are only so many ways to skin a cat and one of those ways is going to be the most efficient regardless of the knife you're using.
 

Saylick

Diamond Member
Sep 10, 2012
3,125
6,296
136
If AMD and Nvidia continue converging, it seems like the differentiator(s) will be:
1) Software, along with any dedicated hardware required to either make that software work or to accelerate it enough such that it is viable, e.g. using tensor cores for DLSS.
2) Advanced packaging, which potentially brings cost savings to the consumer.

I can see Nvidia making a BIG divergence if we cross over the tipping point where there are more AAA games that use real time RT as the primary shading algorithm (over pre-baked GI) than not. Then we might see GPUs with more RT performance than traditional raster performance.

Similarly, AMD has a lead on the advanced packaging side of things. RDNA 3 will be their first foray using it in the consumer GPU space, and RDNA 4 appears to be a refined RDNA 3 but with more packaging techniques to fit even more silicon under the hood.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I can see Nvidia making a BIG divergence if we cross over the tipping point where there are more AAA games that use real time RT as the primary shading algorithm (over pre-baked GI) than not. Then we might see GPUs with more RT performance than traditional raster performance.
I honestly have a hard time imagining this happening anytime soon. AAA games will keep targeting consoles. The current console gen is capable of RT, but not in a way developers can rely on solely. The big question is how much effort those developers put into PC-only, RT-focused optimizations. With engines like Unreal Engine 5 (Lumen) going the software-optimized route instead, I'm expecting hardware RT game development to happen mostly with PC-only/first developers.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
UNIFORM CACHE SYSTEM FOR FAST DATA ACCESS

Someone on Twitter found this patent some time ago; only now, looking again at @Kepler_L2's diagram of RDNA3, I think I understand it...
Each SIMD will have its own L0 cache; on a miss, it will look at the other L0s inside the WGP. Since the SIMDs work in pairs, there are 6 L0s to search
It looks like this patent did not originate from the core graphics architecture team in US/CA.
I think it is a bit tough to implement because sharing the L0 across so many CUs needs many crossbars for read ports; I don't think the RDNA architecture is fully NoC at the CU level.
Very unlikely to happen due to area and power costs, and having too many clients on the L0 hurts latency.
AMD went with a 4-level cache (L0, L1, L2, LLC) which works out well, and I think they will stick to it.
I think RDNA2 is more effective with a slightly smaller SA than the top config, due to fewer clients at the GL1.
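A back-of-the-envelope way to see the crossbar concern: a full crossbar wiring every requester to every L0 bank needs roughly clients x banks crosspoints, so sharing grows the wiring multiplicatively. The SIMD and bank counts below are made up for illustration, not real RDNA figures.

```python
# Illustrative crosspoint count for private vs. WGP-shared L0 caches.
# Numbers are hypothetical, chosen only to show the scaling.

def crosspoints(clients, banks):
    """A full crossbar needs one crosspoint per client/bank pair."""
    return clients * banks

# Private L0: each of 6 SIMDs talks only to its own 4 banks.
private = sum(crosspoints(1, 4) for _ in range(6))
# Shared L0: every SIMD can probe every bank in the WGP.
shared = crosspoints(6, 6 * 4)

print(private, shared)  # 24 vs 144 crosspoints
```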

So they finally implemented this, eh?

View attachment 63688
RDNA1 implemented part of the above concept with a per-SA GL1, which helps bandwidth pressure greatly.
Of course the L0 is per-CU only, but the CUs have a unified LDS which they can both use.
There are some quirks, but check out this video from Lou Kramer

 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
I think AMD intends to leave that area to their/Xilinx's AI Engine.
This WMMA is different: it is built into the SIMD, not offloaded to a sub-unit like a tensor core.
It means you can run normal logic using these new ops as part of the same shader wavefront, but with massive throughput.
I think they might have something planned for this. An FSR 2.0 update launching with RDNA3? Especially with that BF16/FP16 number format.

Update:
The upcoming ROCm 5.2 release describes the above statement well. They did say "specialized GPU matrix core", but in the RDNA patch it is just another SIMD V_XXX op (e.g. v_wmma_f16_16x16x16_f16, using the same VGPR banks), meaning it is running (or at least transparently appears to be running) in the SIMD as a wavefront.

New rocWMMA for Matrix Multiplication and Accumulation Operations Acceleration
This release introduces a new ROCm C++ library for accelerating mixed precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
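For reference, the math a single 16x16x16 WMMA op computes is just a tiled multiply-accumulate, D = A @ B + C, on 16x16 fragments. The pure-Python sketch below models only that math, not the actual VGPR layout or wavefront distribution, which the ISA docs would have to confirm.

```python
# Reference model of a 16x16x16 matrix multiply-accumulate,
# D[i][j] = sum_k A[i][k] * B[k][j] + C[i][j], on 16x16 tiles.
# This models only the arithmetic of an op like v_wmma_f16_16x16x16_f16,
# not its register layout.

def wmma_16x16x16(A, B, C):
    n = 16
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

# Identity times identity plus zeros gives back the identity.
I = [[1.0 if i == j else 0.0 for j in range(16)] for i in range(16)]
Z = [[0.0] * 16 for _ in range(16)]
D = wmma_16x16x16(I, I, Z)
print(D[0][0], D[0][1])  # 1.0 0.0
```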
 
Last edited:

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
It looks like this patent did not originate from the core graphics architecture team in US/CA.
I think it is a bit tough to implement because sharing the L0 across so many CUs needs many crossbars for read ports; I don't think the RDNA architecture is fully NoC at the CU level.
Very unlikely to happen due to area and power costs, and having too many clients on the L0 hurts latency.
Keep in mind, from the other patents: dual vector and super-SIMD both rely on auxiliary caches near the register files.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
It looks like this patent did not originate from the core graphics architecture team in US/CA.
I think it is a bit tough to implement because sharing the L0 across so many CUs needs many crossbars for read ports; I don't think the RDNA architecture is fully NoC at the CU level.
Very unlikely to happen due to area and power costs, and having too many clients on the L0 hurts latency.
AMD went with a 4-level cache (L0, L1, L2, LLC) which works out well, and I think they will stick to it.
I think RDNA2 is more effective with a slightly smaller SA than the top config, due to fewer clients at the GL1.


RDNA1 implemented part of the above concept with a per-SA GL1, which helps bandwidth pressure greatly.
View attachment 63700
Of course the L0 is per-CU only, but the CUs have a unified LDS which they can both use.
There are some quirks, but check out this video from Lou Kramer

It does sound like IBM's shared cache system, which they apparently got working, but if I'm remembering right that's just for L2s? Something like that, and on a CPU. IBM did mention latency penalties but apparently it was still worth implementing, though for such a low level GPU cache you might be right.

I haven't a clue what you mean.

Sorry, two of those compute dies on the same package/GPU. 3 GPUs limited to one compute die each isn't enough to cover your entire price range. If you can optionally use 1 or 2 compute dies per GPU then suddenly you can cover a lot more price points.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
Sorry, two of those compute dies on the same package/GPU. 3 GPUs limited to one compute die each isn't enough to cover your entire price range. If you can optionally use 1 or 2 compute dies per GPU then suddenly you can cover a lot more price points.
The mainstream thinking now is 2 compute die designs: the 1st for full & cut N31, then a 2nd smaller compute die for N32, also giving 2 more SKUs. The cache, memory controllers and IO sit on multiple other dies & are used for both N31 & N32. Monolithic N33 follows.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
The mainstream thinking now is 2 compute die designs. 1st for full & cut N31, then a 2nd smaller compute die for N32, also giving 2 more SKUs. The cache, memory controllers and IO on multiple other die & used for both N31 & N32. Monolithic N33 follows.

Again, that's not nearly enough to cover your price points. The cut dies for the smaller version will be almost nonexistent as yields will be too high, and cuts are traditionally about 20% lower performance, which isn't much of a price difference. You can't do a $1,600 and a $1,200 GPU, then drop all the way down to a $500 and a $400 GPU, then drop all the way to $200 again, for example. You're just leaving a $700 gap and a $200 gap (that $200 gap is double the price! the $700 one is more than double) where you're not even trying to compete with your competition. Potential money is just being given away because you didn't make enough SKUs.

Now if you can pair up compute dies, it looks a lot different. You can do $200, $300, $450, $600, $850, $1,200, and $1,600, and you've got options across the range without paying much extra in design costs. To me this alone heavily favors the idea of being able to choose one or two compute dies per GPU.
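The gap argument above can be checked with a couple of lines of arithmetic. Both lineups below use the hypothetical prices from the post, nothing more.

```python
# Largest price gap in a single-die lineup vs. a 1-or-2-die lineup.
# All prices are the hypothetical examples from the discussion.

one_die_lineup = [1600, 1200, 500, 400, 200]
two_die_lineup = [200, 300, 450, 600, 850, 1200, 1600]

def largest_gap(prices):
    """Biggest dollar jump between adjacent price points."""
    p = sorted(prices)
    return max(b - a for a, b in zip(p, p[1:]))

print(largest_gap(one_die_lineup))  # 700
print(largest_gap(two_die_lineup))  # 400
```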
 

jpiniero

Lifer
Oct 1, 2010
14,584
5,206
136
Again, that's not nearly enough to cover your price points. The cut dies for the smaller version will be almost nonexistent as yields will be too high, and cuts are traditionally about 20% lower performance, which isn't much of a price difference. You can't do a $1,600 and a $1,200 GPU, then drop all the way down to a $500 and a $400 GPU

N33 is probably $600-$700, assuming it's somewhere between the 4070 and 4080 in performance.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
You really are in the bubble. By the time these release, our economies are going to look a lot different. What these companies want & what they'll get are going to be very different.
Nvidia will try to keep prices high since it wants to be seen as a premium brand, and nothing says premium like outrageous prices. If that succeeds AMD will happily play along.
 

DiogoDX

Senior member
Oct 11, 2012
746
277
136
N33 should come to replace N23 (6600 XT) and have a max $400 MSRP, since it's rumored to use the cheaper 6nm process. Maybe if AMD launches first they can get away with $500 and market it as a "6900 XT for half the price".
 
Last edited: