Question Speculation: RDNA3 + CDNA2 Architectures Thread


uzzi38

Platinum Member
Oct 16, 2019
2,746
6,655
146

Timorous

Golden Member
Oct 27, 2008
1,977
3,861
136
They already have an N33 part for 128-bit performance, and without cache such an arrangement would be quite inefficient. I simply don't see such a market, especially for parts made on a very expensive process. Lower-end parts, for sure, but for making those you'll want to use N6.

I am thinking science and pure compute workloads rather than a discrete GPU but I don't know if there are any that don't really care about bandwidth.

In my edit I also had the thought that perhaps the cache dies are cross-compatible with Zen 3, Zen 4 and RDNA 3, which means you can't have the MC on there, so putting it on the GCD allows for flexibility on the cache side.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
Science and pure compute are what CDNA2 parts are for (and CDNA2 has massive bandwidth with HBM2E stacks). RDNA parts are pure gaming/rasterization cards. Quite probably the cache dies are not compatible, as GPU and CPU will have different form factors; moreover, the GPU cache dies must have interconnects for both dies, while CPU stacks are linked to a single die (and this, looking at recent Zen 4 leaks, will not change with the next generation). Let's add that it's quite probable that, while CPU stacks are going to be mounted on top, GPU cache stacks will be on the bottom of the GCDs, so they are quite likely to be different.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,693
5,230
136
Correct. At worst AMD will likely achieve around ~80 MTr/mm2 and at best ~90 MTr/mm2.

RDNA2 was 26.8B Xtors @ 520 mm2 (51.54 MTr/mm2). If they achieve Apple's density gain of 1.5x going from N7 to N5, you get 77 MTr/mm2 (~80). The 1.5x is a blended number accounting for cache, IO, logic, etc. We know that N31 chiplets will likely strip out the Infinity Cache so it's possible the density increase skews towards the 1.8x TSMC claims for logic scaling, in which case you end up with the upper limit of 92.78 MTr/mm2.

Just scaling down N21 from N7 to N5, assuming 1.5x scaling, should result in a die size of ~350 mm2. If you strip out the Infinity Cache from N21 (~80 mm2), the remainder is ~440 mm2, which should be mostly logic and IO. Scale that down by 1.8x and you're left with ~244 mm2. Scale that back up by 1.5x to account for the 50% more CUs that it's rumored to have and you're back in the ~350 mm2 range.

Would the 350 mm2 be before splitting the die into 2 chiplets, resulting in ~2 x 175 mm2?
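For anyone who wants to rerun the scaling estimate quoted above, here is a minimal Python sketch. It only uses the figures from the quoted post (26.8B transistors, 520 mm2, ~80 mm2 of Infinity Cache, 1.5x blended and 1.8x logic scaling, 50% more CUs). The function and key names are made up for illustration, and the even two-way split at the end is an assumption taken from the question, not a known N31 layout.

```python
# Back-of-the-envelope N21 -> N5 scaling, using only figures quoted in the thread.
N21_TRANSISTORS = 26.8e9      # quoted transistor count for N21
N21_AREA_MM2 = 520.0          # quoted N21 die area
INFINITY_CACHE_MM2 = 80.0     # quoted approximate Infinity Cache area on N21
BLENDED_SCALING = 1.5         # quoted blended N7 -> N5 density gain
LOGIC_SCALING = 1.8           # quoted TSMC logic-only density gain
CU_GROWTH = 1.5               # rumored 50% more CUs (per the quoted post)


def estimate_n31():
    """Return the intermediate numbers of the estimate as a dict (hypothetical helper)."""
    n21_density = N21_TRANSISTORS / N21_AREA_MM2 / 1e6       # MTr/mm2
    logic_io_area = N21_AREA_MM2 - INFINITY_CACHE_MM2         # area left without the cache
    shrunk_logic = logic_io_area / LOGIC_SCALING              # logic/IO shrunk to N5
    scaled_area = shrunk_logic * CU_GROWTH                    # grown back for extra CUs
    return {
        "n21_density": n21_density,
        "density_low": n21_density * BLENDED_SCALING,
        "density_high": n21_density * LOGIC_SCALING,
        "plain_shrink_area": N21_AREA_MM2 / BLENDED_SCALING,
        "shrunk_logic_area": shrunk_logic,
        "scaled_area": scaled_area,
        # Assumption: an even split across two chiplets, as asked in the post.
        "per_chiplet_area": scaled_area / 2,
    }


if __name__ == "__main__":
    for name, value in estimate_n31().items():
        print(f"{name}: {value:.1f}")
```

Run as written, the scaled area comes out slightly above the rounded ~350 mm2 in the quote, so an even split lands a little over 175 mm2 per chiplet.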
 

Joe NYC

Diamond Member
Jun 26, 2021
3,693
5,230
136
It makes much more sense to tie the memory controllers to the cache die(s) than to the GCDs, especially when your memory hierarchy goes to L3 first and VRAM later, and with the memory interfaces being the parts that scale worst, it is really wasteful to put them on a costly N5 die.

In theory, but if the process for L3 is optimized for SRAM density and size, adding the analog may not be such a good idea.

Maybe next level of chiplet partitioning for RDNA4
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
In theory, but if the process for L3 is optimized for SRAM density and size, adding the analog may not be such a good idea.

Maybe next level of chiplet partitioning for RDNA4

In a monolithic die you already have both, so it really should not matter. And idk if cache in a "cache only" die could really be made denser (and why, as N6 is quite a bit less expensive than N5).
 

Joe NYC

Diamond Member
Jun 26, 2021
3,693
5,230
136
In a monolithic die you already have both, so it really should not matter. And idk if cache in a "cache only" die could really be made denser (and why, as N6 is quite a bit less expensive than N5).

It was mentioned that SRAM in V-Cache has nearly 2x density compared to SRAM in base CCD die.
 

Saylick

Diamond Member
Sep 10, 2012
4,055
9,480
136
Oh, I see, so the estimate for Navi31, as far as resources is 2 x Navi21 * 1.5 ?
Yes, current rumors are that there's 3x the execution units in total for N31 vs. N21. Split across two compute chiplet dies, each die would need to have 1.5x the units.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
It was mentioned that SRAM in V-Cache has nearly 2x density compared to SRAM in base CCD die.

That is not really true, people forget that in the CCX there is also all the control and routing logic, i.e. looking at the die shot here


You can see that the actual SRAM area is almost half the size of the supposed "cache" area, and quite probably the stacked cache die has much less of that additional logic. In any case, as I said, it would be quite strange to add memory interfaces to the chiplet because of the increased costs, and also because this would make memory access less uniform (imagine a workgroup in one chiplet needing to access the memory linked to the other chiplet; this would add unnecessary and unpredictable latency). It makes much more sense to keep the memory hierarchy more consistent, and this can be achieved only by having the memory interfaces linked to the L3 cache.
 

Attachment: 1629397021747.png (die shot)

Joe NYC

Diamond Member
Jun 26, 2021
3,693
5,230
136
That is not really true, people forget that in the CCX there is also all the control and routing logic, i.e. looking at the die shot here


You can see that the actual SRAM area is almost half the size of the supposed "cache" area, and quite probably the stacked cache die has much less of that additional logic. In any case, as I said, it would be quite strange to add memory interfaces to the chiplet because of the increased costs, and also because this would make memory access less uniform (imagine a workgroup in one chiplet needing to access the memory linked to the other chiplet; this would add unnecessary and unpredictable latency). It makes much more sense to keep the memory hierarchy more consistent, and this can be achieved only by having the memory interfaces linked to the L3 cache.

2x density was an approximation / exaggeration on my part, but AMD did say that the V-Cache SRAM is higher density than the SRAM on the CCD die.

If the MCD is a stacked active silicon bridge, the desire was likely to make its structure very simple: no bumps, nothing routed through the substrate, all access and power through TSVs. (Just a guess.)
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
2x density was an approximation / exaggeration on my part, but AMD did say that the V-Cache SRAM is higher density than the SRAM on the CCD die.

If the MCD is a stacked active silicon bridge, the desire was likely to make its structure very simple: no bumps, nothing routed through the substrate, all access and power through TSVs. (Just a guess.)

Yes, I am not saying the cache die is more dense in terms of transistors/mm^2, but, as I said, it is a given because the cache die is missing a lot of the routing that goes through the CCX itself and that is needed for the correct operation of the CPU (and this lowers the density in terms of Mbytes/mm^2). Even in N21, you have a total SRAM area of around 80 mm^2 for 128 Mbytes (around 80 Mxtor/mm^2), but total chip density is 51.3 Mxtor/mm^2, and not only because of the memory interfaces but also because of the interconnections. But when you put the memory interface on the GCD, you also need to put a lot of that on the GCD itself, and the GCD is 5nm, and its silicon costs at least 50% more than silicon on an N6 process. Also, you need some routing on the cache die itself, as in this case you need to route data between the two GCDs and not only between on-die and off-die cache. And, finally, the question of memory access uniformity is, to me, quite important for graphics workload balancing. But of course I may be wrong, as other people may be wrong as well, because there is no definite leak about where the memory bus is located. By the way, if you stack the GCDs on the cache, there is literally no need to route anything through the substrate regardless of where the memory bus is: all the communication would go through the TSVs and the internal routing of the cache die. The "bumps" for connecting to the package would also be in other zones of the die, even on the opposite side from the TSVs, as in the Zen 3 die.
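As a quick sanity check on the "around 80 Mxtor/mm^2" figure for the N21 cache, here is a small Python sketch. It assumes a plain 6-transistor SRAM cell and counts only the data bits (no tags, ECC, or control logic), which is a simplification and not something stated in the post. The function name is hypothetical.

```python
# Rough SRAM density check for the N21 Infinity Cache figures given above.
BITS_PER_BYTE = 8
TRANSISTORS_PER_BIT = 6  # assumption: classic 6T SRAM cell, data bits only


def sram_density_mxtor_per_mm2(capacity_mib, area_mm2,
                               transistors_per_bit=TRANSISTORS_PER_BIT):
    """Millions of transistors per mm^2 for an SRAM array of the given size."""
    bits = capacity_mib * 1024 * 1024 * BITS_PER_BYTE
    return bits * transistors_per_bit / area_mm2 / 1e6


if __name__ == "__main__":
    # 128 Mbytes in roughly 80 mm^2, as stated in the post.
    print(f"{sram_density_mxtor_per_mm2(128, 80):.1f} Mxtor/mm^2")
    print(f"{128 / 80:.2f} Mbytes/mm^2")
```

Under those assumptions, 128 Mbytes in 80 mm^2 works out to about 80.5 Mxtor/mm^2, which matches the number in the post.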
 

Joe NYC

Diamond Member
Jun 26, 2021
3,693
5,230
136
Yes, I am not saying the cache die is more dense in terms of transistors/mm^2, but, as I said, it is a given because the cache die is missing a lot of the routing that goes through the CCX itself and that is needed for the correct operation of the CPU (and this lowers the density in terms of Mbytes/mm^2). Even in N21, you have a total SRAM area of around 80 mm^2 for 128 Mbytes (around 80 Mxtor/mm^2), but total chip density is 51.3 Mxtor/mm^2, and not only because of the memory interfaces but also because of the interconnections. But when you put the memory interface on the GCD, you also need to put a lot of that on the GCD itself, and the GCD is 5nm, and its silicon costs at least 50% more than silicon on an N6 process. Also, you need some routing on the cache die itself, as in this case you need to route data between the two GCDs and not only between on-die and off-die cache. And, finally, the question of memory access uniformity is, to me, quite important for graphics workload balancing. But of course I may be wrong, as other people may be wrong as well, because there is no definite leak about where the memory bus is located. By the way, if you stack the GCDs on the cache, there is literally no need to route anything through the substrate regardless of where the memory bus is: all the communication would go through the TSVs and the internal routing of the cache die. The "bumps" for connecting to the package would also be in other zones of the die, even on the opposite side from the TSVs, as in the Zen 3 die.

I think the concern about non-uniformity of memory accesses should be minimized compared to, say, Zen 1, because the link between GCDs should have very low latency and high bandwidth.

Separating memory controllers would probably need a 3rd die, which would be quite small and sparse, with a lot of bumps, so perhaps negating the advantage of making the GCD smaller. It would also need a very fast connection (stacked again?)

As far as combining the MCD with memory controllers, we don't know if the MCD is one die, several dies stacked, or several dies in parallel (multiple connections between MCDs); I have seen that interpretation as well. So while it seems desirable, we don't know the trade-offs or loss of flexibility.
 

Mopetar

Diamond Member
Jan 31, 2011
8,496
7,753
136
That is the total bus size, not the per-die bus size. As in, the $1999 RX 7950 XT will have the same memory bandwidth and mining performance as the $399 RTX 3060 Ti.

Unless there's some other magic trick there, that seems insufficient to feed the alleged monster we're told that the next top-end AMD card is supposed to be.

The only alternative I can think of is that the cache is so massive across all of those dies that it winds up offsetting the smaller bus.
 

Kepler_L2

Golden Member
Sep 6, 2020
1,005
4,298
136
Unless there's some other magic trick there, that seems insufficient to feed the alleged monster we're told that the next top-end AMD card is supposed to be.

The only alternative I can think of is that the cache is so massive across all of those dies that it winds up offsetting the smaller bus.
512MB IC + improvements in DCC are enough ;)
 

Saylick

Diamond Member
Sep 10, 2012
4,055
9,480
136
As long as they improve the hit rate and make the cache sufficiently large, they don’t need a big, fast, expensive, power-hungry bus.
Going off of this slide from HC33, doubling the size of the Infinity Cache should improve the hit rate for 4K gaming to ~80%. At 1080p and 1440p, it looks like increasing the size of the LLC won't help the hit rate all that much.

HC33-AMD-RDNA-2-Bandwidth-Problem.jpg
 

Joe NYC

Diamond Member
Jun 26, 2021
3,693
5,230
136
Going off of this slide from HC33, doubling the size of the Infinity Cache should improve the hit rate for 4K gaming to ~80%. At 1080p and 1440p, it looks like increasing the size of the LLC won't help the hit rate all that much.

HC33-AMD-RDNA-2-Bandwidth-Problem.jpg

With a > $1,000 graphics card (with 512 MB Infinity Cache), from 2022 onwards, users will probably start to gravitate to 4K monitors.

At 1440p, a higher hit rate means less power consumption from memory, which then means the GPU has more of the power budget.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
Going off of this slide from HC33, doubling the size of the Infinity Cache should improve the hit rate for 4K gaming to ~80%. At 1080p and 1440p, it looks like increasing the size of the LLC won't help the hit rate all that much.

HC33-AMD-RDNA-2-Bandwidth-Problem.jpg

That graph stops at 140MB though, while as you say the rumored size for N31 is 512MB. It is likely that at that size the hit rate is near or even above 90% for 1440p or similar, with related bandwidth amplification effects. Well, AMD's choice seems to be quite effective in the short term: GDDR6X is mostly an Nvidia exclusive (moreover it requires different signalling, and it seems quite power hungry), so, excluding HBM because of the costs, the best memory available would have been GDDR6. Even pushing it into the 18-20Gbps range with a 384-bit bus would not have been enough, while with IC, N31/N32 will probably have an amplified bandwidth in the 2TB/s range (AMD already declares more than 1TB/s of "delivered" bandwidth for N21). Moreover, the more the VRAM frequency increases, the more power consumption in the bus increases. Probably in the future HBM will become more viable and we will see some consumer designs based on it. But for the moment it seems this IC solution is a fair compromise.
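To put rough numbers on the bandwidth argument, here is a minimal Python sketch. The raw GDDR6 figure is just bus width times per-pin data rate. The "amplified" figure uses a deliberately simple toy model in which only cache misses reach VRAM and the cache itself is never the bottleneck. That model is an assumption for illustration, not AMD's definition of "delivered" bandwidth, and the function names are hypothetical. The 256-bit, 16 Gbps configuration for N21 is the shipping RX 6900 XT spec rather than something stated in this post.

```python
# Raw GDDR bandwidth plus a toy "cache amplification" model.


def raw_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak VRAM bandwidth in GB/s: one data pin per bus bit, 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


def amplified_bandwidth_gbs(raw_gbs, hit_rate):
    """Toy model (assumption): only misses go to VRAM, so VRAM traffic is
    (1 - hit_rate) of the total the GPU can consume."""
    if not 0 <= hit_rate < 1:
        raise ValueError("hit_rate must be in [0, 1)")
    return raw_gbs / (1 - hit_rate)


def hit_rate_needed(raw_gbs, target_gbs):
    """Hit rate the toy model needs to reach target_gbs from raw_gbs."""
    return max(0.0, 1 - raw_gbs / target_gbs)


if __name__ == "__main__":
    print(raw_bandwidth_gbs(384, 18))   # 384-bit bus at 18 Gbps
    print(raw_bandwidth_gbs(384, 20))   # 384-bit bus at 20 Gbps
    n21_raw = raw_bandwidth_gbs(256, 16)
    print(amplified_bandwidth_gbs(n21_raw, 0.5))
    print(hit_rate_needed(n21_raw, 2000))
```

In this model a 384-bit GDDR6 bus tops out below 1 TB/s even at 20 Gbps, while a 256-bit, 16 Gbps bus would need a hit rate of roughly 75% to look like 2 TB/s.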
 

Saylick

Diamond Member
Sep 10, 2012
4,055
9,480
136
That graph stops at 140MB though, while as you say the rumored size for N31 is 512MB. It is likely that at that size the hit rate is near or even above 90% for 1440p or similar, with related bandwidth amplification effects. Well, AMD's choice seems to be quite effective in the short term: GDDR6X is mostly an Nvidia exclusive (moreover it requires different signalling, and it seems quite power hungry), so, excluding HBM because of the costs, the best memory available would have been GDDR6. Even pushing it into the 18-20Gbps range with a 384-bit bus would not have been enough, while with IC, N31/N32 will probably have an amplified bandwidth in the 2TB/s range (AMD already declares more than 1TB/s of "delivered" bandwidth for N21). Moreover, the more the VRAM frequency increases, the more power consumption in the bus increases. Probably in the future HBM will become more viable and we will see some consumer designs based on it. But for the moment it seems this IC solution is a fair compromise.
Yep, traditional GDDR memory isn't looking like it can scale efficiently when you transition to MCM. GPU manufacturers aren't going to slap on 2x384-bit memory interfaces for dual die MCM GPUs because it just eats too much into the power budget. It's rumored that Nvidia's Lovelace is going to slap on a large L2 cache (size unknown, but FWIW Nvidia A100 has 40 MB L2 cache); if so, it means they've probably also realized the same thing as AMD.
 

Hans Gruber

Platinum Member
Dec 23, 2006
2,517
1,358
136
Around 2008/09 they brought a 1080p decoder chip to laptops and integrated graphics. People with discrete GPUs (without the 1080p decoder chip) were using the brute power of the cards to decode 1080p, while integrated graphics sailed through 1080p content. They need to come out with a 4K decoder. 1440p is nothing. Once you step up to 4K, particularly on multiple monitors, it butchers performance. I am running 1440p on 3 monitors. My temps on my 1660S went up 1-2F when I upgraded the 3rd monitor from 1080p to 1440p. Going to 4K would make any card huff and puff all the time. A 1660S consumes barely any power, so huffing and puffing is fine, but a 3070 and above consumes significantly more power. If AMD adds a 4K decoder chip to the RDNA3 cards, it would chew through high resolution like nothing.
 

Mopetar

Diamond Member
Jan 31, 2011
8,496
7,753
136
Yep, traditional GDDR memory isn't looking like it can scale efficiently when you transition to MCM. GPU manufacturers aren't going to slap on 2x384-bit memory interfaces for dual die MCM GPUs because it just eats too much into the power budget.

If you were going to go with an MCM design why make each chip have such a large bus when you could probably do fine with each die having a smaller bus that results in a combined 384-bit interface? That was part of my argument and I'm still rather skeptical about any claims that future RDNA products that utilize an MCM design will have a total bus size as small as something like 192-bit. If that's using two dies it means that each only has a 96-bit bus. I can't even think of anything that's used a configuration like that before.
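To make the per-die arithmetic concrete, here is a small Python sketch that splits a total bus width evenly across dies and counts the memory devices each die would drive. It assumes standard 32-bit GDDR6 devices, one device per 32 bits (no clamshell mode), and an even split. The function name is hypothetical and nothing here reflects a confirmed RDNA3 layout.

```python
# Split a total memory bus across GPU dies and count GDDR6 devices per die.
GDDR6_DEVICE_BITS = 32  # a GDDR6 device has a 32-bit interface (two 16-bit channels)


def split_bus(total_bits, dies):
    """Return (bits_per_die, devices_per_die) for an even split.

    Raises ValueError when the split does not land on whole 32-bit devices.
    """
    if dies <= 0 or total_bits % dies:
        raise ValueError("bus width must divide evenly across the dies")
    per_die = total_bits // dies
    if per_die % GDDR6_DEVICE_BITS:
        raise ValueError("per-die width must be a multiple of 32 bits")
    return per_die, per_die // GDDR6_DEVICE_BITS


if __name__ == "__main__":
    for total in (192, 256, 384):
        bits, devices = split_bus(total, 2)
        print(f"{total}-bit total -> {bits}-bit and {devices} devices per die")
```

With two dies, a 192-bit total gives the 96-bit (three device) per-die bus mentioned above, 256-bit gives 128-bit per die, and 384-bit gives 192-bit per die.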

Around 2008/09 they brought a 1080p decoder chip to laptops and integrated graphics. . . If AMD adds a 4K decoder chip to the RDNA3 cards, it would chew through high resolution like nothing.

4K displays are still rather rare, so I'm not sure AMD feels much need to do something like this outside of top-end GPUs. It really comes down to how much additional die space such a decoder would need. Although it's far from a scientific poll, the Steam hardware survey does include resolution, and 4K accounts for 2.26% of its results. Even 1440p accounts for less than 10%, or almost exactly 10% if you include the 16:10 and ultra-wide QHD resolutions.

Adding in hardware that uses extra die space that will go unused by the vast majority of users just doesn't seem likely to be high on the priority list. If it doesn't increase the die size much or at all, they might certainly add it, but it's going to take a bit longer for more people to adopt 4K displays. One thing that may help is high-end laptops and ultrabooks pushing higher resolutions and leading towards a card outside of a high-end gaming one including a 4K decoder.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
If you were going to go with an MCM design why make each chip have such a large bus when you could probably do fine with each die having a smaller bus that results in a combined 384-bit interface? That was part of my argument and I'm still rather skeptical about any claims that future RDNA products that utilize an MCM design will have a total bus size as small as something like 192-bit. If that's using two dies it means that each only has a 96-bit bus. I can't even think of anything that's used a configuration like that before.

Because the problem is indeed the total width of the bus, as adding width adds external high frequency signals to be passed from the GPU to the VRAM, and analog parts on the GPU die(s), both solutions being more power hungry than having a smaller bus+cache.
 

Mopetar

Diamond Member
Jan 31, 2011
8,496
7,753
136
Because the problem is indeed the total width of the bus, as adding width adds external high frequency signals to be passed from the GPU to the VRAM, and analog parts on the GPU die(s), both solutions being more power hungry than having a smaller bus+cache.

I know. I'm saying don't do that. If you want a 256-bit bus, split it so that each chip has a 128-bit bus. Obviously this can create an issue where some memory modules aren't directly accessible by a particular die, but even the dies of today have fast interconnects in place, and it's not too hard to design them to be able to pass requests through each other.

If it were a big problem the obvious solution is an IO die similar to what's used in Zen CPUs.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136
I know. I'm saying don't do that. If you want a 256-bit bus, split it so that each chip has a 128-bit bus. Obviously this can create an issue where some memory modules aren't directly accessible by a particular die, but even the dies of today have fast interconnects in place, and it's not too hard to design them to be able to pass requests through each other.

If it were a big problem the obvious solution is an IO die similar to what's used in Zen CPUs.

According to the rumors we will have something similar.