Discussion RDNA4 + CDNA3 Architectures Thread

Page 8 - AnandTech Forums

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,702
136

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, though, the window in which they push support for new devices has been much shorter, presumably to prevent leaks.
Even so, the flurry of activity in LLVM amounts to a lot of commits. Maybe the US Govt is starting to prepare the SW environment for El Capitan early (perhaps to avoid a slow bring-up like Frontier's, for example).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Hopper, though, had the problem of no host CPU capable of PCIe 5 arriving in the very near future, so it may have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts; the MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,332
7,789
136
Memory bandwidth shouldn't be a problem for AMD (or NV) next year. 32 GT/s GDDR7 on a 384-bit interface is a lot. Even if we assume first-gen GDDR7 can't hit the PR value of 32 GT/s, a lower 28 GT/s still gives an incredible 1344 GB/s. By next year GDDR6 should be able to hit ~1000 GB/s on a 384-bit interface @ 22 GT/s.
According to Ryan's post on the front page, volumes will be too low in 2024 to use on consumer GFX cards. Those dice will be headed to high-end HPC/ML GPUs. Also, Samsung wants to get the power consumption down for consumer cards.
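The figures above are just data rate times bus width; a quick sketch of the arithmetic (rates and widths taken from the post):

```python
# Back-of-the-envelope check: peak GB/s = data rate (GT/s) * bus width (bits) / 8.
def gddr_bandwidth_gbps(data_rate_gtps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for a GDDR interface."""
    return data_rate_gtps * bus_width_bits / 8

print(gddr_bandwidth_gbps(32, 384))  # GDDR7 PR value: 1536.0 GB/s
print(gddr_bandwidth_gbps(28, 384))  # conservative first-gen GDDR7: 1344.0 GB/s
print(gddr_bandwidth_gbps(22, 384))  # fast GDDR6: 1056.0 GB/s
```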




As an aside - some people think the next round of consumer GPUs won't hit till 2025. I have two thoughts on this. First, I don't see AMD or NV doing much to change the release schedule of the next generation of cards - in part because AIB manufacturers will want something new and shiny to sell, especially given the lackluster sales of this gen. Second, AMD really needs an improvement over the RDNA3 GPUs and, IMHO, can't afford a delay. Nvidia surely isn't going to sit still and let itself be put in the rear-view mirror when AMD's top RDNA4 GPU is released.

So, I believe the 2H2024 rumors.
 
  • Like
Reactions: Joe NYC

menhera

Junior Member
Dec 10, 2020
21
66
61
Maybe AMD should merge the LDS and the two separate L0 cache slices in a WGP, practically doubling L0 without more transistors, just as Nvidia did with Turing. Radeon GPU Profiler indicates that most workloads use very little LDS, or none at all. In Nvidia's architectures, unused LDS (what Nvidia calls Shared Memory) serves as additional L1 cache. Clever design.
 

SteinFG

Senior member
Dec 29, 2021
386
445
106
I think that effective BW is just marketing BS.
(attached AMD slides on Infinity Cache hit rates and effective bandwidth)

Averaging it is just nonsense; either you have the data in cache or you don't.
If you have it, then you get the maximum ~1940 GB/s BW in the case of RDNA2 N21, or ~4470 GB/s for RDNA3 N31.
If you don't, then you're left with only the GDDR6/7 BW.

P.S. I wonder how much IC would be needed for a 90% hitrate at 4K. 1GB? :D
I think 4 stacks of Hynix HBM3E with 4 TB/s (1 TB/s per stack) and 64-96 GB VRAM (16-24 GB per stack) could end up cheaper to make.

Edit: It looks like IC size matters for total BW.
I think what Locuza wrote as theoretical Infinity Cache BW is wrong for N23/N24. N22 is 0.75 of N21 if we exclude hitrate, but N23 and N24 are not 1/4 and 1/8 of N21.
Then 1 GB of IC, excluding hitrate, would have 10.67x higher BW than N31 has?
If I take AMD's numbers, there's an easy way to calculate the L3 hitrate they're expecting. And about N23/N24 - yes, it's ~1/4 and ~1/8, otherwise the numbers don't add up at all.
(attached table of calculated hit rates)
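If AMD's "effective bandwidth" figure really is a hit-rate-weighted average of IC and VRAM bandwidth, the implied hit rate can be back-solved. A minimal sketch (the 1600 GB/s effective figure is a made-up illustration, not an AMD number; ~1940 GB/s is the N21 IC peak mentioned above and 512 GB/s is N21's GDDR6 peak):

```python
def implied_hit_rate(effective_bw: float, cache_bw: float, vram_bw: float) -> float:
    """Solve eff = h*cache + (1-h)*vram for the hit rate h."""
    return (effective_bw - vram_bw) / (cache_bw - vram_bw)

# N21-like peaks: ~1940 GB/s Infinity Cache, 512 GB/s GDDR6.
# 1600 GB/s is a hypothetical "effective bandwidth" marketing figure.
h = implied_hit_rate(1600, 1940, 512)
print(f"implied hit rate: {h:.1%}")  # ~76.2%
```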
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,319
2,805
106
If I take AMD's numbers, there's an easy way to calculate the L3 hitrate they're expecting. And about N23/N24 - yes, it's ~1/4 and ~1/8, otherwise the numbers don't add up at all.
Your L3 hit rate is seriously wrong.
How can N22's 96MB IC have a higher hit rate than N21's 128MB IC? That's nonsense. Or N23's 32MB IC having only a 3% higher hit rate than N24's 16MB IC?
N31 has the same amount of IC as N22, yet you got only 48% vs 60%.
You should recalculate it. BTW, the hit rates are already provided in the second picture.
 
Last edited:

SteinFG

Senior member
Dec 29, 2021
386
445
106
How can N22's 96MB IC have a higher hit rate than N21's 128MB IC?
Because N22 is targeting 1440p, and N21 is targeting 4K. You already shared a slide where a card has different hit rates depending on the resolution, so it shouldn't be hard to grasp.
Or N23's 32MB IC having only 3% higher hit rate than N24's 16MB IC?
That's a weird one; I don't have an explanation.
Maybe there's something I'm missing. Most likely the L3 BW is wrong on this one, as there's no info about it, and I'm just taking a % of N21's L3 BW instead.
EDIT: one possible explanation is that because N24 works with half the VRAM speed and half the VRAM size of N23, it has almost the same hit rate even with half the L3. idk.
You should recalculate It. BTW, hit rates are already provided in the second picture.
1) Hit rates are what I was calculating, so that kinda defeats the purpose.
2) The hit rates on the slide shown have huge deltas; I can't really do any math with that. I prefer concrete numbers, which is why I took the route of searching for exact figures, even if they're mostly for marketing.
 
Last edited:

TESKATLIPOKA

Platinum Member
May 1, 2020
2,319
2,805
106
Because N22 is targeting 1440p, and N21 is targeting 4K. You already shared a slide where a card has different hit rates depending on the resolution, so it shouldn't be hard to grasp.
Then you should mention that in your table; it's not like I know where you got that effective BW data, except the one for N21. Where did you find them?
Also, what you calculated vs. the data from the graph is quite different:
N21: 58% vs 62%. This isn't that different.
N22: 60% vs 69%. This is very different.
N23 and N24 look weird, but you already know that.

@SteinFG
EDIT: one possible explanation is that because N24 works with half the VRAM speed and half the VRAM size of N23, it has almost the same hit rate even with half the L3. idk.
Hit rate depends on resolution and Infinity Cache size. VRAM size or speed has nothing to do with it.
N24 could have a hitrate comparable to N23's only if it were at a lower resolution.
 
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,702
136
Maybe AMD should merge LDS and two separate L0 cache slices in a WGP, practically doubling L0 without more transistors just as Nvidia did with Turing. Radeon GPU Profiler indicates that most workloads use very little LDS or not at all. In Nvidia's architectures, unused LDS (what Nvidia calls Shared Memory) serves as additional L1 cache. Clever design.
Using the LDS to augment L0 (depicted as L1 in the pic) when unused, and sharing L0 across the SIMDs in a WGP, appeared in the patent below.
RDNA2 has problems when one CU in a WGP wants data present in the other CU's L0 within the WGP: it has to go via the LDS. This was covered in a presentation by Lou Kramer of AMD. Not sure if the same issue is there in RDNA3.
(patent figure attachment)
PROCESSING DEVICE AND METHOD OF SHARING STORAGE BETWEEN CACHE MEMORY, LOCAL DATA STORAGE AND REGISTER FILES
<https://www.freepatentsonline.com/y2023/0069890.html>

I think you know your stuff, but you are playing along :)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,702
136
A recap of the bunch of new RT-related patents too.
Compare the RT-related patents leading up to RDNA3 vs the ones below leading up to RDNA4.

BOUNDING VOLUME HIERARCHY HAVING ORIENTED BOUNDING BOXES WITH QUANTIZED ROTATIONS
From <https://www.freepatentsonline.com/y2023/0099806.html>
ACCELERATION STRUCTURES WITH DELTA INSTANCES
From <https://www.freepatentsonline.com/y2023/0097562.html>
TECHNIQUES FOR INTRODUCING ORIENTED BOUNDING BOXES INTO BOUNDING VOLUME HIERARCHY
From <https://www.freepatentsonline.com/y2023/0027725.html>
SPATIAL HASHING FOR WORLD-SPACE SPATIOTEMPORAL RESERVOIR RE-USE FOR RAY TRACING
From <https://www.freepatentsonline.com/y2022/0406002.html>
MACHINE-LEARNING BASED COLLISION DETECTION FOR OBJECTS IN VIRTUAL ENVIRONMENTS
From <https://www.freepatentsonline.com/y2022/0319096.html>
This one appeared in a paper as well recently. https://gpuopen.com/download/publications/HPG2023_NeuralIntersectionFunction.pdf

OVERLAY TREES FOR RAY TRACING
From <https://www.freepatentsonline.com/y2023/0196669.html>
GRAPHICS PROCESSING UNIT TRAVERSAL ENGINE
From <https://www.freepatentsonline.com/y2023/0206543.html>
VARIABLE WIDTH BOUNDING VOLUME HIERARCHY NODES
From <https://www.freepatentsonline.com/y2023/0206542.html>
COMMON CIRCUITRY FOR TRIANGLE INTERSECTION AND INSTANCE TRANSFORMATION FOR RAY TRACING
From <https://www.freepatentsonline.com/y2023/0206541.html>
BOUNDING VOLUME HIERARCHY BOX NODE COMPRESSION
From <https://www.freepatentsonline.com/y2023/0206540.html>
BVH NODE ORDERING FOR EFFICIENT RAY TRACING
From <https://www.freepatentsonline.com/y2023/0206539.html>
FRUSTUM-BOUNDING VOLUME INTERSECTION DETECTION USING HEMISPHERICAL PROJECTION
From <https://www.freepatentsonline.com/y2023/0206544.html>

What stands out is that a lot of the new concepts in these patents don't seem to rely solely on shader code.
 

menhera

Junior Member
Dec 10, 2020
21
66
61
Compare RT related patents leading up to RDNA3 vs the ones below leading up to RDNA4. […]
What is standing out is that a lot of the new concepts in these patents don't seem to rely solely on shader code.
Nice patents revealed recently. The TRAVERSAL ENGINE one looks like the thing AMD needs most.
 

Aapje

Golden Member
Mar 21, 2022
1,267
1,705
96
As an aside - some people think the next round of consumer GPUs won’t hit till 2025. I have two opinions on this. First, I don’t see AMD or NV doing much to change the release schedules of the next generation of AIBs - in part because AIB manufacturers will want something new and shiny to sell, especially given the lackluster sales of this gen.
What makes you think that Nvidia cares at all about the well-being of AIBs?
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,196
5,197
136
Is a 512-bit bus even possible with GDDR6/X? Let's say with a custom PCB?

Certainly. Why wouldn't it be?

It's just expensive in terms of die area, and a pain to route, because that's 16 memory channels, requiring 16 memory chips.

We probably won't see a 512-bit+ GDDR bus again. If they need more than a 384-bit bus can deliver, they will probably go HBM.
 
  • Like
Reactions: Tlh97 and SmokSmog

SteinFG

Senior member
Dec 29, 2021
386
445
106
With GDDR6W it's easy: just use 8 chips. Each chip has 64 IO lanes; 8×64 = 512. I assume that this is the primary reason for creating GDDR6W.
 
  • Like
Reactions: Ajay

Heartbreaker

Diamond Member
Apr 3, 2006
4,196
5,197
136
With GDDR6W it's easy, just use 8 chips. Each chip has 64 io lanes. 8x64=512. I assume that this is the primary reason for creating GDDR6W.

The benefits of GDDR6W are small. You get a slight board packaging/routing advantage, mainly useful if you want a very wide GDDR bus for your GPU.

The problem with that is that GPU makers want to use the narrowest bus they can get away with, and they want to have multiple suppliers. Plus there is no improvement in capacity per channel: these are just double-capacity chips that take two bus channels, leaving total capacity the same.

I suppose it could make headway if all the major GDDR makers got on board, in partnership with GPU makers, but overall GDDR6W really doesn't move the needle on anything other than a bit of board packaging benefit.

We might switch to GDDR6W or we might not, but it really doesn't matter to the consumer at all. We won't see any benefit from the change, if/when it happens.
 
Last edited:

SteinFG

Senior member
Dec 29, 2021
386
445
106
The benefits are small for GDDR6W. You get a slight board packaging/routing advantage. […] We won't see any benefit from the change, if/when it happens.
The question was how to get a 512-bit GPU to work with GDDR6, not whether it makes sense 乁⁠|⁠ ⁠・⁠ ⁠〰⁠ ⁠・⁠ ⁠|⁠ㄏ
 

Ajay

Lifer
Jan 8, 2001
15,332
7,789
136
What makes you think that Nvidia cares at all about the well-being of AIBs?
Uh, to a degree, yes. I'm sure EVGA dropping out got their attention. That said, Nvidia has really squeezed their AIBs in recent years, and the AIBs have responded with higher prices to keep their profits up.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,789
136
With GDDR6W it's easy, just use 8 chips. Each chip has 64 io lanes. 8x64=512. I assume that this is the primary reason for creating GDDR6W.
👍 I think we will see GDDR6W in use in consumer GFX cards before GDDR7. Also, pretty sure the GDDR6W chips are 32 Gb (though I may be wrong).
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,319
2,805
106
👍I think we will see GDDR6W in use in consumer GFX cards b/4 GDDR7. Also, pretty sure the GDDR6W's are 32 Gb chips (though I may be wrong).
They are 32 Gbit chips, but the problem is that the IO is doubled too.
As I understand it, on a 128-bit bus you will still be limited to the same BW and 8 GB of VRAM.
The only advantage is that a 32 Gbit GDDR6W chip has the same or a similar size as a 16 Gbit GDDR6 chip, so you need only half the memory chips. This would be very useful in laptops.
If AMD or Nvidia doubled the memory controller width, or at least increased it by 50%, then yes, it would provide increased capacity along with higher BW with the same or fewer chips.
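The point above can be sketched with a toy model (my own illustration; the lane and density figures are the ones quoted in this thread): on the same 128-bit bus, GDDR6W halves the chip count but leaves bandwidth and capacity unchanged.

```python
# Toy comparison of GDDR6 vs GDDR6W board configs on the same bus.
def board_config(bus_bits: int, lanes_per_chip: int, gbit_per_chip: int, gbps_per_pin: float):
    chips = bus_bits // lanes_per_chip
    return {
        "chips": chips,
        "vram_gb": chips * gbit_per_chip // 8,   # Gbit -> GB
        "bw_gbps": bus_bits * gbps_per_pin / 8,  # bits per transfer -> bytes
    }

# 128-bit bus at 18 Gbps per pin:
print(board_config(128, 32, 16, 18))  # GDDR6:  4 chips, 8 GB, 288.0 GB/s
print(board_config(128, 64, 32, 18))  # GDDR6W: 2 chips, 8 GB, 288.0 GB/s
```

Only the chip count changes, which is exactly the laptop/board-area argument made above.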
 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,332
7,789
136
They are 32gbit chips, but the problem is that IO is doubled.
As I understand It, using a 128-bit bus you will still be limited to same BW and 8GB Vram size.
The only advantage is that this 32gbit GDDR6w chip has the same or similar size as a 16gbit GDDR6 chip, so you need only 1/2 of memory chips. This would be very useful in laptops.
If AMD or Nvidia doubled or at least increased memory controller width by 50% then yes, It would provide increased capacity along with higher BW with the same or fewer amount of chips.
Ah, I see. So what is the benefit of GDDR6W? Cost and component area? Bummer, from a desktop perspective.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,196
5,197
136
The question was how to get 512bit GPU to work with GDDR6, not why it should makes sense 乁⁠|⁠ ⁠・⁠ ⁠〰⁠ ⁠・⁠ ⁠|⁠ㄏ

A 512-bit bus can work with GDDR6, just like it worked with GDDR5 (about the same package size and only marginally more pins). You don't need GDDR6W to get a 512-bit bus working.

There simply is no need/desire for a 512-bit bus anymore.
 
Last edited:

TESKATLIPOKA

Platinum Member
May 1, 2020
2,319
2,805
106
Ah, I see. So, what is the benefit of GDDR6W?? Cost and component area? Bummer, from a desktop perspective.
With current GPUs you only save some PCB space and PCB cost, and likely one GDDR6W chip is cheaper than 2 GDDR6 ones.
For example:
RTX 4060 8GB -> 4 GDDR6 chips or 2 GDDR6W chips
RTX 4090 24GB -> 12 GDDR6 chips or 6 GDDR6W chips
This doesn't sound like much and is mostly useful for laptops, where space is limited.

If AMD and Nvidia were willing to widen the memory bus, and GDDR7 was limited to 16 Gbit chips or wasn't released at the time, then it could be more interesting.
An example:
| | CU (WGP) | Clockspeed | Memory width | Memory type | Memory chips | VRAM total | Bandwidth |
|---|---|---|---|---|---|---|---|
| RX 7600 | 32 (16) | 2655 MHz | 128-bit | 18 Gbps GDDR6 | 4× 16 Gbit | 8 GB | 288 GB/s |
| RX 8600v1 | 48 (24) +50% | 3151 MHz +18.6% | 128-bit | 32 Gbps GDDR7 | 4× 16 Gbit | 8 GB | 512 GB/s +78% |
| RX 8600v2 | 48 (24) +50% | 3540 MHz +33.3% | 256-bit | 18 Gbps GDDR6W | 4× 32 Gbit | 16 GB | 576 GB/s +100% |
P.S. I scaled the clockspeeds so that TFLOPs (CU count × clock) rise in step with the BW increase for both version 1 and version 2.

As shown above, GDDR6W could have some good advantages, especially in laptops.
Advantages over GDDR7:
1.) Likely the PCB size wouldn't change
2.) 2x VRAM if GDDR7 is still limited to 16 Gbit chips
3.) Higher BW
4.) Maybe GDDR6W would be cheaper than or the same price as GDDR7

Disadvantages over GDDR7:
1.) The wider memory bus would probably result in higher power consumption
2.) A bigger GPU because of the extra 128-bit PHY - maybe an extra 25-30 mm² at 6 nm? This shouldn't affect the GPU package size much, so it shouldn't increase PCB size either.

Not sure if I forgot something.
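The table's bandwidth deltas can be reproduced with the same per-pin arithmetic (a quick sketch using only the rates and widths from the table above):

```python
def bw(gbps_per_pin: float, bus_bits: int) -> float:
    """Peak bandwidth in GB/s."""
    return gbps_per_pin * bus_bits / 8

base = bw(18, 128)  # RX 7600-like: 288.0 GB/s
v1 = bw(32, 128)    # hypothetical GDDR7 version: 512.0 GB/s
v2 = bw(18, 256)    # hypothetical GDDR6W version: 576.0 GB/s
print(f"v1: +{v1 / base - 1:.0%}, v2: +{v2 / base - 1:.0%}")  # v1: +78%, v2: +100%
```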
 
Last edited:

Heartbreaker

Diamond Member
Apr 3, 2006
4,196
5,197
136
With current GPUs you only save some space on PCB and cost for PCB and likely this chip is cheaper than 2 GDDR6 ones.

Likely, as a specialized part, it ends up more expensive than two equivalent GDDR6 parts. That's the problem with it not being in competition: you can buy regular GDDR6 from Micron, Samsung, and Hynix, but so far only Samsung is talking GDDR6W.

For example: […]
As shown above, this GDDR6W could have some good advantages, especially in laptops. […]
Not sure If I forgot something.

None of those are advantages vs just using regular GDDR6. "W" is still just slightly easier packaging; everything else is the same between GDDR6 and GDDR6W. It's not an incentive to double the bus, because the biggest expense of a wider bus is on the GPU side: it takes significant GPU silicon area to implement, and "W" doesn't change that.
 

SteinFG

Senior member
Dec 29, 2021
386
445
106
Honestly, my idea is that GDDR6W will be good for making GPUs more compact. Have you seen the reference 4060 PCB?

(reference RTX 4060 PCB photo)

Imagine this size of PCB, but for something like a 4080, with the VRAM on-package. Here's a quick Paint mockup:
(mockup image attachment)
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,319
2,805
106
Likely as a specialized part, it ends up more expensive than two equivalent GDDR6 parts. That's the problem with it not being in competition. You can buy regular GDDR6 from Micron, Samsung, Hynix, so far only Samsung is talking GDDR6W.
In laptops it could still be worth it for OEMs, even if it ends up costlier, maybe.

None of those are advantages vs still just using regular GDDR6. "W" is still just slightly easier packaging. Everything else is still the same GDDR6 vs GDDR6W. It's not an incentive to double the bus, because the biggest expense of a wider bus is on the GPU side, as it uses significant GPU silicon area to implement. "W" doesn't change that.
You are right. For higher BW you would still widen the bus, and using GDDR6W wouldn't change anything except saving PCB space compared to GDDR6. Actually, GDDR7 would be preferable with a narrower bus, especially if they made 32 Gbit chips.
 
  • Like
Reactions: Tlh97