Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,599
5,762
136

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much reduced to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Government is starting to prepare the software environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5.0 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts, MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 

Tigerick

Senior member
Apr 1, 2022
638
521
106
The amount of CUs is not a problem per se.
It's only a problem if clocks or IPC stay the same; that was the problem with the RX 7600, along with the small benefit from dual-issue.

If this 32CU RDNA GPU allows 25% higher clocks, 2655 MHz * 1.25 ≈ 3319 MHz, would you care that it has only 32CU?
I wouldn't, if it is sold for a reasonable price and not with just 8GB of VRAM.

P.S. But considering Strix Halo has 40CU, I would expect at least the same amount for the weaker chip.

edit:
Maybe it's not impossible to pack 40CU into a 160mm2 chip, with the bigger one at 250mm2.
I mean 40CU, 2560SP, 160TMU, 64ROPs, 32MB, 128-bit GDDR7 for the smaller one.
Clock it to 3.5GHz and you should be close to the 7700XT, of course with 128-bit 30-32Gbps GDDR7. :D
BTW, I got ~220mm2 for an RX 7600 with 40CU on 6nm.
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.

Does anyone know whether any X leakers have mentioned that GDDR7 will be bundled with RDNA4? Because if AMD is prepared to launch RDNA4 at CES 2024, that means RDNA4 taped out by the end of 2022, with engineering samples available since the end of 2021...
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,348
2,843
106
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.
Of course they (AMD or Nvidia) want to clock the next gen higher, because then they can save on execution units and reduce BOM by having a smaller chip.

AD107, for example, could never be so small while keeping that level of performance if it were not clocked much higher than Ampere.
 

tajoh111

Senior member
Mar 28, 2005
298
312
136
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.

Does anyone know whether any X leakers have mentioned that GDDR7 will be bundled with RDNA4? Because if AMD is prepared to launch RDNA4 at CES 2024, that means RDNA4 taped out by the end of 2022, with engineering samples available since the end of 2021...
No way is RDNA 4 launching at CES 2024. That is way too soon after the launch of RDNA3, and there are not enough solid leaks to indicate it is just a month from launch. Board partners would have the cards and we would be getting more leaks.

People are way, way too optimistic about AMD's release cadence. Just look at the first post of this thread, for example. MI300 shortly after, or even before, the H100 launch?

A 250mm2 or smaller die does not need GDDR7, particularly a 4nm one. A 256-bit bus with 24 Gbps GDDR6 already provides 768 GB/s of bandwidth. That's more than enough, considering it is 50% more than a 6900 XT. GDDR7 is coming at the end of 2024, and with RDNA 4 hopefully launching before that, the timing and cost do not make sense for RDNA4.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,348
2,843
106
A 250mm2 or smaller die does not need GDDR7, particularly a 4nm one. A 256-bit bus with 24 Gbps GDDR6 already provides 768 GB/s of bandwidth. That's more than enough, considering it is 50% more than a 6900 XT. GDDR7 is coming at the end of 2024, and with RDNA 4 hopefully launching before that, the timing and cost do not make sense for RDNA4.
A 250mm2 die will certainly not use a 256-bit bus.
A 192-bit bus even with 24 Gbps GDDR6 provides only 576 GB/s, which is merely comparable to the RX 7900 GRE; the RX 7800 XT has 624 GB/s.
Let's not forget that this chip is supposed to perform better than the above-mentioned GPUs.
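The bandwidth figures being traded in this exchange all come from one piece of arithmetic: bus width in bits, divided by 8, times the per-pin data rate in Gbps. A quick sketch of that math (the function name is mine):

```python
def gddr_bandwidth_gbps(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: (bus width / 8 bits per byte) * per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

# The configurations quoted above:
print(gddr_bandwidth_gbps(256, 24))  # 768.0 GB/s, the 256-bit 24 Gbps GDDR6 case
print(gddr_bandwidth_gbps(192, 24))  # 576.0 GB/s, matching the RX 7900 GRE
```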
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Based upon AMD's presentation, MI300X is faster than H100 1:1, and scaling is significantly better with MI300X thanks to the larger memory capacity and bandwidth.

Based on Nvidia's H200 announcement, the H200 will be faster in compute, but MI300X will still have an advantage in memory bandwidth and especially capacity. This sets up an interesting comparison: I believe Nvidia will be able to claim significantly higher performance with H200 on a 1:1 basis, but at scale I'm not sure they will maintain that lead. This is really not my world, though, so I'll end it there and let others with more knowledge give their thoughts.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,040
2,610
96
the H200 will be faster in compute
Nah, it's just a memory capacity and bandwidth bump using unobtainium HBM3e.
I believe Nvidia will be able to claim significantly higher performance with H200 on a 1:1 basis
Single-GPU perf has been irrelevant for years; NV knows it too, and always quotes their 8/32/128-GPU scaling.
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Nah, it's just a memory capacity and bandwidth bump using unobtainium HBM3e.

Single-GPU perf has been irrelevant for years; NV knows it too, and always quotes their 8/32/128-GPU scaling.

So how does NV get up to a 1.9x speedup from H100 to H200? Again, not my world, so I am very much looking at things peripherally and just going off of vendor charts. Looking at the actual compute numbers, you are right: H200 and H100 are the same, and well behind MI300 in traditional FP32/FP64. The only difference I see in the chart below is that H100 has a different "BS" number, but I have no idea what that is. Anyone know what they are actually comparing here?


 

Saylick

Diamond Member
Sep 10, 2012
3,114
6,260
136
So how does NV get up to a 1.9x speedup from H100 to H200? Again, not my world, so I am very much looking at things peripherally and just going off of vendor charts. Looking at the actual compute numbers, you are right: H200 and H100 are the same, and well behind MI300 in traditional FP32/FP64. The only difference I see in the chart below is that H100 has a different "BS" number, but I have no idea what that is. Anyone know what they are actually comparing here?


As @adroc_thurston pointed out, and now looking at the data sheets, H100 and H200 have the exact same compute numbers, so any speedup will only appear when memory bandwidth/capacity is the bottleneck, as far as I can tell.
Correct. Those footnotes seem to suggest H200 uses a batch size (BS?) that is twice as large as H100's, so my guess is that Hopper isn't constrained by compute for GPT-3 and Llama 2 70B, but by memory bandwidth/capacity.
 

Timorous

Golden Member
Oct 27, 2008
1,595
2,707
136
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.

Does anyone know whether any X leakers have mentioned that GDDR7 will be bundled with RDNA4? Because if AMD is prepared to launch RDNA4 at CES 2024, that means RDNA4 taped out by the end of 2022, with engineering samples available since the end of 2021...

RDNA3 clocks very high if you give it the juice, so design-wise the potential is there; they just need to do it at lower power draw, which we have seen plenty of times before.
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Correct. Those footnotes seem to suggest H200 uses a batch size (BS?) that is twice as large as H100's, so my guess is that Hopper isn't constrained by compute for GPT-3 and Llama 2 70B, but by memory bandwidth/capacity.

Yes, looking into it, I believe it is batch size. I'm not sure Nvidia's comparison is honest here, or if they are purposefully gimping H100 in the comparison to make H200 look better. H100 has 57% of the memory capacity and 70% of the bandwidth of H200, so giving H100 a batch size 1/4 that of H200 seems unnecessary, but I honestly have no idea if it's legit or not. I also didn't see any listed options in AMD's comparison between H100 and MI300X beyond it being a 70B-parameter model.
 

Timorous

Golden Member
Oct 27, 2008
1,595
2,707
136
A 250mm2 die will certainly not use a 256-bit bus.
A 192-bit bus even with 24 Gbps GDDR6 provides only 576 GB/s, which is merely comparable to the RX 7900 GRE; the RX 7800 XT has 624 GB/s.
Let's not forget that this chip is supposed to perform better than the above-mentioned GPUs.

GDDR7 will also have 3 GB modules, so 192-bit can support 18GB from 6 chips and 128-bit could support 12GB from 4 chips.
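The capacity arithmetic here follows from each GDDR module having a 32-bit interface, so chip count is just bus width divided by 32. A small sketch (function name is mine):

```python
def vram_capacity_gb(bus_width_bits: int, module_gb: int) -> int:
    """Total VRAM from filling a bus with one GDDR module per 32-bit channel."""
    chips = bus_width_bits // 32  # each GDDR6/GDDR7 module has a 32-bit interface
    return chips * module_gb

print(vram_capacity_gb(192, 3))  # 18 GB from 6 chips
print(vram_capacity_gb(128, 3))  # 12 GB from 4 chips
print(vram_capacity_gb(128, 2))  # 8 GB, the RX 7600 configuration
```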
 

MrTeal

Diamond Member
Dec 7, 2003
3,567
1,686
136
We'll have to see how many 3GB modules actually get made. 1.5GB ones were part of the GDDR6 spec as well, but never ended up getting produced.
It might actually work out this time, though, if they can't economically produce 4GB ones and are stuck with 2GB and 3GB for at least a while.
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Just to follow up on my own post: according to the tests I've found (e.g. https://infohub.delltechnologies.co...using-low-rank-adaptation-lora-on-single-gpu/), memory usage scales less than 1:1 with batch size. Meaning, you don't need 4x the memory to increase the batch size by 4x. Also, increasing the batch size has a very large impact on performance, assuming you have the memory to support it. Based upon this and my basic understanding, Nvidia is pulling a bit of a fast one with their H200 to H100 comparison: they are arbitrarily limiting the H100 batch size to make the H200 performance increase look better than it is. It will be interesting to see a direct 1:1 comparison between MI300X and H200, as well as scalability.

Note: just pointing out again that I am learning on the fly here, so feel free to critique my findings.
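The less-than-1:1 scaling described above has a simple shape: inference memory is roughly a fixed cost for the weights plus a per-sequence cost for the KV cache, and the weights dominate. A toy sketch — all numbers are made up for a hypothetical 70B-class model, only the shape of the curve matters:

```python
WEIGHTS_GB = 140.0         # hypothetical: ~70B params at 2 bytes each (fp16)
KV_CACHE_GB_PER_SEQ = 2.0  # hypothetical per-sequence KV cache at some context length

def total_memory_gb(batch_size: int) -> float:
    """Rough inference memory: fixed weight cost plus KV cache per in-flight sequence."""
    return WEIGHTS_GB + batch_size * KV_CACHE_GB_PER_SEQ

for bs in (1, 4, 16):
    print(bs, total_memory_gb(bs))
# Quadrupling the batch (1 -> 4) needs only ~1.04x the memory in this toy model,
# which is the less-than-1:1 scaling the post describes.
```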
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
ServeTheHome has some additional press slides with hard numbers. Based upon these, MI300X has higher performance than H100/H200 in every aspect, at least theoretically. I'm guessing AMD used the same batch sizes when they compared MI300X to H100 in their presentation, as the performance increase they showed correlates well with the numbers below. Had they limited H100's batch size as Nvidia did, MI300X would probably have been over 2x faster.

[Image: AMD Instinct MI300X vs. H100 spec comparison]


 

Saylick

Diamond Member
Sep 10, 2012
3,114
6,260
136
Just to follow up on my own post: according to the tests I've found (e.g. https://infohub.delltechnologies.co...using-low-rank-adaptation-lora-on-single-gpu/), memory usage scales less than 1:1 with batch size. Meaning, you don't need 4x the memory to increase the batch size by 4x. Also, increasing the batch size has a very large impact on performance, assuming you have the memory to support it. Based upon this and my basic understanding, Nvidia is pulling a bit of a fast one with their H200 to H100 comparison: they are arbitrarily limiting the H100 batch size to make the H200 performance increase look better than it is. It will be interesting to see a direct 1:1 comparison between MI300X and H200, as well as scalability.

Note: just pointing out again that I am learning on the fly here, so feel free to critique my findings.
Thanks for doing some research. As always with marketing slides, take with a grain of salt.
 

Saylick

Diamond Member
Sep 10, 2012
3,114
6,260
136
ServeTheHome has some additional press slides with hard numbers. Based upon these, MI300X has higher performance than H100/H200 in every aspect, at least theoretically:

[Image: AMD Instinct MI300X vs. H100 spec comparison]


That memory capacity and bandwidth advantage ought to be clutch, now that I know how Nvidia determined their H200 inference performance gain over H100.
 

jpiniero

Lifer
Oct 1, 2010
14,562
5,194
136
We'll have to see how many 3GB modules actually get made. 1.5GB ones were part of the GDDR6 spec as well, but never ended up getting produced.
It might actually work out this time, though, if they can't economically produce 4GB ones and are stuck with 2GB and 3GB for at least a while.

Micron's roadmap does suggest capacities beyond 3 GB in 2026...
 

Abwx

Lifer
Apr 2, 2011
10,922
3,413
136