Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,599
5,762
136

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much reduced to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Government is starting to prepare the software environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5.0 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts, MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 

Tigerick

Senior member
Apr 1, 2022
638
521
106
The amount of CUs is not a problem per se.
It's only a problem if clocks or IPC stay the same; that was the problem with the RX 7600, along with the small benefit from dual-issue.

If this 32CU RDNA GPU allows 25% higher clocks, 2655 MHz * 1.25 ≈ 3319 MHz, would you care that it has only 32CU?
I wouldn't, if it is sold for a reasonable price and not with just 8GB of VRAM.

P.S. But considering Strix Halo has 40CU, I would expect at least the same amount for the weaker chip.

edit:
Maybe it's not impossible to pack 40CU into a 160mm2 chip, with the bigger one at 250mm2.
I mean 40CU, 2560SP, 160TMU, 64ROPs, 32MB, 128-bit GDDR7 for the smaller one.
Clock it to 3.5GHz and you should be close to the 7700XT, of course with 128-bit 30-32Gbps GDDR7. :D
BTW, I got ~220mm2 for an RX 7600 with 40CU on 6nm.
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.

Does anyone know whether any X leakers have mentioned that GDDR7 will be bundled with RDNA4? Because if AMD is prepared to launch RDNA4 at CES 2024, that means RDNA4 taped out by the end of 2022, with engineering samples available since the end of 2021...
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,348
2,843
106
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.
Of course they (AMD or Nvidia) want to clock the next gen higher, because then they can save on execution units and reduce BOM by having a smaller chip.

AD107, for example, could never be so small while keeping that level of performance if it were not clocked much higher than Ampere.
 

tajoh111

Senior member
Mar 28, 2005
298
312
136
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.

Does anyone know whether any X leakers have mentioned that GDDR7 will be bundled with RDNA4? Because if AMD is prepared to launch RDNA4 at CES 2024, that means RDNA4 taped out by the end of 2022, with engineering samples available since the end of 2021...
No way is RDNA 4 launching at CES 2024. That is way too soon after the launch of RDNA3, and there are not enough solid leaks to indicate it is just a month from launch. Board partners would have the cards and we would be getting more leaks.

People are way, way too optimistic about AMD's release cadence. Just look at the first post of this thread, for example. MI300 shortly after, or even before, the H100 launch?

A 250mm2 or smaller die does not need GDDR7, particularly a 4nm one. A 256-bit bus with 24 Gbps GDDR6 already provides 768 GB/s of bandwidth. That's more than enough, considering it is 50% more than a 6900 XT. GDDR7 is coming at the end of 2024, and with RDNA 4 hopefully launching before that, the timing and cost do not make sense for RDNA4.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,348
2,843
106
A 250mm2 or smaller die does not need GDDR7, particularly a 4nm one. A 256-bit bus with 24 Gbps GDDR6 already provides 768 GB/s of bandwidth. That's more than enough, considering it is 50% more than a 6900 XT. GDDR7 is coming at the end of 2024, and with RDNA 4 hopefully launching before that, the timing and cost do not make sense for RDNA4.
A 250mm2 die will certainly not use a 256-bit bus.
A 192-bit bus even with 24 Gbps GDDR6 provides only 576 GB/s, which is merely comparable to the RX 7900 GRE; the RX 7800 XT has 624 GB/s.
Let's not forget that this chip is supposed to perform better than the above-mentioned GPUs.
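The bandwidth figures being traded in this exchange all come from one piece of arithmetic: bus width in bits, divided by 8, times the per-pin data rate in Gbps. A quick sketch of that math (the function name is mine):

```python
def gddr_bandwidth_gbps(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: (bus width / 8 bits per byte) * per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

# The configurations quoted above:
print(gddr_bandwidth_gbps(256, 24))  # 768.0 GB/s, the 256-bit 24 Gbps GDDR6 case
print(gddr_bandwidth_gbps(192, 24))  # 576.0 GB/s, matching the RX 7900 GRE
```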
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Based upon AMD's presentation, MI300X is faster than H100 1:1, and scaling is significantly better with MI300X thanks to the larger memory capacity and bandwidth.

Based on Nvidia's H200 announcement, the H200 will be faster in compute, but MI300X will still have an advantage in memory bandwidth and especially capacity. This sets up an interesting comparison: I believe Nvidia will be able to claim significantly higher performance with H200 on a 1:1 basis, but at scale I'm not sure they will maintain that lead. This is really not my world, though, so I'll end it there and let others with more knowledge give their thoughts.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,040
2,610
96
the H200 will be faster in compute
Nah, it's just a memory capacity and bandwidth bump using unobtainium HBM3e.
I believe Nvidia will be able to claim significantly higher performance with H200 on a 1:1 basis
Single-GPU perf has been irrelevant for years; NV knows it too, and always quotes their 8/32/128-GPU scaling.
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Nah, it's just a memory capacity and bandwidth bump using unobtainium HBM3e.

Single-GPU perf has been irrelevant for years; NV knows it too, and always quotes their 8/32/128-GPU scaling.

So how does NV get up to a 1.9x speedup from H100 to H200? Again, not my world, so I am very much looking at things peripherally and just going off of vendor charts. Looking at the actual compute numbers, you are right: H200 and H100 are the same, and well behind MI300 in traditional FP32/FP64. The only difference I see in the chart below is that H100 has a different "BS" number, but I have no idea what that is. Anyone know what they are actually comparing here?


 

Saylick

Diamond Member
Sep 10, 2012
3,114
6,260
136
So how does NV get up to a 1.9x speedup from H100 to H200? Again, not my world, so I am very much looking at things peripherally and just going off of vendor charts. Looking at the actual compute numbers, you are right: H200 and H100 are the same, and well behind MI300 in traditional FP32/FP64. The only difference I see in the chart below is that H100 has a different "BS" number, but I have no idea what that is. Anyone know what they are actually comparing here?


As @adroc_thurston pointed out, and now looking at the data sheets, H100 and H200 have the exact same compute numbers, so any speedup will only appear when memory bandwidth/capacity is the bottleneck, as far as I can tell.
Correct. Those footnotes seem to suggest H200 uses a batch size (BS?) that is twice as large as H100's, so my guess is that Hopper isn't constrained by compute for GPT-3 and Llama 2 70B, but by memory bandwidth/capacity.
 

Timorous

Golden Member
Oct 27, 2008
1,595
2,707
136
You seem to think every GPU will clock higher to get higher performance. I would say yes for high-end GPUs, but not really for mainstream GPUs. To drive the BOM down on a newer process node (which itself raises BOM), newer-node GPUs like AD107 use a tiny die combined with a simpler PCB with fewer VRMs and MOSFETs to bring power down to around 115W. Compare the PCB designs of the RX 7600 (165W) and RTX 4060 (115W) and you will see how RDNA4 GPUs are going to be designed.

Does anyone know whether any X leakers have mentioned that GDDR7 will be bundled with RDNA4? Because if AMD is prepared to launch RDNA4 at CES 2024, that means RDNA4 taped out by the end of 2022, with engineering samples available since the end of 2021...

RDNA3 clocks very high if you give it the juice, so design-wise the potential is there; they just need to do it at lower power draw, which we have seen plenty of times before.
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Correct. Those footnotes seem to suggest H200 uses a batch size (BS?) that is twice as large as H100's, so my guess is that Hopper isn't constrained by compute for GPT-3 and Llama 2 70B, but by memory bandwidth/capacity.

Yes, looking into it, I believe it is batch size. I'm not sure Nvidia's comparison is honest here, or if they are purposefully gimping H100 in the comparison to make H200 look better. H100 has 57% of the memory capacity and 70% of the bandwidth of H200, so giving H100 a batch size 1/4 that of H200 seems unnecessary, but I honestly have no idea if it's legit or not. I also didn't see any listed options in AMD's comparison between H100 and MI300X beyond it being a 70B-parameter model.
 

Timorous

Golden Member
Oct 27, 2008
1,595
2,707
136
A 250mm2 die will certainly not use a 256-bit bus.
A 192-bit bus even with 24 Gbps GDDR6 provides only 576 GB/s, which is merely comparable to the RX 7900 GRE; the RX 7800 XT has 624 GB/s.
Let's not forget that this chip is supposed to perform better than the above-mentioned GPUs.

GDDR7 will also have 3 GB modules, so 192-bit can support 18GB from 6 chips and 128-bit could support 12GB from 4 chips.
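The capacity arithmetic here follows from each GDDR module having a 32-bit interface, so chip count is just bus width divided by 32. A small sketch (function name is mine):

```python
def vram_capacity_gb(bus_width_bits: int, module_gb: int) -> int:
    """Total VRAM from filling a bus with one GDDR module per 32-bit channel."""
    chips = bus_width_bits // 32  # each GDDR6/GDDR7 module has a 32-bit interface
    return chips * module_gb

print(vram_capacity_gb(192, 3))  # 18 GB from 6 chips
print(vram_capacity_gb(128, 3))  # 12 GB from 4 chips
print(vram_capacity_gb(128, 2))  # 8 GB, the RX 7600 configuration
```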
 

MrTeal

Diamond Member
Dec 7, 2003
3,567
1,686
136
We'll have to see how many 3GB modules actually get made. 1.5GB ones were part of the GDDR6 spec as well, but never ended up getting produced.
It might actually work out this time, though, if they can't economically produce 4GB ones and are stuck with 2GB and 3GB for at least a while.
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
Just to follow up on my own post: according to the tests I've found (e.g. https://infohub.delltechnologies.co...using-low-rank-adaptation-lora-on-single-gpu/), memory usage scales less than 1:1 with batch size. Meaning, you don't need 4x the memory to increase the batch size by 4x. Also, increasing the batch size has a very large impact on performance, assuming you have the memory to support it. Based upon this and my basic understanding, Nvidia is pulling a bit of a fast one with their H200 to H100 comparison: they are arbitrarily limiting the H100 batch size to make the H200 performance increase look better than it is. It will be interesting to see a direct 1:1 comparison between MI300X and H200, as well as scalability.

Note: just pointing out again that I am learning on the fly here, so feel free to critique my findings.
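The less-than-1:1 scaling described above has a simple shape: inference memory is roughly a fixed cost for the weights plus a per-sequence cost for the KV cache, and the weights dominate. A toy sketch — all numbers are made up for a hypothetical 70B-class model, only the shape of the curve matters:

```python
WEIGHTS_GB = 140.0         # hypothetical: ~70B params at 2 bytes each (fp16)
KV_CACHE_GB_PER_SEQ = 2.0  # hypothetical per-sequence KV cache at some context length

def total_memory_gb(batch_size: int) -> float:
    """Rough inference memory: fixed weight cost plus KV cache per in-flight sequence."""
    return WEIGHTS_GB + batch_size * KV_CACHE_GB_PER_SEQ

for bs in (1, 4, 16):
    print(bs, total_memory_gb(bs))
# Quadrupling the batch (1 -> 4) needs only ~1.04x the memory in this toy model,
# which is the less-than-1:1 scaling the post describes.
```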
 

Hitman928

Diamond Member
Apr 15, 2012
5,225
7,736
136
ServeTheHome has some additional press slides with hard numbers. Based upon these, MI300X has higher performance than H100/H200 in every aspect, at least theoretically. I'm guessing AMD used the same batch sizes when they compared MI300X to H100 in their presentation, as the performance increase they showed correlates well with the numbers below. Had they limited H100's batch size as Nvidia did, MI300X would probably have been over 2x faster.

[Image: AMD Instinct MI300X vs. H100 spec comparison]


 

Saylick

Diamond Member
Sep 10, 2012
3,114
6,260
136
Just to follow up on my own post: according to the tests I've found (e.g. https://infohub.delltechnologies.co...using-low-rank-adaptation-lora-on-single-gpu/), memory usage scales less than 1:1 with batch size. Meaning, you don't need 4x the memory to increase the batch size by 4x. Also, increasing the batch size has a very large impact on performance, assuming you have the memory to support it. Based upon this and my basic understanding, Nvidia is pulling a bit of a fast one with their H200 to H100 comparison: they are arbitrarily limiting the H100 batch size to make the H200 performance increase look better than it is. It will be interesting to see a direct 1:1 comparison between MI300X and H200, as well as scalability.

Note: just pointing out again that I am learning on the fly here, so feel free to critique my findings.
Thanks for doing some research. As always with marketing slides, take with a grain of salt.
 

Saylick

Diamond Member
Sep 10, 2012
3,114
6,260
136
ServeTheHome has some additional press slides with hard numbers. Based upon these, MI300X has higher performance than H100/H200 in every aspect, at least theoretically:

[Image: AMD Instinct MI300X vs. H100 spec comparison]


That memory capacity and bandwidth advantage ought to be clutch, now that I know how Nvidia determined their H200 inference performance gain over H100.
 

jpiniero

Lifer
Oct 1, 2010
14,562
5,194
136
We'll have to see how many 3GB modules actually get made. 1.5GB ones were part of the GDDR6 spec as well, but never ended up getting produced.
It might actually work out this time, though, if they can't economically produce 4GB ones and are stuck with 2GB and 3GB for at least a while.

Micron's roadmap does suggest capacities beyond 3 GB in 2026...
 

Abwx

Lifer
Apr 2, 2011
10,922
3,413
136