Discussion RDNA4 + CDNA3 Architectures Thread

Page 4

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,810
136
With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
AMD usually takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, though, the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Govt is starting to prepare the SW environment for El Capitan early (perhaps to avoid a slow bring-up situation like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts, MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 

Mopetar

Diamond Member
Jan 31, 2011
7,842
5,993
136
No FP4 or INT4?

I thought that it did have INT4, but I can't recall if that's just support or native hardware. Technically anything will support INT4 using the hardware it has, but there wouldn't be any speed up.

Is FP4 actually a thing? It seems incredibly pointless no matter how you try to arrange it. If it's signed, you'd only have two bits for either the exponent or the mantissa, and the other would get a single bit.

16-bit floating point (at least with the bfloat format) still gives a range essentially as large as FP32, so it's still useful as long as you don't care as much about precision. Anything below that runs into the same issues that make the IEEE FP16 format less desirable.
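For what it's worth, the range claim is easy to see in code: bfloat16 is just float32 with the mantissa truncated to 7 bits, keeping all 8 exponent bits. A minimal sketch in plain Python bit-twiddling (not any particular library's API):

```python
import struct

def to_bfloat16(x: float) -> float:
    # Reinterpret as float32, then keep only the top 16 bits:
    # 1 sign bit + 8 exponent bits + 7 mantissa bits (truncate toward zero).
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bfloat16(3.141592653589793))  # 3.140625 - precision lost
print(to_bfloat16(1e38))               # still finite - range kept
```

IEEE FP16, with only 5 exponent bits, overflows to infinity around 65504, which is why bfloat16 tends to win when range matters more than precision.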
 

MrTeal

Diamond Member
Dec 7, 2003
3,569
1,699
136
I thought that it did have INT4, but I can't recall if that's just support or native hardware. Technically anything will support INT4 using the hardware it has, but there wouldn't be any speed up.

Is FP4 actually a thing? It seems incredibly pointless no matter how you try to arrange it. If it's signed, you'd only have two bits for either the exponent or the mantissa, and the other would get a single bit.

16-bit floating point (at least with the bfloat format) still gives a range essentially as large as FP32, so it's still useful as long as you don't care as much about precision. Anything below that runs into the same issues that make the IEEE FP16 format less desirable.
The little reading I've done indicated the best results are actually with a sign bit, 3 exponent bits, and no bits for the mantissa, but I think FP4 is more of a research format than something actually used at this time.
 

Mopetar

Diamond Member
Jan 31, 2011
7,842
5,993
136
The little reading I've done indicated the best results are actually with a sign bit, 3 exponent bits, and no bits for the mantissa, but I think FP4 is more of a research format than something actually used at this time.

Assuming you wanted FP4 to be able to store special values (infinities and 0), you'd effectively lose one bit of that exponent, and you'd get positive or negative numbers in the set {∞, 16, 8, 4, 2, 0, .5, .25}, assuming I've calculated it correctly, which I'll admit I may not have. There might be some issues due to not having a mantissa, as I think some special values like NaN are represented in a specific way using those bits.
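That set is easy to double-check by enumerating the encodings. A quick sketch assuming one plausible layout: 1 sign bit, 3 exponent bits, no mantissa, bias 3, with the all-zeros code as zero and all-ones reserved for infinity. All of those are assumptions (no actual FP4 spec is being quoted here), and a different bias shifts the whole set:

```python
import math

def fp4_e3m0_values(bias=3):
    # Positive values of a hypothetical 1-sign/3-exponent/0-mantissa format.
    # Exponent code 0 encodes zero, code 7 is reserved for infinity,
    # and codes 1..6 encode 2 ** (code - bias).
    values = {0.0, math.inf}
    for code in range(1, 7):
        values.add(2.0 ** (code - bias))
    return sorted(values)

print(fp4_e3m0_values())  # [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, inf]
```

With bias 3 the positive finite values run 0.25 through 8; a smaller bias would reach 16 but lose 0.25, so the exact set hinges on that choice.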

It seems like you could probably find a clever way of using INT4 to accomplish the same thing you'd be trying to do with such a small floating point number. I suppose having an infinity value might be useful as INT4 overflow could cause some nasty issues.
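For illustration, the usual "clever way" this is done in practice is linear quantization: store INT4 codes plus one shared float scale, and clamp rather than overflow. A minimal sketch (the names here are made up for the example, not any real API):

```python
def quantize_int4(xs):
    # Map floats to signed 4-bit codes in [-8, 7] with a shared scale.
    scale = max(abs(x) for x in xs) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    codes = [max(-8, min(7, round(x / scale))) for x in xs]
    return codes, scale

def dequantize_int4(codes, scale):
    return [c * scale for c in codes]

codes, scale = quantize_int4([0.1, -0.5, 0.9, 0.25])
print(codes)                          # [1, -4, 7, 2]
print(dequantize_int4(codes, scale))  # roughly recovers the inputs
```

Clamping to [-8, 7] sidesteps the nasty INT4 overflow issue at the cost of saturating large values, which is usually the acceptable trade-off.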
 

Joe NYC

Golden Member
Jun 26, 2021
1,962
2,294
106
Semianalysis has an article on MI300. Enjoy!
Interesting info. MLID got nearly everything right in his original leak.

BTW, I am most curious about the status: when it is ready, when it is shipping, etc. There were a number of contradictory rumors/statements: Q4 2023, Q1 2024. And now Dylan adds another one - that it is shipping now and ramping in Q3.
 

Saylick

Diamond Member
Sep 10, 2012
3,170
6,398
136
Interesting info. MLID got nearly everything right in his original leak.

BTW, I am most curious about the status: when it is ready, when it is shipping, etc. There were a number of contradictory rumors/statements: Q4 2023, Q1 2024. And now Dylan adds another one - that it is shipping now and ramping in Q3.
I suspect you'll get a better answer to those questions tomorrow at their AI event.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
Semianalysis has an article on MI300. Enjoy!
A marvel of engineering, yes. But that doesn't help if it takes a team of expert devs to get the software running on it.

Nvidia has the market share because of software. And because you can just prototype things on your company-provided Windows laptop. Good luck trying to get ROCm running on Windows: the only chance is via WSL, and even then it's hit and miss. Plus, AFAIK normal laptop GPUs aren't even supported by ROCm. No engineering marvel will fix that for AMD to get widespread adoption outside of supercomputers.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,666
136
Good luck trying to get ROCm running on Windows.
Seriously though who would want that on a professional production machine? I understand that Nvidia has the advantage of allowing client GPUs using gaming OSes to use professional grade software, but experts would never use either in production.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Well, that was a damp squib. :confused:
Doesn't seem like much info on the package construction. They basically just confirmed the MI300A and MI300X products and that is about it. I skipped around a bit, but I don't think there was any mention of a cpu only version, but they have to connect a gpu only version to a cpu somehow. Would they have a mixed board with an SH5 and an SP5 socket?

With 896 GB/s infinity fabric bandwidth, would that be 4 x16 links at ~56 GB/s per base die?
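The numbers do divide evenly under that guess. Assuming four base dies with four x16 links each (my reading of the question, not a confirmed topology):

```python
# Back-of-envelope check on the quoted aggregate Infinity Fabric bandwidth.
total_bw_gbs = 896      # GB/s, figure quoted above
base_dies = 4           # assumption
links_per_die = 4       # assumption: 4 x16 links per base die
per_link_gbs = total_bw_gbs / (base_dies * links_per_die)
print(per_link_gbs)  # 56.0 GB/s per x16 link
```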
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Was there any good reason to be excited about CDNA3? Sorry, didn't follow this thread too closely and kinda glossed over the enterprise/data center talk.
If you aren’t a victim of Nvidia vendor lock-in practices, then there is reason to be interested. Also, you obviously have to be interested in such hardware in the first place. This isn’t a gaming GPU.

I am more interested in the packaging tech and how it could be applied elsewhere. I use systems with Nvidia GPUs and AMD CPUs at the moment. We may have to make a decision to port to ARM or drop cuda eventually. I suspect that nvidia’s arm cores will be rather weak compared to Zen 4 or Zen 5, so porting to them wouldn’t just be x86 to ARM, it would also probably mean making the code much more multi-threaded, which is not simple.

I have wondered if unified memory will reduce the porting effort significantly, if a lot of the low level memory management stuff goes away. I am not a gpu programmer though.
 

Joe NYC

Golden Member
Jun 26, 2021
1,962
2,294
106
Doesn't seem like much info on the package construction. They basically just confirmed the MI300A and MI300X products and that is about it. I skipped around a bit, but I don't think there was any mention of a cpu only version, but they have to connect a gpu only version to a cpu somehow. Would they have a mixed board with an SH5 and an SP5 socket?

With 896 GB/s infinity fabric bandwidth, would that be 4 x16 links at ~56 GB/s per base die?
Really nothing new was said, other than MI300X being a drop-in replacement for H100, and confirmation of some widely leaked info.

It seems this was more of a Bergamo and Genoa-X launch event, and another MI300 tease event. The actual launch of MI300 - maybe sometime late Q3, early Q4 - will have the more detailed technical info we were looking for.
 

Joe NYC

Golden Member
Jun 26, 2021
1,962
2,294
106
Would they have a mixed board with an SH5 and an SP5 socket?
MI300X actually uses an OCP socket.

So like a lot of the AI systems out there featuring 2 CPUs and 8 GPUs, the AMD platform will have 2 Genoa CPUs, 8 OCP MI300X GPUs, and no SH5 socket.

MI300A and MI300C will be SH5 socket.
 

Ajay

Lifer
Jan 8, 2001
15,458
7,862
136
Really nothing new was said, other than MI300X being a drop-in replacement for H100, and confirmation of some widely leaked info.

It seems this was more of a Bergamo and Genoa-X launch event, and another MI300 tease event. The actual launch of MI300 - maybe sometime late Q3, early Q4 - will have the more detailed technical info we were looking for.
It’s not a 'drop in' solution as Nvidia uses a different connector and fabric which are proprietary, AFAIK. Also, not CUDA, which isn’t a problem for supercomputers, but will be for workstations and small server setups.
 

DeathReborn

Platinum Member
Oct 11, 2005
2,746
741
136
It’s not a 'drop in' solution as Nvidia uses a different connector and fabric which are proprietary, AFAIK. Also, not CUDA, which isn’t a problem for supercomputers, but will be for workstations and small server setups.
Yea, this isn't a drop-in replacement at all; it's an almost complete start from scratch in both HW & SW.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
Seriously though who would want that on a professional production machine? I understand that Nvidia has the advantage of allowing client GPUs using gaming OSes to use professional grade software, but experts would never use either in production
Not for production, but for development. If you can't do it on your laptop, you need yet another infrastructure/server for development, which means higher cost.
 

Joe NYC

Golden Member
Jun 26, 2021
1,962
2,294
106
It’s not a 'drop in' solution as Nvidia uses a different connector and fabric which are proprietary, AFAIK. Also, not CUDA, which isn’t a problem for supercomputers, but will be for workstations and small server setups.

From a hardware POV, it is the same form factor: OAM.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,666
136
Not for production, but for development. If you can't do it on your laptop, you need yet another infrastructure/server for development, which means higher cost.
Let's remember that Nvidia's approach ensured that gaming hardware was perfectly usable for mining, with the predictable results on the market. Current development on ROCm appears to imply AMD intends to support up to last-gen consumer graphics in ROCm, while the current gen will have to wait.