Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project

github.com

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix

www.phoronix.com

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5 being available in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

Timorous · Apr 27, 2024

adroc_thurston said:
Just RDNA4 things really.

Things we will see in N44/N48 or things that were unique to the stacked parts?

Joe NYC · Apr 27, 2024

adroc_thurston said:
It was above 200.
A fair bit actually.

Oh no!

adroc_thurston · Apr 27, 2024

Joe NYC said:
Oh no!

nontent man strikes again

maddie · Apr 27, 2024

adroc_thurston said:
nontent man strikes again

Now this post is definitely not getting shown.

Mahboi · Apr 27, 2024

Joe NYC said:
Oh no!

RGT

I mean, let's look at the positive.
Since they all go and use YT servers to read the forums, at least the forums get less pressure.
(yesterday was pretty awful)

Mahboi · Apr 27, 2024

Wait hold on here, his video is a goldmine of an IQ test.

600% increase in RT to be on par with Blackwell.
You heard it at RGT's first!

"Intel is just better as Alchemist has proven"
"AI doesn't care about latency so just kill the latency with your latency heavy chiplets"

Someone who as a real 6900xt that really exists and as no reason to buy anything but Nvidia but as a 6900 xt because here is no reason to buy high end AMD.
He could stand to ave a english lesson too.

Before someone says "it's lowbrow to pick and mock them", hey they started picking things here first, it's only fair. And funny.

CouncilorIrissa · Apr 27, 2024

Mahboi said:
"Intel is just better as Alchemist has proven"

It only requires twice the die size to run at the same perf as the 7600. And only in games that it can actually run. kek

Mahboi · Apr 27, 2024

CouncilorIrissa said:
It only requires twice the die size to run at the same perf as the 7600. And only in games that it can actually run. kek

Exactly! Total success and absolute proof of superiority over the AMDumbs.

CouncilorIrissa · Apr 27, 2024

Joe NYC said:
Oh no!

View attachment 97973

Screenshots of certain b3d threads coming up next.

gaav87 · Apr 27, 2024

Timorous said:
Shader engines. N44 has 2, N48 has 4. The code shown has SEs 0 through 8, for 9 SEs, which would be a 144 CU design (or 180 if they had the same layout as N32) and would fit with the 3 GCDs of 3 SEs each that was rumoured for N4C ages ago.

That's interesting.
But you can't be sure N48 has 4, as recent leaks pointed to it having 50 TFLOPs of FP32.
9 (vector units) x 32 (wide) x 64 CU x clock speed (MHz) / 1,000,000 = 50
That means a boost clock of ~2713 MHz

vs (RDNA3 design with dual issue)
4 (VU) x 2 x 32 (wide) x 64 CU x clock speed (MHz) / 1,000,000 = 50
~3050 MHz boost clock

I would lean towards a ~2700 MHz reference boost clock (200-300 MHz higher than RDNA3) over 3050 MHz (a crazy clock speed).
Also, this would mean they ditched dual-issue FP32 and went with a design similar to Intel BMG, with more thread count, for literally 2x 6950 XT compute performance.
But idk how they managed to fit that in ~255 mm², so maybe I am talking BS.

All_The_Watts leaked:
N48
+50TFLOPs fp32
32
64
256
693
2770
~240 mm²
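
For anyone who wants to check the clock arithmetic above, here is a minimal Python sketch. It follows the post's own formula (per-CU throughput x CU count x clock). The function name `required_clock_mhz` and the per-CU figures are illustrative assumptions taken from the two hypotheses in the post, not confirmed N48 specs.

```python
def required_clock_mhz(target_tflops, flops_per_cu_per_clock, cu_count):
    """Clock (in MHz) needed to reach target_tflops with the given layout."""
    flops_per_clock = flops_per_cu_per_clock * cu_count
    # 1 TFLOP/s is 1e6 MFLOP/s, and FLOPs-per-clock times MHz gives MFLOP/s.
    return target_tflops * 1e6 / flops_per_clock

# Hypothesis 1 (assumption from the post): 9 vector units x 32 lanes, no dual issue.
wide_clock = required_clock_mhz(50, 9 * 32, 64)

# Hypothesis 2 (assumption from the post): RDNA3-style 4 units x 2 (dual issue) x 32 lanes.
dual_issue_clock = required_clock_mhz(50, 4 * 2 * 32, 64)

print(f"9 x 32 layout:   ~{wide_clock:.0f} MHz")
print(f"dual-issue layout: ~{dual_issue_clock:.0f} MHz")
```

The two results land close to the ~2713 MHz and ~3050 MHz figures quoted in the post, so the disagreement is purely about which per-CU layout is assumed.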

Timorous · Apr 27, 2024

gaav87 said:
That's interesting.
But you can't be sure N48 has 4, as recent leaks pointed to it having 50 TFLOPs of FP32.
9 (vector units) x 32 (wide) x 64 CU x clock speed (MHz) / 1,000,000 = 50
That means a boost clock of ~2713 MHz

vs (RDNA3 design with dual issue)
4 (VU) x 2 x 32 (wide) x 64 CU x clock speed (MHz) / 1,000,000 = 50
~3050 MHz boost clock

I would lean towards a ~2700 MHz reference boost clock (200-300 MHz higher than RDNA3) over 3050 MHz (a crazy clock speed).
Also, this would mean they ditched dual-issue FP32 and went with a design similar to Intel BMG, with more thread count, for literally 2x 6950 XT compute performance.
But idk how they managed to fit that in ~255 mm², so maybe I am talking BS.

All_The_Watts leaked:
N48
+50TFLOPs fp32
32
64
256
693
2770
~240 mm²

2770 is the effective bandwidth if you account for Infinity Cache. Or it was: if the memory spec is 18 Gbps, then both the bandwidth and the effective bandwidth drop from those rumoured figures.

branch_suggestion · Apr 27, 2024

gaav87 said:
That's interesting.
But you can't be sure N48 has 4, as recent leaks pointed to it having 50 TFLOPs of FP32.
9 (vector units) x 32 (wide) x 64 CU x clock speed (MHz) / 1,000,000 = 50
That means a boost clock of ~2713 MHz

vs (RDNA3 design with dual issue)
4 (VU) x 2 x 32 (wide) x 64 CU x clock speed (MHz) / 1,000,000 = 50
~3050 MHz boost clock

I would lean towards a ~2700 MHz reference boost clock (200-300 MHz higher than RDNA3) over 3050 MHz (a crazy clock speed).
Also, this would mean they ditched dual-issue FP32 and went with a design similar to Intel BMG, with more thread count, for literally 2x 6950 XT compute performance.
But idk how they managed to fit that in ~255 mm², so maybe I am talking BS.

All_The_Watts leaked:
N48
+50TFLOPs fp32
32
64
256
693
2770
~240 mm²

But it is ((128ALU*64CU*3050Mhz)*2)/1000000 = ~50TF
RDNA3 missed clock targets at TBP significantly. 64CU only makes sense for 2 or 4SE, you will see the 2SE version more or less with the PS5 Pro.
Not much longer now until we see the true clock speeds realised.

gaav87 · Apr 28, 2024

branch_suggestion said:
But it is ((128ALU*64CU*3050Mhz)*2)/1000000 = ~50TF
RDNA3 missed clock targets at TBP significantly. 64CU only makes sense for 2 or 4SE, you will see the 2SE version more or less with the PS5 Pro.
Not much longer now until we see the true clock speeds realised.

I'm talking about the most basic, fundamental level. AMD executes workloads in groups of 32 threads (32 wide), each capable of the same result calculation on port0 + port1 (dual issue).
16 wave per SIMD (VALU)
4 SIMD per WGP

(1 (INT32 op per clock, ADD) x 16 (wave) x 4 (SIMD) x 64 CU x 3050 MHz) x 2 (dual issue) = ~25 TOPs INT32
(2 (FP32 FLOPs per clock, MAD) x 16 (wave) x 4 (SIMD) x [32 WGP (2 CU per WGP) = 64 CU] x 3050 MHz) x 2 (dual issue) = ~50 TFLOPs FP32
= 4096 (VALU)
Each VALU can do 2 FP32 FLOPs per clock or 1 INT32 op per clock.

They either
- increased the wave count from 16 to 32
- made the SIMD 64 wide instead of 32 (similar to GCN)
- increased WGPs per shader engine from 8 to 9
- got rid of dual issue on port0 and port1
All of this points to a ~2700-2750 MHz boost

Or they left it as in RDNA3 and the clock is 3050 MHz, but that boost clock is too high to be likely.

Or the 50 TFLOPs leak is just BS
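
The forward calculation in this post can be written out as a short Python sketch. It uses the post's own factors (16 x 4 lanes per CU, 64 CUs, 3050 MHz, x2 for dual issue). The helper name `peak_tera_ops` is made up for illustration, and none of these figures are confirmed RDNA4 specifications.

```python
def peak_tera_ops(ops_per_lane, lanes_per_cu, cu_count, clock_mhz, dual_issue=True):
    """Peak throughput in tera-operations per second for the given layout."""
    issue_factor = 2 if dual_issue else 1
    # ops-per-clock times MHz gives mega-ops per second; divide by 1e6 for tera-ops.
    return ops_per_lane * lanes_per_cu * cu_count * clock_mhz * issue_factor / 1e6

LANES_PER_CU = 16 * 4   # the post's "16 x 4" factor (assumption carried over as-is)
CU_COUNT = 64           # 32 WGPs at 2 CUs per WGP
CLOCK_MHZ = 3050

valu_lanes = LANES_PER_CU * CU_COUNT
fp32_tflops = peak_tera_ops(2, LANES_PER_CU, CU_COUNT, CLOCK_MHZ)  # MAD counts as 2 FLOPs
int32_tops = peak_tera_ops(1, LANES_PER_CU, CU_COUNT, CLOCK_MHZ)   # ADD counts as 1 op

print(f"VALU lanes: {valu_lanes}")
print(f"FP32: ~{fp32_tflops:.1f} TFLOPs, INT32: ~{int32_tops:.1f} TOPs")
```

Passing `dual_issue=False` halves both results, which is why the no-dual-issue alternatives in the list need extra width, extra units, or more WGPs to reach the same 50 TFLOPs at a lower clock.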

branch_suggestion · Apr 28, 2024

gaav87 said:
Or they left it as in RDNA3 and the clock is 3050 MHz, but that boost clock is too high to be likely.

Why do you think that?
You can up the power limit on any RDNA3 part right now and easily go over 3Ghz in games, so it really isn't that far fetched. Also look at RDNA3 slideware, they talked about 3Ghz+ many times.

gaav87 · Apr 28, 2024

branch_suggestion said:
Why do you think that?
You can up the power limit on any RDNA3 part right now and easily go over 3Ghz in games, so it really isn't that far fetched. Also look at RDNA3 slideware, they talked about 3Ghz+ many times.

The leak stated 50+ TFLOPs FP32 at 230 W TBP.

branch_suggestion · Apr 28, 2024

gaav87 said:
The leak stated 50+ TFLOPs FP32 at 230 W TBP.

Which is not unrealistic, it is a new uArch and has a nodelet bump.
Throw in RDNA3 being a low barrier to beat due to clocks eating power and it is possible.
Scale off Ada and it also checks out.

CouncilorIrissa · Apr 28, 2024

https://twitter.com/x/status/1784561456694046744

Quite sad to see the N40 and N41 canned. Looks like an ambitious design to say the least.

Mahboi · Apr 28, 2024

Navi WGPs Bus GDDR Generation

Wut?
The second column has to be workgroups.
RDNA 4 aimed at 144 WGPs and 5 for more than that? And more than 384 bit bus?

gaav87 · Apr 28, 2024

branch_suggestion said:
Which is not unrealistic, it is a new uArch and has a nodelet bump.
Throw in RDNA3 being a low barrier to beat due to clocks eating power and it is possible.
Scale off Ada and it also checks out.

We do not know that. But I think it is more likely that AMD increased per-SIMD matrix multiplication throughput than that they hit a 3050 MHz clock. Maybe they changed the SIMD width, or increased the SIMD count per WGP while leaving wave slots at 16, or lowered wave slots even further (they went from 20 to 16 from RDNA2 to RDNA3), but each SIMD can operate 32 slots. This would make sense given that they introduced SWMMAC. Lowering wave slots would lower occupancy, lowering latency and increasing performance.

TESKATLIPOKA · Apr 28, 2024

I find it unrealistic that N41, with 3x more WGPs, was planned with the same bus width as N48.
Yes, it uses GDDR7, but that will provide ~50-60% more BW compared to GDDR6.

Tigerick · Apr 28, 2024

CouncilorIrissa said:
https://twitter.com/x/status/1784561456694046744

Quite sad to see the N40 and N41 canned. Looks like an ambitious design to say the least.

Let's try to combine RDNA4 and RDNA5 in the table below:

Codename	Base Dies	WGP	CU	Memory Type	Memory Config	Memory Speed	Memory BW
RDNA4
N44	1	16	32	GDDR6	16GB 128-bit	18 Gbps	288 GB/s
N48	1	32	64	GDDR6	16GB 256-bit	18 Gbps	576 GB/s

RDNA5
N52	2	96	192	GDDR7	16GB 256-bit	32 Gbps	1024 GB/s
N51	3	144	288	GDDR7	24GB 384-bit	32 Gbps	1536 GB/s
N50	4	192	384	GDDR7	32GB 512-bit	32 Gbps	2048 GB/s

I assume the upcoming N51 and N52 are replacing N41 and N42 with an updated design and architecture. I also assume N52 uses two base dies with a 128-bit GDDR7 memory bus each. Will update if any new info appears...
There are power and bandwidth penalties with a chiplet design. That means the game clock might not be able to go as high as on a monolithic design. That's why I expect AMD will use 32 Gbps GDDR7.
That might explain why AMD is pairing 3 times the CUs with 78% extra memory bandwidth. Also remember AMD might be binning the die for N52.
Remember the upcoming RTX 5090 comes with a 512-bit memory bus as well, and that's the target of N50. The RTX 5090 comes with 192 SMs, which is similar to N50's 192 WGPs. Of course, NV and AMD might not be using the full die, but the similarity between the SM and WGP counts is interesting.
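
The memory bandwidth column in the table follows from bus width and per-pin speed, which a few lines of Python can confirm. This is only a sketch: the helper name `bandwidth_gb_s` is invented here, and the bus widths and pin speeds are the speculative figures from the table, not confirmed specs.

```python
def bandwidth_gb_s(bus_width_bits, pin_speed_gbps):
    """Peak memory bandwidth in GB/s: one pin per bus bit, 8 bits per byte."""
    return bus_width_bits * pin_speed_gbps / 8

# (bus width in bits, per-pin speed in Gbps), copied from the speculative table.
parts = {
    "N44": (128, 18),
    "N48": (256, 18),
    "N52": (256, 32),
    "N51": (384, 32),
    "N50": (512, 32),
}

bw = {name: bandwidth_gb_s(width, speed) for name, (width, speed) in parts.items()}
extra_n52_vs_n48 = bw["N52"] / bw["N48"] - 1

for name, value in bw.items():
    print(f"{name}: {value:.0f} GB/s")
print(f"N52 vs N48: {extra_n52_vs_n48:.0%} more bandwidth")
```

With the same 256-bit bus, the whole gain from N48 to N52 comes from the step from 18 Gbps to 32 Gbps, which is where the 78% figure comes from.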

SolidQ · Apr 28, 2024

Tigerick said:
RTX5090 comes with 192 SM which is similar to N50's 192 WGP. Of course,

Not similar
RX 7600 16WGP - RTX 4060 24SM
RX 7800XT 30WGP - RTX 4070 46SM/4070Super 56SM
RX 7900XT 42WGP - RTX 4070ti 60SM/RTX 4070ti Super 66SM
RX 7900XTX 48WGP - RTX 4080 76SM/RTX 4080super 80SM
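
To put a number on the "not similar" point, here is a small Python sketch that computes the SM-to-WGP ratio for each pairing in the list. The pairings are the ones given in this post, and for rows with two NVIDIA parts only the first one is used, which is an arbitrary choice made for this illustration.

```python
# (AMD part, WGP count, NVIDIA part, SM count), taken from the list above.
pairs = [
    ("RX 7600", 16, "RTX 4060", 24),
    ("RX 7800 XT", 30, "RTX 4070", 46),
    ("RX 7900 XT", 42, "RTX 4070 Ti", 60),
    ("RX 7900 XTX", 48, "RTX 4080", 76),
]

# SMs per WGP for each pairing of roughly comparable cards.
ratios = {amd: sm / wgp for amd, wgp, _nvidia, sm in pairs}

for (amd, wgp, nvidia, sm) in pairs:
    print(f"{amd} ({wgp} WGP) vs {nvidia} ({sm} SM): {ratios[amd]:.2f} SMs per WGP")
```

In every pairing the NVIDIA card carries noticeably more SMs than the AMD card has WGPs, so equal SM and WGP counts would not imply equal-class parts on these numbers.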

Tigerick · Apr 28, 2024

SolidQ said:
Not similar
RX 7600 16WGP - RTX 4060 24SM
RX 7800XT 30WGP - RTX 4070 46SM/4070Super 56SM
RX 7900XT 42WGP - RTX 4070ti 60SM/RTX 4070ti Super 66SM
RX 7900XTX 48WGP - RTX 4080 76SM/RTX 4080super 80SM

Ho, you should check GB202's SM amount. And if my speculation is correct, upcoming GB203 should come with 144SM and GB205 should come with 80SM. Go figure...

branch_suggestion · Apr 28, 2024

CouncilorIrissa said:
https://twitter.com/x/status/1784561456694046744

Quite sad to see the N40 and N41 canned. Looks like an ambitious design to say the least.

16WGP per SED, I wonder the SA config.

gaav87 said:
We do not know that.

It is the first GPU in a long time that has >20% perf headroom with pushing power beyond the design power. The thing is broken.

gaav87 said:
But I think it is more likely that AMD increased per-SIMD matrix multiplication throughput than that they hit a 3050 MHz clock. Maybe they changed the SIMD width, or increased the SIMD count per WGP while leaving wave slots at 16, or lowered wave slots even further (they went from 20 to 16 from RDNA2 to RDNA3), but each SIMD can operate 32 slots. This would make sense given that they introduced SWMMAC. Lowering wave slots would lower occupancy, lowering latency and increasing performance.

Well RDNA4 does aim to simplify the WGP in clever ways to make it more spammable. That would only make clocks go even higher, rather than increasing per unit throughput.

Tigerick said:
Ho, you should check GB202's SM amount. And if my speculation is correct, upcoming GB203 should come with 144SM and GB205 should come with 80SM. Go figure...

Not a chance in a million years that a BW SM will be equivalent to an RDNA4/5 WGP.
Stop pretending that you know anything, I grow tired of it


Tigerick · Apr 28, 2024

branch_suggestion said:
16WGP per SED, I wonder the SA config.

It is the first GPU in a long time that has >20% perf headroom with pushing power beyond the design power. The thing is broken.

Well RDNA4 does aim to simplify the WGP in clever ways to make it more spammable. That would only make clocks go even higher, rather than increasing per unit throughput.

Not a chance in a million years that a BW SM will be equivalent to an RDNA4/5 WGP.
Stop pretending that you know anything, I grow tired of it


Then ignore me, thank you
