Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,586
5,692
136
1655034287489.png
1655034259690.png

1655034485504.png

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around 3 quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, that is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper's problem was that no host CPU capable of PCIe 5 would be available in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts, MI100/200/300 cadence is impressive.

1655034362046.png

Previous thread on CDNA2 and RDNA3 here

 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,312
2,798
106
Navi 48 = 32 WGP, 64 CU, 48MB L3, GDDR7 memory, 192-bit memory bus, PCIe 5.0 x16?

Navi 44 = 16 WGP, 32 CU, 32MB L3, GDDR7 memory, 128-bit memory bus + PCIe 5.0 x8?
And this is from where?
I won't say it can't be true, but then 3 shader arrays per shader engine can't be true. It would need to be 2 SA/SE or 4 SA/SE.

BTW, bandwidth for N48 could be a problem even with 32 Gbps GDDR7.
RX 7800 XT: 64MB IC and 256-bit 19.5 Gbps -> 624 GB/s BW
N48: 48MB IC and 192-bit 32 Gbps -> 768 GB/s BW
That's only 23% more, and if you factor in the smaller IC, the effective BW gain is even smaller.
Yet the GPU has 6.7% more CUs and I personally expect at least 3GHz. So this BW doesn't look like enough to me.
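
For anyone who wants to check the numbers: peak bandwidth is just bus width times per-pin data rate, divided by 8 bits per byte. Below is a minimal C sketch of that arithmetic. The function names are made up for illustration, the N48 figures are the rumoured ones from this thread, and the 60 CU count for the RX 7800 XT is the shipping spec rather than something stated in the post.

```c
#include <stdio.h>

/* Peak DRAM bandwidth in GB/s:
   bus width (bits) * per-pin data rate (Gbps) / 8 bits per byte. */
double peak_bandwidth_gbs(unsigned bus_width_bits, double data_rate_gbps)
{
    return bus_width_bits * data_rate_gbps / 8.0;
}

/* How much larger b is than a, in percent. */
double percent_more(double a, double b)
{
    return (b / a - 1.0) * 100.0;
}

/* Prints the comparison from the post. The N48 inputs are rumours,
   not confirmed specifications. */
void print_n48_comparison(void)
{
    double rx7800xt = peak_bandwidth_gbs(256, 19.5); /* shipping card */
    double n48 = peak_bandwidth_gbs(192, 32.0);      /* rumoured config */

    printf("RX 7800 XT: %.0f GB/s\n", rx7800xt);
    printf("N48 (rumour): %.0f GB/s\n", n48);
    printf("Bandwidth gain: %.1f%%\n", percent_more(rx7800xt, n48));
    printf("CU gain (60 -> 64): %.1f%%\n", percent_more(60.0, 64.0));
}
```

This only covers raw DRAM bandwidth. The effect of the smaller Infinity Cache depends on hit rates, which are not public for N48, so it is left out.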
 

SolidQ

Member
Jul 13, 2023
164
195
76
And this is from where?
RedGamingTech; this one says ~7900 GRE.
MLID says mostly between the 7900 XT and 7900 XTX.
My personal opinion: 7900 XT+ is possible with a 256-bit bus and GDDR7.
That screen:
357e366bfda52991a7a533e60939a33a.png
 

tajoh111

Senior member
Mar 28, 2005
298
309
136
Navi 48 = 32 WGP, 64 CU, 48MB L3, GDDR7 memory, 192-bit memory bus, PCIe 5.0 x16?

Navi 44 = 16 WGP, 32 CU, 32MB L3, GDDR7 memory, 128-bit memory bus + PCIe 5.0 x8?
And this is from where?
I won't say it can't be true, but then 3 shader arrays per shader engine can't be true. It would need to be 2 SA/SE or 4 SA/SE.

BTW, bandwidth for N48 could be a problem even with 32 Gbps GDDR7.
RX 7800 XT: 64MB IC and 256-bit 19.5 Gbps -> 624 GB/s BW
N48: 48MB IC and 192-bit 32 Gbps -> 768 GB/s BW
That's only 23% more, and if you factor in the smaller IC, the effective BW gain is even smaller.
Yet the GPU has 6.7% more CUs and I personally expect at least 3GHz. So this BW doesn't look like enough to me.
If Navi 48 is indeed less than 250 mm2, as well as monolithic and made on 4nm, don't expect a big CU bump and clock speed bump. I think a similar CU count and 10% higher clocks, along with architecture improvements, would be more realistic. Something like this might yield a 20% increase in performance.

4nm is only 6% denser than 5nm and the performance characteristics aren't that much better either.

A 20% performance increase vs a 7800 XT would actually be a good result considering the modest die size regression and the modest manufacturing node improvement.

If you want to look at something with a similar increase in specs and a modest die shrink, look at Navi 23 vs Navi 33.

7nm to 6nm yields a greater transistor density improvement than 5nm to 4nm, and since the regression in die size will likely be similar, a 20% increase in performance would be great: Navi 33 gained only about 7% or 8% over Navi 23.
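
To put rough numbers on that estimate, here is a small C sketch. It assumes, purely for illustration, that performance scales linearly with CU count x clock x per-clock throughput (real GPUs scale worse than that) and that the whole die shrinks by the quoted density figure (SRAM and analog typically shrink less). Both functions and their names are hypothetical helpers, not anything from AMD or TSMC.

```c
/* First-order model (an assumption, not a measured relationship):
   performance ~ CU count * clock * per-clock throughput. */

/* Per-clock (architectural) gain needed to reach target_speedup,
   given the CU-count ratio and clock ratio versus the old part. */
double required_per_clock_gain(double target_speedup,
                               double cu_ratio,
                               double clock_ratio)
{
    return target_speedup / (cu_ratio * clock_ratio);
}

/* Area the same design would occupy on the older node when the newer
   node is density_gain times denser (1.06 for "6% denser"). */
double area_on_older_node_mm2(double area_mm2, double density_gain)
{
    return area_mm2 * density_gain;
}
```

With the same CU count (ratio 1.0) and 10% higher clocks (ratio 1.10), a 20% uplift needs about 9% from the architecture, since 1.20 / 1.10 is roughly 1.09. Likewise, a 250 mm2 die on 4nm corresponds to about 265 mm2 on 5nm under the 6% figure.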
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,312
2,798
106
If Navi 48 is indeed less than 250 mm2, as well as monolithic and made on 4nm, don't expect a big CU bump and clock speed bump. I think a similar CU count and 10% higher clocks, along with architecture improvements, would be more realistic. Something like this might yield a 20% increase in performance.
I wasn't talking about a big CU bump, but a significant clock bump.
It was already shown that RDNA3 was supposed to clock higher. So I don't think 3GHz is such overkill for RDNA4, when the cut-down N32 can already be OC'd past 3GHz. The full N32 overclocks worse; I think it's due to the power limit.

Let's be honest, as @PJVol showed, RDNA3 has a problem with power draw on its chiplet GPUs, so RDNA4, being monolithic, will save some power, which can be used for higher clocks.

You don't really need a big die increase for higher clocks. Just look at RDNA1 vs RDNA2: clock speeds were raised significantly despite the same process, and if you exclude Infinity Cache, die size didn't increase by much, and that increase can be attributed to the added RT support.

BTW, do you know which RDNA3 GPU is most efficient when you limit it to 60 FPS in Cyberpunk at Full HD? RX 7600 :D
power-vsync.png
 

Aapje

Golden Member
Mar 21, 2022
1,257
1,685
96
You don't really need a big die increase for higher clocks. Just look at RDNA1 vs RDNA2: clock speeds were raised significantly despite the same process, and if you exclude Infinity Cache, die size didn't increase by much, and that increase can be attributed to the added RT support.
It's the opposite. Smaller dies are easier to clock higher. They often reduce clocks of higher tier video cards. The 4080 runs at higher clocks than the 4090.

Anyway, I'm hoping that AMD at least is able to match the power efficiency improvements in Ada to a decent extent.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,312
2,798
106
It's the opposite. Smaller dies are easier to clock higher. They often reduce clocks of higher tier video cards. The 4080 runs at higher clocks than the 4090.
It's the opposite? What are you talking about?

BTW, the reason the 4090 has a lower clock is its power limit.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,312
2,798
106
The bigger the chip, the more weak spots you will have on the chip, that have to either be disabled or require the chip to run with a lower clock.
What I was actually asking about in my previous post was your response, because what you quoted from my post and your response are not about the same thing.
 

PJVol

Senior member
May 25, 2020
505
422
106
The bigger the chip, the more weak spots you will have on the chip, that have to either be disabled or require the chip to run with a lower clock
The frequency of a chip (made on a given process) is purely a design matter and depends only on the worst-case timing path.
And so many factors affect it that it makes no sense to draw conclusions from the previous uarch or generation.
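
As a toy illustration of the worst-case-path point: the clock period cannot be shorter than the slowest register-to-register path, so the maximum frequency is the reciprocal of the largest path delay. The C sketch below assumes the path delays are already known and ignores setup time, clock skew, voltage and temperature. The function name and the example delays are invented.

```c
#include <stddef.h>

/* Maximum clock in GHz for a set of path delays given in nanoseconds.
   The slowest (critical) path alone sets the limit.
   Returns 0.0 when no positive delay is supplied. */
double max_clock_ghz(const double *path_delay_ns, size_t count)
{
    double worst_ns = 0.0;

    for (size_t i = 0; i < count; ++i) {
        if (path_delay_ns[i] > worst_ns)
            worst_ns = path_delay_ns[i];
    }
    return worst_ns > 0.0 ? 1.0 / worst_ns : 0.0;
}
```

For example, with hypothetical delays of 0.25, 0.40 and 0.31 ns, the 0.40 ns path caps the design at 2.5 GHz no matter how fast the other two are.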
 

DisEnchantment

Golden Member
Mar 3, 2017
1,586
5,692
136
DVGPR = Dynamic VGPR allocation?
Seems some folks are right: SALU in GFX12 is similar to GFX11.5, with the addition of new 64-bit ops to handle new masks etc.

New VIMAGE ops, which are MALL aware.
Looks like MALL will be doing a lot of heavy lifting.

Interesting cache policies:
C:
  // Below are GFX12+ cache policy bits

  // Temporal hint
  TH = 0x7,      // All TH bits
  TH_RT = 0,     // regular
  TH_NT = 1,     // non-temporal
  TH_HT = 2,     // high-temporal
  TH_LU = 3,     // last use
  TH_RT_WB = 3,  // regular (CU, SE), high-temporal with write-back (MALL)
  TH_NT_RT = 4,  // non-temporal (CU, SE), regular (MALL)
  TH_RT_NT = 5,  // regular (CU, SE), non-temporal (MALL)
  TH_NT_HT = 6,  // non-temporal (CU, SE), high-temporal (MALL)
  TH_NT_WB = 7,  // non-temporal (CU, SE), high-temporal with write-back (MALL)
  TH_BYPASS = 3, // only to be used with scope = 3
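
A rough sketch of how these values could be read back out of a policy word, assuming the temporal hint sits in the low three bits (which is what the `TH = 0x7` mask suggests). The enum name, both helper functions and the name table are hypothetical and are not taken from LLVM. Note that value 3 has three names in the list above, and the snippet does not say which one applies when, so the lookup reports all three.

```c
/* Values copied from the list above; the enum name is made up here. */
enum th_policy {
    TH = 0x7,      /* mask covering all TH bits */
    TH_RT = 0,
    TH_NT = 1,
    TH_HT = 2,
    TH_LU = 3,
    TH_RT_WB = 3,
    TH_NT_RT = 4,
    TH_RT_NT = 5,
    TH_NT_HT = 6,
    TH_NT_WB = 7,
    TH_BYPASS = 3
};

/* Extract the temporal-hint field by masking with TH. */
unsigned temporal_hint(unsigned cpol)
{
    return cpol & TH;
}

/* Map the field to a printable name. Value 3 is ambiguous in the
   quoted list, so every alias is shown for it. */
const char *temporal_hint_name(unsigned cpol)
{
    static const char *const names[8] = {
        "TH_RT", "TH_NT", "TH_HT", "TH_LU/TH_RT_WB/TH_BYPASS",
        "TH_NT_RT", "TH_RT_NT", "TH_NT_HT", "TH_NT_WB"
    };
    return names[temporal_hint(cpol)];
}
```

Any bits above the low three (scope, for instance) are simply masked off here, since their layout is not shown in the quoted snippet.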
 

Saylick

Diamond Member
Sep 10, 2012
3,048
6,090
136
Hmm, looks like RDNA 4 is going more and more towards compiler-based scheduling, which mirrors Nvidia's pivot away from hw-based scheduling since Kepler. It's ironic because AMD itself pivoted away from a compiler-based scheduler in Terascale towards a hw-scheduler in GCN. I guess what's new is old, and what's old is new.

Screenshot_2023-12-05-18-47-21-74_40deb401b9ffe8e1df2f1cc5ba480b12.jpg
 

Abwx

Lifer
Apr 2, 2011
10,822
3,286
136
A 32 CU GPU is unlikely; that's the same CU count as the RX 7600, and that GPU was undersized relative to what the market wanted.

A 40 CU GPU would have been far more relevant, since it would have commanded a higher price while still allowing a cut-down 32-36 CU version for harvesting purposes.

Keeping the same CU count for a next gen that is supposed to perform significantly better would be quite a blunder, technically and financially.
 

Abwx

Lifer
Apr 2, 2011
10,822
3,286
136
Those are tiny mainstream parts.

Too tiny to make sense. There's a perf threshold that makes your product either successful or just a marginal solution, and I'm not even talking about sales, where it is overwhelmed by competing cards that are just a little above it, or even by the RX 6600, which apparently sells much better, so it wasn't worth the effort.
 

soresu

Platinum Member
Dec 19, 2014
2,530
1,715
136
Too tiny to make sense
Given it matches the max rumoured CU count of Strix Halo it does sound odd.

Unless perhaps they will also make a semi custom chiplet APU with RDNA4 instead of RDNA3.5 at some point in 2025.

Basically Strix Halo with a GPU µArch upgrade.
 

Abwx

Lifer
Apr 2, 2011
10,822
3,286
136
Given it matches the max rumoured CU count of Strix Halo it does sound odd.

Unless perhaps they will also make a semi custom chiplet APU with RDNA4 instead of RDNA3.5 at some point in 2025.

Basically Strix Halo with a GPU µArch upgrade.

That's not even enough compared with a 12 CU APU that is 178mm2, let alone one that has 32 CUs as well. Indeed, German e-tailer Mindfactory's sales numbers indicate that the 7600 lacks the performance to be competitive; at 40 CUs it would have sold much better, while at 32 it's buried within a plethora of offerings.

In this segment people get either an RX 6600, which is cheaper and doesn't perform that much worse, or an RTX 3060/4060/4060 Ti, which are either price-competitive (the weakest) or somewhat faster (the others).

 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,312
2,798
106
Too tiny to make sense. There's a perf threshold that makes your product either successful or just a marginal solution, and I'm not even talking about sales, where it is overwhelmed by competing cards that are just a little above it, or even by the RX 6600, which apparently sells much better, so it wasn't worth the effort.
The number of CUs is not a problem per se.
It's only a problem if clocks or IPC stay the same; that was the problem with the RX 7600, along with the small benefit from dual-issue.

If this 32 CU RDNA GPU allows a 25% higher clock speed, 2655MHz * 1.25 = ~3319MHz, would you care that it has only 32 CUs?
I wouldn't, if it is sold for a reasonable price and with more than just 8GB of VRAM.

P.S. But considering Strix Halo has 40 CUs, I would expect at least the same number for the weaker chip.

edit:
Maybe it's not impossible to pack 40 CUs into a 160mm2 chip, and the bigger one will be 250mm2.
I mean 40 CU, 2560 SP, 160 TMU, 64 ROPs, 32MB, 128-bit GDDR7 for the smaller one.
Clock it to 3.5GHz and you should be close to the 7700 XT, of course with 128-bit 30-32 Gbps GDDR7. :D
BTW, I got ~220mm2 for an RX 7600 with 40 CUs on 6nm.
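
A quick C sketch of the raw-throughput comparison behind this. It counts 64 shaders per CU (2560 SP / 40 CU, as listed above) and 2 FLOPs per FMA, and ignores dual-issue. The RX 7700 XT inputs (54 CUs, 2544 MHz boost) are its published specs and are not taken from this thread; the 40 CU / 3.5GHz part is pure speculation. Function names are made up.

```c
/* Clock after applying a relative uplift, e.g. 1.25 for +25%. */
double scaled_clock_mhz(double base_mhz, double uplift)
{
    return base_mhz * uplift;
}

/* Peak FP32 throughput in TFLOPS:
   CUs * 64 lanes * 2 FLOPs per FMA * clock (GHz) / 1000.
   Dual-issue is deliberately not counted. */
double peak_fp32_tflops(unsigned cu_count, double clock_ghz)
{
    return cu_count * 64.0 * 2.0 * clock_ghz / 1000.0;
}
```

`scaled_clock_mhz(2655.0, 1.25)` gives 3318.75 MHz. `peak_fp32_tflops(40, 3.5)` comes to about 17.9 TFLOPS against about 17.6 for `peak_fp32_tflops(54, 2.544)`, so on paper the two are close, though paper FLOPS say nothing about bandwidth or cache limits.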
 