Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, there are a lot of commits. Maybe it's because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in the LLVM review chains (before things get merged to GitHub), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of not having a host CPU capable of PCIe 5 in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts; the MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 
Last edited:

Mahboi

Golden Member
Apr 4, 2024
1,033
1,897
96
This effect applies even more in industries with huge barriers to entry. Only those with the financial resources to more or less guarantee success can thrive. New entrants, or competitors seeking a bigger slice of the pie, must invest a disproportionate amount of resources compared to the incumbent if they want to succeed. See Intel vs. TSMC for example: if you're lagging behind TSMC, you must invest more and be more aggressive, or you will never catch up.
And HW particularly suffers in this regard compared to software/services. Failbook or Twitter can afford to waste money for 10 years because it's assumed that the power of being N°1/having no real competition will drive income in some faraway future. The one with the biggest database/client list wins. Once installed, Discord, Twitter, Facebook, etc. never get replaced unless they fall severely behind in feature parity.

But HW is a veeeeery different affair. HW can be great one gen and bad the next. There is no 10-year open window to "grow the business". The business sells now or it doesn't, may sell later or may not, but there is no "it doesn't run a profit now but may in 10 years". No permanent customer/user retention. And HW is, well, produced. If Twitter has no success for 5 years and then suddenly explodes, they need infrastructure to handle 1M users for 5 years, then a massively larger infra for 100M users once those users roll in. Their costs, in machines at least, scale with their success.
If Intel sells crappy Arc and nobody wants it, the chips are still produced. You don't sell a promise of Arc. Sure, you can mitigate losses by cutting production when the product is terrible, but that only goes so far: you don't stop or start mass production without costs, and the biggest money sink was in the R&D anyway. Same as with software, one might say, but I'd argue that good software is incredibly easy to make for an internet service today, requiring a few months of work from a competent dev team. Nowhere near what it costs to create a bleeding-edge chip.

Well, I don't have much sympathy for Intel. They had the money, they had the clout, they had the history, the engineers... their competition was basically bleeding out in a ditch between 2012 and 2016. And look at them now.
It's Lisa time now.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,522
3,037
136
He started by stating 2770 in his first post, and then in a later post he said 3GHz.

At first he aligned the data, stating 32 WGP / 64 CUs / 256-bit bus / 693GB/s / 2770MHz / 240mm².
His first post explicitly mentioned die sizes; the rest were only numbers.
2770 for N48
515 for N44
You came up with a theory that it was frequency, not caring that the numbers were hugely different.
It can't be any clearer than this: 2770GB/s, as you stated, would imply memory chips running at roughly 100Gb/s, while a 7800 XT for instance uses 20Gb/s chips. You can see the huge discrepancy.
I was talking about IC BW -> Infinity Cache bandwidth.
And yes, there is a huge difference in BW depending on how many MB of cache there are.
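For reference, the arithmetic behind that claim (a minimal Python sketch; the 256-bit bus width comes from the leak quoted above, and the ~20Gb/s figure is the 7800 XT comparison from this post):

```python
# Per-pin memory speed implied by a total-bandwidth figure:
# bandwidth (GB/s) = bus_width (bits) * pin_speed (Gb/s) / 8

def implied_pin_speed(bandwidth_gbps, bus_width_bits):
    """Per-pin speed (Gb/s) needed to reach a total bandwidth (GB/s)."""
    return bandwidth_gbps * 8 / bus_width_bits

# If 2770 were a VRAM bandwidth in GB/s on the leaked 256-bit bus:
print(implied_pin_speed(2770, 256))  # ~86.6 Gb/s per pin, i.e. ~100Gb/s-class
# vs the ~20 Gb/s chips a 7800 XT actually uses -- a huge discrepancy
```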
 
  • Like
Reactions: Tlh97

Aapje

Golden Member
Mar 21, 2022
1,515
2,065
106
I disagree, AMD didn't redesign anything. RDNA 3 is fine; it's the power draw that is some jerry-rigged crapchute that demanded clocks be turned down 20%. Fix that (and apparently they did) and you instantly get a 15% general perf improvement.

If they didn't redesign anything, we'd have Navi 41, 42 and 43. AMD is not that creative in their naming.

So Navi 48 is definitely a redesign. Whether Navi 44 is so, is unclear, because we don't know whether they intended to make 3 or 4 chips.

And if RDNA4 is only a fix of RDNA3 with 15% improvement, that would be very disappointing.
 

adroc_thurston

Diamond Member
Jul 2, 2023
3,572
5,155
96
So Navi 48 is definitely a redesign.
no it's just a new die slotted where N43 (I think 43) could not meet the cost target. next.
because we don't know whether they intended to make 3 or 4 chips.
5.
Each gen is 5 parts.
And if RDNA4 is only a fix of RDNA3 with 15% improvement, that would be very disappointing.
i don't think gfx12 has any relation to gfx11.
 

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136
His first post explicitly mentioned die sizes; the rest were only numbers.
2770 for N48
515 for N44
You came up with a theory that it was frequency, not caring that the numbers were hugely different.

I was talking about IC BW -> Infinity Cache bandwidth.
And yes, there is a huge difference in BW depending on how many MB of cache there are.
What actually drives the effective BW of the Infinity Cache? There is a massive difference between the monolithic N33, which has 32MB and 477GB/s, and the 7700 XT with 3 MCDs, which has 48MB and 1995GB/s.
Even for the MCM parts, though, I can't figure out what actually makes up the "effective bandwidth" number. The closest I can come up with is this, where Cache BW is Effective BW minus Memory Bandwidth:
Card      | MCDs | Cache (MB) | Eff BW (GB/s) | Mem Clock (Gbps) | Mem BW (GB/s) | Eff BW/MCD | Cache BW (GB/s) | Cache BW/MCD
7700 XT   | 3    | 48         | 1995          | 18               | 432           | 665        | 1563            | 521
7800 XT   | 4    | 64         | 2708          | 19.5             | 624           | 677        | 2084            | 521
7900 GRE  | 4    | 64         | 2250          | 18               | 576           | 563        | 1674            | 419
7900 XT   | 5    | 80         | 2900          | 20               | 800           | 580        | 2100            | 420
7900 XTX  | 6    | 96         | 3500          | 20               | 960           | 583        | 2540            | 423

That at least gives a reasonably consistent number for Infinity Cache BW per MCD. There's still a huge disconnect between the N31 cards and the N32 cards. I assume IC bandwidth is tied to the infinity fabric speed, and N32 just runs the clock 24% faster than N31. If anyone knows the exact calculation for effective BW in RDNA3 I'd love to see it.
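For anyone who wants to poke at the numbers, a quick script reproducing the table's arithmetic (Python; the figures are copied from the table above, nothing new is assumed):

```python
# Cache BW = Eff BW - Mem BW, then normalized per MCD.
# (card, MCD count, effective BW GB/s, memory BW GB/s) from the table above.
cards = [
    ("7700 XT",  3, 1995, 432),
    ("7800 XT",  4, 2708, 624),
    ("7900 GRE", 4, 2250, 576),
    ("7900 XT",  5, 2900, 800),
    ("7900 XTX", 6, 3500, 960),
]

for name, mcds, eff_bw, mem_bw in cards:
    cache_bw = eff_bw - mem_bw  # implied Infinity Cache contribution
    print(f"{name:9}  cache BW {cache_bw:4} GB/s  per MCD {cache_bw / mcds:5.1f} GB/s")
# N32 parts come out at ~521 GB/s per MCD, N31 parts at ~420 GB/s --
# consistent with the ~24% clock difference suggested above.
```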
 

Mahboi

Golden Member
Apr 4, 2024
1,033
1,897
96

"The traditional performance IPC of RDNA4 is expected to increase by about 12% compared to RDNA3, while the improvement in light pursuit will be huge (hardware BVH traversal), and the IPC is expected to increase by about 25%.
RDNA4 should be a single-chip design, using TSMC N4P process, with a smaller area, so the cost is very low. The video memory should be GDDR6. The graphics card will be very cost-effective."
My personal copium about RDNA 4 then (bad maths incoming):

Assuming a baseline 20% extra clocks for RDNA 3 (so, 3GHz across the board), and 15% better general performance.
Assuming that the general RT improvement promised here is correct, so 25% better across the board.
Assuming also that RDNA 4 will have an extra 10% clocks, so not a 3GHz base but rather above 3.2GHz. It's on N4P, so that's not an unrealistic expectation. And also the promised 12% extra raster.
And assuming that the lack of a BVH walker is indeed the reason that RDNA suffers so damn much in NV-RT games like Cyberpunk.

With a die that, in raster, will provide somewhat above 7900 xt performance:

Raster:
[Chart: average FPS at 2560x1440, raster benchmarks]

So let's split the difference and say that, counting the extra, it's right in the middle between an XT and an XTX.
150 FPS base at 1440p for N48.
Remove the 12% extra perf at iso clocks, and the 15% perf from the extra clocks:
150 x .12 = 18
132

132 x .15 = 19.8
112.2 FPS

We're in the ballpark of a 7800 xt at 109.3 FPS base. Since N48 is a 256-bit bus thing, I don't expect it to shine at 4K; my 7900 xt and its 320-bit bus already take more Ls than they should vs an XTX's 384-bit.

Now taking a 7800 xt as base:
[Chart: average FPS with ray tracing at 1440p, from Tom's Hardware]

Taking 41 FPS at 1440p as base, adding the 12% and 15% general perf:
41 x 1.12 = 45.92 (46)
46 x 1.15 = 52.9 (53)

I don't think the reported RT uplift is "25% more on top of the 12%", so the RT-specific part is ~13%:
53 x 1.13 = 59.89 FPS, so we're nearing 60 FPS average, with RT on, at 1440p.

If my copium proves true, then it won't be just 20% more clocks but more like 30%, from ~2.5GHz to ~3.3GHz. So we can add a broad 10% improvement on top of all that, up to around 66 FPS average with RT on at 1440p. I don't have a clue how well it'll fare at 4K with that bus, but just upscale it.
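For transparency, the whole chain in one place (a Python sketch; every factor below is my assumption from this post, not a confirmed spec):

```python
# Back-of-the-napkin N48 RT estimate. All factors are copium/assumptions
# from the reasoning above, not confirmed specs.
base_fps   = 41.0   # 7800 xt, RT at 1440p (Tom's Hardware chart)
raster_ipc = 1.12   # rumored +12% raster IPC
clock_fix  = 1.15   # +15% from RDNA3's "fixed" clocks
rt_extra   = 1.13   # ~25% total RT uplift minus the 12% already counted
copium     = 1.10   # extra ~10% clocks on N4P (pure copium)

rt_fps = base_fps * raster_ipc * clock_fix * rt_extra
print(f"RT estimate:        {rt_fps:.1f} FPS")           # ~59.7
print(f"With copium clocks: {rt_fps * copium:.1f} FPS")  # ~65.6, call it 66
```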

Now the real question is how many of these games are crippled by the lack of a BVH walker. It used to be that an XTX did an abysmal 9 FPS at 4K with path tracing on in CP2077. Now I see a 7800 xt doing 26.9 FPS at 1440p in the Tom's Hardware article, so maybe something was fixed/accelerated somewhere since, or maybe "RT Ultra" doesn't mean path tracing on.
My understanding was "AMD raytracing is poor, but not even close to unusable for 90% of RT; when you have too many light bounces, it falls off a cliff and becomes unusable". The BVH HW walker was meant to turn the unusable into playable and to lift general perf.

I can assume that a general 25% improvement plus 30% more clocks, so around 25% better perf, is there. That's good, but that's just reaching a 4070 Ti and its 68 FPS average here. Reaching 4070 Ti tier RT with a 240mm² die is damn impressive, but nobody's gonna clap for AMD yet again reaching last generation's performance.
So my only hope at this point, that we'd go above a 4070 Ti and at least reach a Super's general perf, is that the BVH walker goes beyond "just 25%" and sometimes takes unusable RT with too many light bounces into usable RT. As in, you get a 25% general RT improvement, but in some games it's 100% or more. Possibly up to 150% more.

If AMD has effectively fixed the last thing that made their RT worse outside of software (which still takes years of work), then they have a strong contender that'll reach circa 4070 Ti Super and possibly get closer to a 4080's RT performance, while raster is already 4080 level anyway. For a card they'll sell for $600 and that could be financially viable at $400, that's a really great product, and a great heir to the 7800 xt, which I think is by far the best offering AMD has had this gen.

The Copium is on. Make me dream of functional 1440p path tracing, AMD. 4K I know won't happen without maximum upscaling, but let me dream. And the raster perf? Well, it'll be near XTX level, so I'm not worried; it'll handle everything raster well enough, especially at that price.
 
Last edited:
  • Like
Reactions: Tlh97 and Tarkin77

Saylick

Diamond Member
Sep 10, 2012
3,531
7,858
136
the # of channels.
And cache hit rate, no?

If your hit rate were always 0%, your effective bandwidth would match the VRAM bandwidth. If you always got a cache hit, your effective bandwidth would match the IF cache bandwidth. Of course, we're going to be somewhere in the middle.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,486
2,023
136
Even for the MCM parts though, I can't seem to figure out what actually makes up the "effective bandwidth" number. The closest I can come up with is this, where Cache BW is Effective BW - Memory Bandwidth.
If anyone knows the exact calculation for effective BW in RDNA3 I'd love to see it.

It's more complex and depends on cache hit rate. IIRC something like ram bw + min (1/(cache miss rate) * ram bw, ∞cache bw). It's supposed to model how much traditional ram bw you'd need to match the combination of ∞cache and ram bus.

I think it was in the slides when AMD first released the 6000-series, I can't go looking for it now.
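Taken at face value, that recollection would look something like this (a Python sketch; the formula is quoted from memory above, so treat it as illustrative, not AMD's actual definition):

```python
# Literal transcription of the recalled formula (IIRC-grade, illustrative):
# eff_bw = ram_bw + min(ram_bw / miss_rate, infinity_cache_bw)
def eff_bw_recalled(ram_bw, cache_bw, hit_rate):
    miss_rate = 1.0 - hit_rate
    if miss_rate == 0.0:
        return ram_bw + cache_bw  # everything hits the cache
    return ram_bw + min(ram_bw / miss_rate, cache_bw)

# e.g. 7900 XTX-ish inputs: 960 GB/s VRAM, 2540 GB/s implied cache BW
print(eff_bw_recalled(960, 2540, 0.5))  # 960 + min(1920, 2540) = 2880.0
```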
 
Last edited:

Mahboi

Golden Member
Apr 4, 2024
1,033
1,897
96
If they didn't redesign anything, we'd have Navi 41, 42 and 43. AMD is not that creative in their naming.

So Navi 48 is definitely a redesign. Whether Navi 44 is so, is unclear, because we don't know whether they intended to make 3 or 4 chips.

And if RDNA4 is only a fix of RDNA3 with 15% improvement, that would be very disappointing.
Wut?
Navi 41: chipletized, top tier, murdered
N42: chipletized, mid tier, murdered
N43: monolithic, entry tier, murdered
N44: monolithic, low-power gaming laptops on the cheapest node that's still gaming-worthy, kept
N45: something different
N46: something different
N47: something different
N48: something different that turned out to be a monolithic die with roughly the same cost and power goals as what N43 would've been, but cheaper and still marketable

Where's your logic now?
Corpos during the R&D phase throw in everything they can afford. That's why it's such a money hole: basically any viable idea in the office will be pursued, then killed off once a better idea comes forth. It just happens that a small, midrange-power monolithic die was the 4th extra option past the main goals. Nothing about that proves that it's a "redesign".
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,486
2,023
136
Yeah but that's not a real metric.

It's a thing AMD marketing likes to talk about. I'll bet you that it will be up on the slides again when RDNA4 launches. It might not be real, but it's something that will be mentioned, and it would probably match up well with the number on that leak, which is why people think that's what the number is.
 
  • Like
Reactions: Tlh97 and Elfear

Mahboi

Golden Member
Apr 4, 2024
1,033
1,897
96
It's a thing AMD marketing likes to talk about. I'll bet you that it will be up on the slides again when RDNA4 launches. It might not be real, but it's something that will be mentioned, and it would probably match up well with the number on that leak, which is why people think that's what the number is.
Marketing likes big numbers.
 

adroc_thurston

Diamond Member
Jul 2, 2023
3,572
5,155
96
I care, because it's interesting to know, and bandwidth relates to performance.
Very very very gamey metrics. Don't.
It's a thing AMD marketing likes to talk about.
Yeah but it's not relevant.
Perf, power, area. The three ultimate metrics for any GPU.
Not really sure why you say no, when the rest of your sentence is you admitting that they made a new chip.
It's a replacement.
So where are you getting 5 chips from?
We're talking RDNA4 aren't we?
Navi 21/22/23/24
Navi 31/32/33
You forgot NV36 but I forgive you.
 

MrTeal

Diamond Member
Dec 7, 2003
3,614
1,816
136
It's more complex and depends on cache hit rate. IIRC something like ram bw + min (1/(cache miss rate) * ram bw, ∞cache bw). It's supposed to model how much traditional ram bw you'd need to match the combination of ∞cache and ram bus.

I think it was in the slides when AMD first released the 6000-series, I can't go looking for it now.
There's this PDF that goes over it somewhat for the Pro cards

I can't find published values for the Infinity Fabric clock to confirm they're running at different speeds on N32 and N31, though.
If it depended on hit rate, it shouldn't scale linearly between the three N31 GPUs, as hit rate isn't linear.
 

Saylick

Diamond Member
Sep 10, 2012
3,531
7,858
136
It's more complex and depends on cache hit rate. IIRC something like ram bw + min (1/(cache miss rate) * ram bw, ∞cache bw). It's supposed to model how much traditional ram bw you'd need to match the combination of ∞cache and ram bus.

I think it was in the slides when AMD first released the 6000-series, I can't go looking for it now.
I don't think effective bandwidth is going to be higher than the IF Cache bandwidth, simply because all data has to go through the IF Cache to enter the GPU. There aren't two pipes of bandwidth, only one, which flows from VRAM to IF Cache to GPU. What bandwidth the GPU sees depends on how often it is able to tap into the full speed of the IF Cache (i.e. the hit rate). Unfortunately, the hit rate isn't some static number; it changes depending on the workload.

I think the math is as follows:

Eff. Bandwidth = VRAM Bandwidth * (1 - Hit Rate) + IF Cache Bandwidth * (Hit Rate)
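A tiny sketch of that interpolation (Python; only the formula comes from this post — the hit rates and the 3000 GB/s raw IF cache figure below are made-up placeholders):

```python
# Effective bandwidth as a hit-rate-weighted blend of VRAM and IF cache BW.
def effective_bw(vram_bw, if_cache_bw, hit_rate):
    """Eff BW = VRAM BW * (1 - hit rate) + IF cache BW * hit rate."""
    return vram_bw * (1.0 - hit_rate) + if_cache_bw * hit_rate

# Hypothetical inputs: 624 GB/s VRAM (7800 XT-style) and a made-up
# 3000 GB/s raw IF cache bandwidth, swept across hit rates.
for hit in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(f"hit rate {hit:4.0%}: {effective_bw(624, 3000, hit):6.0f} GB/s")
# 0% hit -> pure VRAM BW; 100% hit -> pure IF cache BW, as argued above.
```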
 

Aapje

Golden Member
Mar 21, 2022
1,515
2,065
106
Wut?
Navi 41: chipletized, top tier, murdered
N42: chipletized, mid tier, murdered
N43: monolithic, entry tier, murdered
N44: monolithic, low-power gaming laptops on the cheapest node that's still gaming-worthy, kept
N45: something different
N46: something different
N47: something different
N48: something different that turned out to be a monolithic die with roughly the same cost and power goals as what N43 would've been, but cheaper and still marketable

Where's your logic now?
Corpos during the R&D phase throw in everything they can afford. That's why it's such a money hole: basically any viable idea in the office will be pursued, then killed off once a better idea comes forth. It just happens that a small, midrange-power monolithic die was the 4th extra option past the main goals. Nothing about that proves that it's a "redesign".

Well, my logic is not that I made up some super-secret research program at AMD where they develop 8 chips every gen and count up to Navi x8, yet somehow have only ever released up to Navi x4 in each of the previous Navi generations.

It's beyond obvious that Navi 48, at least, was a late development, which also explains why the fastest chip has the highest number, even though in every previous Navi generation, the higher a chip's number, the weaker it was.
 

Aapje

Golden Member
Mar 21, 2022
1,515
2,065
106
It's a replacement.

Replacement = redesign.

We're talking RDNA4 aren't we?

Yes, but where do you get your information that it was going to have 5 chips? I'm looking at earlier gens, because there we know what they released.

You forgot NV36 but I forgive you.

Not sure what an Nvidia chip from 2004 has to do with any of this, or why you act so arrogant while saying something so silly.