Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Government is starting to prepare the software environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in the LLVM review chains (before the changes get merged to GitHub), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of there being no host CPU capable of PCIe 5.0 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts; the MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 

KompuKare

Golden Member
Jul 28, 2009
16GB HBM3 per stack. A single stack would provide up to 819 GB/s at 6.4 Gbps. One such stack would provide enough bandwidth for every GPU except the RTX 4090 and RX 7900 XTX.
Yet, they still use GDDR6. :(

I imagine not only would the cost be too high, but the advanced packaging etc. that is needed might be trouble too.

After Fury and Vega, where speculation was that some HBM parts were sold at a loss, I suspect AMD is not too keen to use HBM in mainstream GPUs again, despite all the years of work AMD put in to help develop HBM in the first place.
 

Mopetar

Diamond Member
Jan 31, 2011
16GB HBM3 per stack. A single stack would provide up to 819 GB/s at 6.4 Gbps. One such stack would provide enough bandwidth for every GPU except the RTX 4090 and RX 7900 XTX.
Yet, they still use GDDR6. :(

If they wanted to use HBM they'd need to include extra transistors on every chip or create a separate chip with a different memory controller.

My hope is that, as designs move more in the MCM direction, we see separate chiplets for both GDDR and HBM that can be used interchangeably with the other chiplets.
 

Joe NYC

Golden Member
Jun 26, 2021
I imagine not only would the cost be too high, but the advanced packaging etc. that is needed might be trouble too.

After Fury and Vega, where speculation was that some HBM parts were sold at a loss, I suspect AMD is not too keen to use HBM in mainstream GPUs again, despite all the years of work AMD put in to help develop HBM in the first place.

The higher the volume, the lower the cost with these packaging technologies, according to AMD.
 

JustViewing

Member
Aug 17, 2022
16GB HBM3 per stack. A single stack would provide up to 819 GB/s at 6.4 Gbps. One such stack would provide enough bandwidth for every GPU except the RTX 4090 and RX 7900 XTX.
Yet, they still use GDDR6. :(
But as with Fury, Vega7 and other AMD products, reviewers and the internet will find something to shoot it down over, even when it's unrelated to price, performance or power (hotspot temperatures, 95°C, cooler noise, can't overclock, not boosting to the 'up to' speed, no HDMI 2.0, no CUDA, PCIe 4x, AMD using 5W more in a comparable ~200W card so it's 'power hungry', complaints about price when previous cheaper products were not appreciated, etc.).

If the HBM models were not appreciated by the public, why should AMD bother? Same with Polaris: it was meant to be a cheap mass-market product, but it didn't get the love from reviewers or the internet. Because of that, we don't have cheap video cards now.

Personally I love cards like the Fury Nano: a compact card with good performance and power consumption. This should have been the trend, but now we have the opposite with graphics bricks.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
But as with Fury, Vega7 and other AMD products, reviewers and the internet will find something to shoot it down over, even when it's unrelated to price, performance or power (hotspot temperatures, 95°C, cooler noise, can't overclock, not boosting to the 'up to' speed, no HDMI 2.0, no CUDA, PCIe 4x, AMD using 5W more in a comparable ~200W card so it's 'power hungry', complaints about price when previous cheaper products were not appreciated, etc.).

Personally I love cards like the Fury Nano: a compact card with good performance and power consumption. This should have been the trend, but now we have the opposite with graphics bricks.
I would love compact cards, but the Nano wasn't really a great card. The only things going for the Fury Nano were its compact size and low fan noise compared to the GTX 980 Ti.
If you look at the RTX 4070 Ti's board, there is barely anything on it, and the card could be a lot smaller.
[image: bare RTX 4070 Ti board]

The problem is the huge cooler necessitated by the high power consumption.

Nvidia or AMD could make at least one low-TDP model per chip for desktop, but not many would buy one unless the price was right.
Take the 4070 Ti, for example. It supposedly costs $799 and has a 285W TBP.
If you want a low-TBP version, let's set it to 180W for the whole board. Clocks could be comparable to the mobile RTX 4080, but I will set them a bit lower, to 2200MHz, because this is the full chip and I want to be sure that TBP is enough for the clockspeed.

The result:
You save about 37% of the power, but performance drops to 2200/2610 = 0.84, i.e. 16% less. If you wanted to keep the same perf/price ratio, you would need to set the price to roughly $671.
You save something on a cheaper cooler and PCB, but that's not enough. You could even swap the 21Gbps GDDR6X for 18Gbps GDDR6 and the TFLOPs-to-bandwidth ratio would stay the same, but you won't save $120 that way.
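As quick napkin math in Python (the 2610MHz stock boost clock and the assumption that performance scales linearly with clock are my own inputs, not measured data):

```python
# Hypothetical low-TBP RTX 4070 Ti: what price keeps perf/price constant?
# All inputs are assumed figures from this post, not measurements.
msrp = 799                            # USD
stock_tbp, low_tbp = 285, 180         # W
stock_clock, low_clock = 2610, 2200   # MHz, assumed boost clocks

power_saved = 1 - low_tbp / stock_tbp            # ~0.37 -> about 37% less power
perf_ratio = round(low_clock / stock_clock, 2)   # 0.84, assuming perf ~ clock
fair_price = msrp * perf_ratio                   # ~$671 for the same perf/$

print(f"power saved: {power_saved:.0%}, perf: {perf_ratio:.0%}, "
      f"fair price: ${fair_price:.0f}")
```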

They could release it as an RTX 4070 for $679; I think it's doable at that price. But what would you do with the faulty AD104 chips? You would also be using up fully functional chips that could have gone into the RTX 4070 Ti, although I don't think demand will be strong enough for that to become a problem.
You couldn't set the same price even if the two performed the same, because the faulty chips, with fewer SMs, would need a higher TBP on account of their higher clocks.

This is easy with laptops, because OEMs choose based on power envelope, not end performance, and let the customer pay for it. They market it as an ultrathin premium laptop with a better chassis, etc.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
If the HBM models were not appreciated by the public, why should AMD bother? Same with Polaris: it was meant to be a cheap mass-market product, but it didn't get the love from reviewers or the internet. Because of that, we don't have cheap video cards now.
That may be true, but I would have loved this HBM + compact card trend to continue.
HBM3 is a great memory, so people like you or me wonder why it isn't used more.
Is it really so expensive?
Four years ago, AMD released the Radeon VII with 16GB of HBM2 in 4 stacks and 1TB/s of bandwidth for $699. One stack is 4GB and 256GB/s.
Is a single 16GB HBM3 stack today costlier than four stacks back then? Hardly.

I will try to calculate the production cost of these GPUs. Defect density is 0.053/cm², and I ignore the partially faulty dies.
Wafer prices: 7nm was $8,000 at the time; 5nm is currently $15,000; 6nm is currently $7,000.

Radeon VII is 331mm2 -> 139 good dies. 8000/139= $57.5, with packaging maybe $85.
MCD is 37mm2 -> 1547 good dies. 7000/1547 = ~ $4.5
N31 GCD is 300mm2 -> 160 good dies. 15000/160 = $94 + 6*$4.5 = $121, with packaging maybe $155.
N32 GCD is 200mm2 -> 251 good dies. 15000/251 = $60 + 4*$4.5 = $78, with packaging maybe $100.
N33 monolith 204mm2 -> 250 dies. 7000/250 = $28, with packaging maybe $40.
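As a sketch, here is roughly how such good-die numbers fall out. I'm assuming a 300mm wafer, the textbook dies-per-wafer approximation and a Poisson yield model; whatever calculator was actually used for the figures above will land a bit differently (e.g. this gives ~148 rather than 139 for the Radeon VII):

```python
import math

def good_dies_per_wafer(die_area_mm2, defect_density_per_cm2=0.053,
                        wafer_diameter_mm=300):
    """Textbook approximation: gross dies from usable wafer area, then a
    Poisson yield factor exp(-A*D). Real calculators also model die aspect
    ratio and scribe lines, so expect small deviations from the post."""
    radius = wafer_diameter_mm / 2
    gross = (math.pi * radius**2 / die_area_mm2
             - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))
    yield_fraction = math.exp(-(die_area_mm2 / 100) * defect_density_per_cm2)
    return int(gross * yield_fraction)

for name, area_mm2, wafer_usd in [("Radeon VII, 7nm", 331, 8000),
                                  ("N31 GCD, 5nm",    300, 15000),
                                  ("N32 GCD, 5nm",    200, 15000),
                                  ("N33, 6nm",        204, 7000),
                                  ("MCD, 6nm",         37, 7000)]:
    dies = good_dies_per_wafer(area_mm2)
    print(f"{name}: ~{dies} good dies -> ~${wafer_usd / dies:.0f} per die")
```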

I found this, but it's >10 years old. AMD asked 23-27% of the MSRP.
[image: ~2011 breakdown of an AMD GPU's MSRP, showing AMD asking 23-27% of MSRP for the chip]

The HD 6970 is 389mm² at 40nm → 113 good dies. What did a 40nm wafer cost back then? I found $2,274, but I don't know if that's correct. 2274/113 = $20, plus maybe $5 packaging.
AMD asked $85 for the chip, which is 240% more (i.e. 340% of die cost). If I apply the same to these chips, it looks like the table below. GDDR6 is included in the BOM at $20 per 2GB module.
| Card | MSRP | Chip cost | Historical price for manufacturers: 340% of die cost, or (23-27% of MSRP) | BOM | What's left |
| --- | --- | --- | --- | --- | --- |
| Radeon VII | $699 | $85 (HBM2 not included) | $289 or ($161-189) (HBM2: +$240) | $70 | $100 or ($200-228) |
| RX 7900 XTX | $999 | $155 | $527 or ($230-270) | $330 | $142 or ($399-439) |
| RX 7800 XT | $749 | $100 | $340 or ($172-202) | $240 | $169 or ($307-337) |
| RX 7600 XT | $449 | $40 | $136 or ($103-121) | $150 | $163 or ($178-196) |
The first option, based on die cost, is wrong for today. The second one looks more believable, but realistically it should be more than 23-27% of the MSRP.

I will change it to: 8% shop, 2% shipping, 20% manufacturer margin; that leaves 70% of the MSRP for the BOM and AMD.
Radeon VII: $489 - $70 = $419 (60% of MSRP; without the $240 HBM2 cost you are left with $179, which is 2.1x the die cost)
RX 7900 XTX: $699 - $330 = $369 (37% of MSRP; 2.4x the die cost)
RX 7800 XT: $524 - $240 = $284 (38% of MSRP; 2.8x the die cost)
RX 7600 XT: $314 - $150 = $164 (36.5% of MSRP; 4.1x the die cost)
This is roughly what AMD can ask for these chips.
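The same split in code form, with the $240 of HBM2 folded into the Radeon VII BOM so all four rows are computed the same way (every figure is a guess from above):

```python
# Split of the MSRP: 8% shop, 2% shipping, 20% AIB margin -> 70% left for BOM + AMD.
cards = {
    # name: (MSRP $, BOM $ including memory, chip production cost $)
    "Radeon VII":  (699,  70 + 240,  85),  # $240 of assumed HBM2 cost in the BOM
    "RX 7900 XTX": (999, 330,       155),
    "RX 7800 XT":  (749, 240,       100),
    "RX 7600 XT":  (449, 150,        40),
}

for name, (msrp, bom, die_cost) in cards.items():
    chip_budget = 0.70 * msrp - bom   # what AMD can ask for the chip itself
    print(f"{name}: ${chip_budget:.0f} ({chip_budget / msrp:.0%} of MSRP, "
          f"{chip_budget / die_cost:.1f}x die cost)")
```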

What does this have to do with HBM? Not much. :D

Now back to HBM3, finally. ;) Let's say a single 16GB HBM3 stack costs $80.
Let's keep the price of GDDR6 at $20 per 2GB module and add the memory cost to the chip cost.
RX 7900 XTX: 12*$20 + $155 = $395
RX 7800 XT: 8*$20 + $100 = $260 (16GB over a 256-bit bus means 8 modules)
RX 7600 XT: 4*$20 + $40 = $120

So chip + HBM3 needs to come in cheaper than the above prices. Realistically you would also save something on a smaller GCD or monolithic die, but I will ignore that for now.
Let's say 1x 16GB HBM3 stack costs $80, with packaging at $20 instead of the $12 I used for N33. Production cost: $28 + $20 + $80 = $128.
Let's say 2x 16GB HBM3 stacks cost $160, with packaging at the same $22 I used for N32. Production cost: $60 + $22 + $160 = $242.
Let's say 3x 16GB HBM3 stacks cost $240, with packaging at $25 instead of the $34 I used for N31. Production cost: $94 + $25 + $240 = $359.
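In code, next to the GDDR6 configs from earlier for comparison (the MCDs only exist in the GDDR6 configuration, which is part of the point of going HBM3 here; every price is a guess):

```python
# Production cost: chip + packaging + memory, GDDR6 config vs HBM3 config.
# Assumed prices from this post: $20 per 2GB GDDR6 module, $80 per 16GB
# HBM3 stack, ~$4.5 per MCD (GDDR6 configuration only).
configs = {
    # name: (GCD/die $, MCDs, GDDR6 pkg $, GDDR6 modules, HBM3 pkg $, HBM3 stacks)
    "N33": (28, 0, 12,  4, 20, 1),
    "N32": (60, 4, 22,  8, 22, 2),
    "N31": (94, 6, 34, 12, 25, 3),
}
MCD, GDDR6_2GB, HBM3_16GB = 4.5, 20, 80

for name, (die, mcds, pkg_g, mods, pkg_h, stacks) in configs.items():
    gddr6 = die + mcds * MCD + pkg_g + mods * GDDR6_2GB
    hbm3 = die + pkg_h + stacks * HBM3_16GB
    print(f"{name}: GDDR6 ${gddr6:.0f} vs HBM3 ${hbm3:.0f} "
          f"(delta {hbm3 - gddr6:+.0f})")
```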

I will subtract the GDDR6 memory from the BOM, and what's left is what AMD can ask for the chip + HBM3.
RX 7900 XTX: $699 - $90 = $609 (61% of MSRP)
RX 7800 XT: $524 - $80 = $444 (59% of MSRP)
RX 7600 XT: $314 - $70 = $244 (54% of MSRP)

Production cost with HBM3 comes out a bit higher for N33, but for N32 and N31 it is actually lower. This is not bad, not bad at all.
On top of that, you would save some power by using HBM3, and the N31/N32/N33 dies get smaller by shedding the interconnect and the Infinity Cache.
You get 16GB, 32GB and 48GB of HBM3 VRAM plus a lot more bandwidth:
N33: 819GB/s vs 320GB/s (128-bit 20Gbps GDDR6)
N32: 1638GB/s vs 640GB/s (256-bit 20Gbps GDDR6)
N31: 2457GB/s vs 960GB/s (384-bit 20Gbps GDDR6)
With this in mind, those chips should perform better and have more VRAM, so you could simply raise the MSRP by $50/$75/$100, which is 10-11% more per card.
You end up with a higher profit (margin), or it compensates for HBM3 if a stack costs more than $80. :)
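Those bandwidth figures are just the standard width × pin-speed arithmetic (each HBM3 stack has a 1024-bit interface):

```python
def gddr6_bw(bus_bits, gbps):
    """GDDR6 bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_bits * gbps / 8

def hbm3_bw(stacks, gbps=6.4):
    """HBM3 bandwidth in GB/s: 1024-bit interface per stack."""
    return stacks * 1024 * gbps / 8

print(gddr6_bw(128, 20), hbm3_bw(1))  # N33: 320.0 vs  819.2
print(gddr6_bw(256, 20), hbm3_bw(2))  # N32: 640.0 vs 1638.4
print(gddr6_bw(384, 20), hbm3_bw(3))  # N31: 960.0 vs 2457.6
```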
 

TESKATLIPOKA

Platinum Member
May 1, 2020
Production cost with HBM3 comes out a bit higher for N33, but for N32 and N31 it is actually lower. This is not bad, not bad at all.
On top of that, you would save some power by using HBM3, and the N31/N32/N33 dies get smaller by shedding the interconnect and the Infinity Cache.
You get 16GB, 32GB and 48GB of HBM3 VRAM plus a lot more bandwidth:
N33: 819GB/s vs 320GB/s (128-bit 20Gbps GDDR6)
N32: 1638GB/s vs 640GB/s (256-bit 20Gbps GDDR6)
N31: 2457GB/s vs 960GB/s (384-bit 20Gbps GDDR6)
With this in mind, those chips should perform better and have more VRAM, so you could simply raise the MSRP by $50/$75/$100, which is 10-11% more per card.
You end up with a higher profit (margin), or it compensates for HBM3 if a stack costs more than $80. :)
Continuation:
If the frequency can't be raised enough (because of TBP) to gain that performance, then just add more CUs. Even just 2x the VRAM could be interesting to some customers.
N33 16GB HBM3 at $499 vs N33 8GB GDDR6 at $449? Take my money. ;)

N33 would also save the 32MB of Infinity Cache, which is no longer needed with HBM3; that's ~20mm², and the HBM PHY should be smaller than the GDDR6 PHY. Let's say you save 25mm². Within that area you should be able to fit 8 more CUs, which is 25% more TFLOPs at the same clockspeed, for 40 CUs (20 WGPs) in total.
Something similar should be possible for N32 and N31, because the interconnect uses up a lot of space.
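The area trade as napkin math (the ~20mm² for 32MB of IC, the PHY saving and the implied ~3mm² per CU are all rough guesses from above, not die-shot measurements):

```python
# N33 area freed by dropping Infinity Cache + using a smaller HBM PHY,
# spent on extra CUs. All figures are rough guesses, not die-shot data.
ic_area = 20.0      # mm^2, assumed for 32MB of Infinity Cache
phy_saving = 5.0    # mm^2, assumed GDDR6 PHY minus smaller HBM PHY
freed = ic_area + phy_saving

base_cu, extra_cu = 32, 8   # post assumes 8 more CUs fit into ~25 mm^2
print(f"{freed:.0f} mm^2 freed -> {base_cu + extra_cu} CUs "
      f"({(base_cu + extra_cu) // 2} WGPs), "
      f"+{extra_cu / base_cu:.0%} TFLOPs at the same clock")
```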

 

Anhiel

Member
May 12, 2022
Here's some 3rd-party info from Locuza on the MI300:
The photos are misleading. Since I doubt AMD would lie about the disclosed technical details, and they said it has 9 chiplets and 8 memory dies, this matches their layout graphic if you consider 6 chiplets to be CDNA3, 2 chiplets each with 12 Zen 4 cores, and one chiplet with the AI/XDNA logic. This also matches the performance scaling they published. Given this, we can also estimate the new XDNA performance per area.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
If they wanted to use HBM they'd need to include extra transistors on every chip or create a separate chip with a different memory controller.

My hope is that, as designs move more in the MCM direction, we see separate chiplets for both GDDR and HBM that can be used interchangeably with the other chiplets.
It would be a chip with only an HBM controller, so basically a different chip.
The reason I would like HBM is that there would be no good reason for a separate MCD, and you wouldn't even need Infinity Cache, so you save on silicon.
That doesn't mean I am against chiplets or MCM, but for RDNA3, HBM3 looks like the better option.
 

Khanan

Senior member
Aug 27, 2017
It will be expensive to produce, but people seem to forget that it's a super expensive server part, which means AMD will probably make bank with it. As a reminder, AMD has lately been so popular in servers that they can't keep up with demand. It won't be different with this genius processor, which is really in the best spirit of "AMD".
 

Mopetar

Diamond Member
Jan 31, 2011
It would be a chip with only an HBM controller, so basically a different chip.
The reason I would like HBM is that there would be no good reason for a separate MCD, and you wouldn't even need Infinity Cache, so you save on silicon.
That doesn't mean I am against chiplets or MCM, but for RDNA3, HBM3 looks like the better option.

You'd still want Infinity Cache for the better latency, regardless of the bandwidth. If you could get the drivers to figure out the best bits to keep in that cache, it would be the best of both worlds.
 

moinmoin

Diamond Member
Jun 1, 2017
If you could get the drivers to figure out the best bits to keep in that cache, it would be the best of both worlds.
I'd think "dumb" victim caches that simply eject the least recently accessed lines would be most efficient, and they don't need driver logic to work well. It should be worth it as long as the latency improvement over the next level in the cache/memory hierarchy is significant.
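A minimal software sketch of such a "dumb" victim cache (purely illustrative; real hardware does this in tag/SRAM logic, and the eviction policy here is plain least-recently-used):

```python
from collections import OrderedDict

class VictimCache:
    """Small fully-associative victim cache: holds lines evicted from the
    cache above it and ejects the least recently accessed line when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # tag -> data, ordered oldest to newest

    def insert(self, tag, data):
        """Called with a line just evicted from the cache above."""
        self.lines[tag] = data
        self.lines.move_to_end(tag)           # mark as most recently touched
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # eject least recently accessed

    def lookup(self, tag):
        """Hit: remove and return the line (it moves back up). Miss: None,
        and the request falls through to the next level of the hierarchy."""
        return self.lines.pop(tag, None)
```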
 

Kaluan

Senior member
Jan 4, 2022
While I get that CDNA3 has some hard data available (and launches within the year), basically making it the hot subject right now...

...and MI300 maybe giving us some insight into next-gen Radeon aside, I can't help but think no one has directed their speculation at RDNA4 because RTG and RDNA leaks are viewed as a can of worms right now. 😅

Sad though, there's a mountain of speculation and related tangents to be had; that's why these forums exist.
 

Kaluan

Senior member
Jan 4, 2022
Well shucks, no takers? Guess I'll go first.

With the alleged "extended chiplet design" RDNA4 may bring... any guesses as to what the IO die will contain? And how small can it be?

I'm guessing the IO die will contain things like the display controller and the PCIe controller. But what about the media engines? Those could stay on the graphics chiplets, where they would add "value" to SKUs based on N41: 3x the media decoders/encoders (say, on an RX 8900) vs a single one on an N43 SKU.
 

Joe NYC

Golden Member
Jun 26, 2021
While I get that CDNA3 has some hard data available (and launches within the year), basically making it the hot subject right now...

...and MI300 maybe giving us some insight into next-gen Radeon aside, I can't help but think no one has directed their speculation at RDNA4 because RTG and RDNA leaks are viewed as a can of worms right now. 😅

Sad though, there's a mountain of speculation and related tangents to be had; that's why these forums exist.

Nothing in relation to Radeon, but here is one educated guess about performance that I came across:


In summary, he expects 2x raw FP64 performance, and bigger gains on the smaller data types typically used in AI.
 

DisEnchantment

Golden Member
Mar 3, 2017
I don't know who this All The Watts fellow is, but he seems to be reporting some stuff that is now being echoed by a few LeakTubers.

Not coincidentally, I found a patent describing the same arrangement as the leak above. I am not sure if said leaker just started reading patents and made up leaks, hahaha.
Basically, the patent has 1x, 2x and 3x GCX configs, as shown in the leak LOL.
Shader Engine Dies (SEDs) --> stacked on top of a base die
Base dies --> memory controller + CP + LLC
LSI is used to connect the base dies.
The inventor is Mike Mantor, Senior Fellow at AMD.

20220320042 - DIE STACKING FOR MODULAR PARALLEL PROCESSORS

The differences between the leak and the patent:
  • There are MCDs in the leak, whereas in the patent the IC is within the base die.
  • Each GCD is basically functional as a GPU, with the LLC/CP on the base die and the SEDs on top.
  • The patent calls each individual stacked die (base + SEDs) a GCD (see below). There are multiple GCDs in the GPU.
  • The SEDs could perhaps be the GCX?

Although illustrated as including two SEDs 412, those skilled in the art will recognize that any number of processing units may be positioned in the processing unit layer stacked above the active interposer die 404. In this configuration, a portion of a conventional graphics complex die (GCD) is pushed up to a second floor based on 3D die stacking methodologies by positioning the plurality of shader engine dies 412 in a layer on top of the active interposer die 404.
Referring now to FIG. 7, illustrated is a block diagram of a plan view 700 of a graphics processor MCM 702 employing graphics processing stacked die chiplets in accordance with some embodiments. The graphics processor MCM 702 (similar to the parallel processor MCM 202 of FIG. 2) is formed as a single semiconductor chip package including N=3 number of communicably coupled graphics processing stacked die chiplets 602 of FIG. 6. As shown in plan view 700, the graphics processor MCM 702 includes a first graphics processing stacked die chiplet 702a, a second graphics processing stacked die chiplet 702b, and a third graphics processing stacked die chiplet 702c.
Additionally, in various embodiments, the base active interposer die 604 includes one or more levels of cache memory 610 and one or more memory controller PHYs 614 for communicating with external system memory (not shown), such as dynamic random access memory (DRAM) module.
From this patent, it is basically MI300 tech.

UPDATE:
Twitter Account deleted.
 