
Question Speculation: RDNA2 + CDNA Architectures thread


DDH

Member
May 30, 2015
167
168
111
This is an even stronger case for more than 80 CU in a ~500 mm^2 GPU with HBM controllers. 1 CU = 3.2 mm^2.

Either we get more than 80 CU (96 or even more), or Big Navi is much less than 500 mm^2.
Yeah, I was wondering this myself. Maybe there is big Navi, and then bigger Navi.

Because remember, the MI100 has 128 CUs, with 120 active.



Also, 1 CU is smaller than 3.2 mm^2. I calculated through pixel measurement that 1 DCU was ~3.6 mm^2; multiplied by the 28 DCUs, that is ~100 mm^2 for the CUs alone. Which seems plausible, as a lot of GPU area is dedicated to things besides the CUs (cache, probably).

The XBSX die shot is online, so please verify my estimates if you feel like it.
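For anyone who wants to re-check the arithmetic, here is a minimal sketch (Python; the ~3.6 mm^2-per-DCU figure and the 28 DCUs on the XBSX die are the poster's pixel-measurement estimates, not official numbers):

```python
# Rough die-area arithmetic from the pixel-measurement estimate above.
# All inputs are forum estimates, not official figures.

MM2_PER_DCU = 3.6   # estimated area of one dual-CU (DCU/WGP) on the XBSX die
XBSX_DCUS = 28      # the XBSX's 56 CUs = 28 dual-CUs

cu_area = MM2_PER_DCU * XBSX_DCUS   # silicon spent on CUs alone
mm2_per_cu = MM2_PER_DCU / 2        # a single CU is half a DCU

print(f"CU-only area: {cu_area:.1f} mm^2")    # CU-only area: 100.8 mm^2
print(f"Per-CU area:  {mm2_per_cu:.1f} mm^2") # Per-CU area:  1.8 mm^2
```

The ~100 mm^2 CU total against a ~360 mm^2 die is what supports the point that most of the area goes to things other than CUs.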
 

A///

Senior member
Feb 24, 2017
829
578
106
Read something interesting in the comments section of some site, I forget which. But apparently drivers were hampered on the RX 5000 lineup because the team had to split the code between GCN and RDNA, and RDNA2 drops all the legacy code. IIRC GCN was a bastard child of CDNA, right?
 

maddie

Diamond Member
Jul 18, 2010
3,387
2,332
136
I would be equally satisfied if "Big" Navi at ~80 CU turned out to be ~300 mm^2. That could mean very good performance without costs being as high as for, e.g., the 3070.
Or at least a nicely priced mid Navi with 80 CU. That wouldn't prevent a larger die with >80 CU from existing for the glory seekers.
 
  • Like
Reactions: Tlh97 and kurosaki

maddie

Diamond Member
Jul 18, 2010
3,387
2,332
136
Yeah, I was wondering this myself. Maybe there is big Navi, and then bigger Navi.

Because remember, the MI100 has 128 CUs, with 120 active.



Also, 1 CU is smaller than 3.2 mm^2. I calculated through pixel measurement that 1 DCU was ~3.6 mm^2; multiplied by the 28 DCUs, that is ~100 mm^2 for the CUs alone. Which seems plausible, as a lot of GPU area is dedicated to things besides the CUs (cache, probably).

The XBSX die shot is online, so please verify my estimates if you feel like it.
Just did it and got the 3.2 number again by using only the GPU portion from the image here. This still includes the CPU cores; if those are removed, it leaves enough room for two instances of the Xbox Series X GPU with HBM2 memory controllers in a 500 mm^2 die.

[Image: Xbox_Series_X_slide.jpg]
 

Stuka87

Diamond Member
Dec 10, 2010
5,296
1,077
136
Read something interesting in the comments section of some site, I forget which. But apparently drivers were hampered on the RX 5000 lineup because the team had to split the code between GCN and RDNA, and RDNA2 drops all the legacy code. IIRC GCN was a bastard child of CDNA, right?
CDNA is more of an offspring of GCN; GCN came long, long before CDNA. But GCN is compute-heavy, and it's very good at compute, which is why it was the basis for CDNA.

Not sure that comment you mention makes sense, though. RDNA is a new ISA, so the drivers are brand new for it (which is why they had some growing pains). This didn't impact old GCN drivers, and the GCN drivers would not have any direct impact on RDNA drivers. What would have had an impact is the potential resource limitation of the driver team being split between new and legacy code.
 
  • Like
Reactions: spursindonesia

DDH

Member
May 30, 2015
167
168
111
Just did it and got the 3.2 number again by using only the GPU portion from the image here. This still includes the CPU cores; if those are removed, it leaves enough room for two instances of the Xbox Series X GPU with HBM2 memory controllers in a 500 mm^2 die.

I think they could get more than 80 CUs in 500 mm^2. 172 mm^2 for 56 CUs, times two, is 344 mm^2; add ~100 mm^2 for a 512-bit GDDR bus and you get 444 mm^2. Of course this leaves out the controllers and probably lots of other things, but that would be 112 CUs and 16 GDDR channels. Just a fun speculative thought.
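The back-of-envelope numbers work out like this (a purely speculative sketch; the 172 mm^2 block size and ~100 mm^2 bus estimate are the poster's guesses, and doubling 56 CUs gives 112):

```python
# Speculative doubling exercise: two XBSX-like 56-CU GPU blocks plus a
# 512-bit GDDR bus.  All area figures are forum estimates.
CU_BLOCK_MM2 = 172   # estimated area of one 56-CU GPU block
BUS_MM2 = 100        # estimated area of a 512-bit GDDR interface

total_mm2 = 2 * CU_BLOCK_MM2 + BUS_MM2  # 344 + 100
total_cus = 2 * 56                      # two 56-CU blocks
channels = 512 // 32                    # 32-bit GDDR channels in a 512-bit bus

print(total_mm2, total_cus, channels)   # 444 112 16
```

This leaves ~56 mm^2 of a 500 mm^2 budget for everything not counted (controllers, IO, multimedia), which is why it is only a fun upper-bound exercise.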
 

DDH

Member
May 30, 2015
167
168
111
CDNA is more of an offspring of GCN; GCN came long, long before CDNA. But GCN is compute-heavy, and it's very good at compute, which is why it was the basis for CDNA.

Not sure that comment you mention makes sense, though. RDNA is a new ISA, so the drivers are brand new for it (which is why they had some growing pains). This didn't impact old GCN drivers, and the GCN drivers would not have any direct impact on RDNA drivers. What would have had an impact is the potential resource limitation of the driver team being split between new and legacy code.
The post was on Reddit, linked on OcUK. I'll see if I can dig it up.
 

DisEnchantment

Senior member
Mar 3, 2017
739
1,756
106
CDNA is more of an offspring of GCN; GCN came long, long before CDNA. But GCN is compute-heavy, and it's very good at compute, which is why it was the basis for CDNA.

Not sure that comment you mention makes sense, though. RDNA is a new ISA, so the drivers are brand new for it (which is why they had some growing pains). This didn't impact old GCN drivers, and the GCN drivers would not have any direct impact on RDNA drivers. What would have had an impact is the potential resource limitation of the driver team being split between new and legacy code.
The GCN 1.x architecture is not actually compute-heavy. RDNA 1 has the same compute throughput as the Radeon VII clock for clock, for example, if not more. Of course, RDNA can do additional scalar operations and is better at branching code.
The issue is that a full wave64 needs 4 cycles to complete on a GCN SIMD16 (each CU has 4x SIMD16).

With compute loads it is always possible to keep the pipeline busy, because the SIMD16 can engage every cycle, executing something that is part of consecutive wavefronts.
So compute loads are better suited to GCN.
For graphics, it could be that the whole wave has to complete before the next wave is scheduled (so there is a 4-cycle latency), or it could be that the wavefront is not as wide.
Thus GCN/Vega struggles to keep the SIMDs engaged at all times, and this results in lower performance even though the theoretical TFLOPS figure is fairly high.

Instruction-wise, RDNA hardware can run all the GCN instructions.
Besides, if we are talking about PC, the shader compiler will JIT-compile the shader code anyway, unlike consoles, where the shader binaries ship precompiled.
That said, LLVM introduces a new set of instructions and extensions for RDNA2 which older GCN hardware will not be able to run.
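The scheduling difference described above can be illustrated with a toy cycle model (a simplification that ignores latency hiding, co-issue, and memory stalls; the 4-cycles-per-wave64 cadence follows the description in the post, not any official timing):

```python
# Toy issue model following the post: GCN runs a wave64 on a SIMD16 over
# 4 cycles; RDNA runs a wave32 on a SIMD32 in a single cycle.  A GCN CU
# has 4x SIMD16, so peak throughput can match; the difference is the
# 4-cycle cadence each individual wave sees.

GCN_WAVE, GCN_SIMD_WIDTH = 64, 16
RDNA_WAVE, RDNA_SIMD_WIDTH = 32, 32

gcn_cycles_per_wave = GCN_WAVE // GCN_SIMD_WIDTH     # 4 cycles per wave64
rdna_cycles_per_wave = RDNA_WAVE // RDNA_SIMD_WIDTH  # 1 cycle per wave32

# Dependent case (the "graphics" scenario above): each wave must finish
# before the next is scheduled, so the cadence shows up as latency.
N_WAVES = 10
print(f"GCN : {N_WAVES * gcn_cycles_per_wave} cycles")   # GCN : 40 cycles
print(f"RDNA: {N_WAVES * rdna_cycles_per_wave} cycles")  # RDNA: 10 cycles

# Independent case (compute loads): with 4+ wavefronts in flight, GCN's
# SIMD16 can start a slice of some wave every cycle and stays busy.
```

This is why the same theoretical TFLOPS figure translates into different graphics performance: GCN needs enough independent wavefronts to hide the 4-cycle cadence, and graphics work does not always supply them.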

 

senseamp

Lifer
Feb 5, 2006
34,747
4,619
126
Dang, I guess TSMC and AMD didn't realize they had to start and work out a production schedule so they could deliver 10M processors and graphics chips to Sony by the end of March 2021.

Guess we won't see Zen3 or RDNA until next fall. See ya later, alligators!
You might see them in small quantities, at higher prices, or delayed until more mobile players move on to 5 nm. If Ryzen is selling like hotcakes, those dies are far more profitable for AMD than GPUs, unless the GPU can be in the $700+ range.
 

eek2121

Senior member
Aug 2, 2005
729
783
136
The GCN 1.x architecture is not actually compute-heavy. RDNA 1 has the same compute throughput as the Radeon VII clock for clock, for example, if not more. Of course, RDNA can do additional scalar operations and is better at branching code.
The issue is that a full wave64 needs 4 cycles to complete on a GCN SIMD16 (each CU has 4x SIMD16).

With compute loads it is always possible to keep the pipeline busy, because the SIMD16 can engage every cycle, executing something that is part of consecutive wavefronts.
So compute loads are better suited to GCN.
For graphics, it could be that the whole wave has to complete before the next wave is scheduled (so there is a 4-cycle latency), or it could be that the wavefront is not as wide.
Thus GCN/Vega struggles to keep the SIMDs engaged at all times, and this results in lower performance even though the theoretical TFLOPS figure is fairly high.

Instruction-wise, RDNA hardware can run all the GCN instructions.
Besides, if we are talking about PC, the shader compiler will JIT-compile the shader code anyway, unlike consoles, where the shader binaries ship precompiled.
That said, LLVM introduces a new set of instructions and extensions for RDNA2 which older GCN hardware will not be able to run.

That applies to Navi, not Navi2X. We already know that AMD has made changes in this area based on commits to the Mesa source code.
 

maddie

Diamond Member
Jul 18, 2010
3,387
2,332
136
Let's not get the hype train going too fast. After all, there isn't even a conductor... Raja left for Intel, and so far there haven't been any volunteers.
Hype is one thing, but we have the Xbox die shot to work with. If you remove the CPU clusters, you are left with ~317 mm^2 for 56 CUs + a 320-bit GDDR6 interface + all the IO + multimedia circuitry.

Even if we do a simple ratio analysis (the worst kind, as it expands every part equally), we get 158% for a 500 mm^2 die, or ~88 CUs.

It strongly suggests that even with a 512-bit GDDR6 memory bus, the 80 CU @ ~500 mm^2 die is wrong. We either have more than 80 CUs or a smaller die, and HBM2 controllers would allow for even more CUs.

Where am I so wrong in this?
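The ratio analysis is easy to reproduce (a linear-scaling sketch that, as the post says, naively expands every block equally with die area; the 317 mm^2 input is the poster's estimate from the die shot):

```python
# Linear ratio scaling from the XBSX GPU-only area to a 500 mm^2 die.
# Inputs are forum estimates from the die shot, not official data.
XBSX_GPU_MM2 = 317   # GPU portion after removing the CPU clusters
XBSX_CUS = 56
TARGET_MM2 = 500

scale = TARGET_MM2 / XBSX_GPU_MM2   # ~1.58x more area
cus = XBSX_CUS * scale              # ~88 CUs if CUs scale with area

print(f"scale: {scale:.0%}, CUs: {cus:.0f}")  # scale: 158%, CUs: 88
```

Since IO, multimedia, and memory interfaces would not actually grow with the die, the 88-CU figure is a floor under this model, which is the argument being made.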
 

blckgrffn

Diamond Member
May 1, 2003
7,366
636
126
www.teamjuchems.com
Hype is one thing, but we have the Xbox die shot to work with. If you remove the CPU clusters, you are left with ~317 mm^2 for 56 CUs + a 320-bit GDDR6 interface + all the IO + multimedia circuitry.

Even if we do a simple ratio analysis (the worst kind, as it expands every part equally), we get 158% for a 500 mm^2 die, or ~88 CUs.

It strongly suggests that even with a 512-bit GDDR6 memory bus, the 80 CU @ ~500 mm^2 die is wrong. We either have more than 80 CUs or a smaller die, and HBM2 controllers would allow for even more CUs.

Where am I so wrong in this?
Could it be greater hardware allocation to RT-type compute? More dead space to allow for effective cooling?
 

eek2121

Senior member
Aug 2, 2005
729
783
136
Assuming Navi2X has a similar density to what is speculated in the Xbox die shot, AMD could have taken a couple of different routes:

  1. Beef up GPU performance by fixing bottlenecks, widening things, etc. I suspect this really isn’t needed.
  2. Sell Big Navi for slightly less than the 3080. Everyone’s collective jaws would drop if AMD pushed out a part that was competitive with the 3080 but cost only $399-$499.
 
  • Like
Reactions: Tlh97

maddie

Diamond Member
Jul 18, 2010
3,387
2,332
136
Could it be greater hardware allocation to RT-type compute? More dead space to allow for effective cooling?
Fair enough, although these factors should already be accounted for by the Xbox die, seeing as it has a good frequency and RT hardware (CU-based).
 

eek2121

Senior member
Aug 2, 2005
729
783
136
Hype is one thing, but we have the Xbox die shot to work with. If you remove the CPU clusters, you are left with ~317 mm^2 for 56 CUs + a 320-bit GDDR6 interface + all the IO + multimedia circuitry.

Even if we do a simple ratio analysis (the worst kind, as it expands every part equally), we get 158% for a 500 mm^2 die, or ~88 CUs.

It strongly suggests that even with a 512-bit GDDR6 memory bus, the 80 CU @ ~500 mm^2 die is wrong. We either have more than 80 CUs or a smaller die, and HBM2 controllers would allow for even more CUs.

Where am I so wrong in this?
A third option just occurred to me: tensor cores.
 

maddie

Diamond Member
Jul 18, 2010
3,387
2,332
136
A third option just occurred to me: tensor cores.
Nah. Even Nvidia is deprecating the use of tensor cores on the sly.

Ray-tracing denoising shaders are a good example of something that might benefit greatly from doubled FP32 throughput.
 
