Question Speculation: RDNA2 + CDNA Architectures thread


uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
All die sizes are within 5mm^2. The poster here has been right about some things in the past afaik, and to his credit was the first to say 505mm^2 for Navi21, which other people have since backed up. Even so, take the following with a pinch of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
I think the quote from the whitepaper is "Each of the four SIMDs can request instructions every cycle and the instruction cache can deliver 32B (typically 2-4 instructions) every clock to each of the SIMDs". So that would be up to a theoretical 8 per CU/16 per WGP.

In practice, if you look at what units it can issue to, I believe the numbers are unchanged (with 1 vector ALU unit per SIMD you're obviously not issuing more than 2 per CU), or at least there is no indication of any changes. Looking at the 7, I'm also pretty sure those are not the only categories that can be issued; there are categories not named (like LDS or NOPs). Pretty sure RDNA1 and RDNA2 have these, but they were not included in the 7 explicitly listed on the Xbox slides.

In RDNA the instruction cache, which is shared per WGP, can fetch 32 bytes (2-4 instructions) per clock cycle per SIMD. Four SIMDs are present in a dual-CU WGP, so that's 4-8 instructions per clock cycle per CU and 8-16 per WGP.

RDNA2 does 7 instructions per clock cycle per CU. If that's sustained every clock cycle, then the key question is whether the RDNA2 instruction cache fetches 2x the bytes per SIMD per clock cycle. Even if we assume an average of 6 instructions per clock cycle per CU, that's still about 15% higher instruction throughput for RDNA2 at 7 instructions per clock cycle per CU. As always, the devil is in the details. :)
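For clarity, here is that fetch arithmetic as a minimal sketch. The numbers are taken straight from the whitepaper wording quoted above; nothing else is assumed.

```python
# Back-of-the-envelope fetch math from the quoted whitepaper figures:
# the shared instruction cache can deliver 32B, i.e. roughly 2-4
# instructions, to each SIMD every clock.

SIMDS_PER_CU = 2
SIMDS_PER_WGP = 4   # a dual-CU work group processor

for instr_per_simd in (2, 4):            # low / high end of "2-4"
    print(f"{instr_per_simd} instr/clk/SIMD -> "
          f"{instr_per_simd * SIMDS_PER_CU} per CU, "
          f"{instr_per_simd * SIMDS_PER_WGP} per WGP")

# -> 2/SIMD: 4 per CU,  8 per WGP
# -> 4/SIMD: 8 per CU, 16 per WGP
# Compare with the "7 instructions per clock per CU" issue figure
# quoted for RDNA2 above.
```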
 

andermans

Member
Sep 11, 2020
151
153
76
RDNA2 does 7 instructions per clock cycle per CU. If that's sustained every clock cycle, then the key question is whether the RDNA2 instruction cache fetches 2x the bytes per SIMD per clock cycle. Even if we assume an average of 6 instructions per clock cycle per CU, that's still about 15% higher instruction throughput for RDNA2 at 7 instructions per clock cycle per CU. As always, the devil is in the details. :)

I'm pretty sure the numbers mentioned are just theoretical maximums in the different categories (e.g. 2 vector ALU instructions/cycle with 2 vector ALU units is obviously a max, not an average; same for the others), so that wouldn't be sustained.
 

Timorous

Golden Member
Oct 27, 2008
1,748
3,240
136
This is not going to happen, sorry, because then the Xbox Series X would be faster than the RTX 2080 Ti, which it's not.

The 60 CU die will compete against GA104 (RTX 3070) and the 80 CU die will compete against GA102 (RTX 3080/3090).

For 40 CUs to compete with the 2080 Ti, RDNA2 needs a ~40% IPC increase (measured crudely as fps/TFLOP) at PS5 clockspeeds. If it could do this in a 225W envelope then AMD would also hit their 50% perf/watt target, as the 2080 Ti is about 50% faster than the 5700 XT at 4K.

If that is the IPC gain, this card would have ~11.5 TFLOPS, which means the Series X would also be 2080 Ti tier.

The fact that a 36 CU @ 2.23 GHz + 8c16t Zen 2 @ 3.5 GHz SoC with 16GB of GDDR6 can be powered by a 350W PSU (which will have plenty of overhead) suggests that a 2.23 GHz 40 CU dGPU could be done for less than the 225W TBP of the 5700 XT.

I think a 40 CU 2.23 GHz part is more likely to land in 2080 Super territory than 2080 Ti territory due to memory bandwidth, but I also suspect the 3070 will be closer to the 2080 Super than to the Ti.
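For reference, here is the raw FLOPS math behind those figures as a small sketch. It assumes the usual 64 shader ALUs per CU and 2 FLOPs per ALU per clock (FMA); those constants are my assumptions, not something stated in the post.

```python
# Rough TFLOPS math for the configurations discussed above.
# Assumes 64 ALUs per CU and 2 FLOPs/ALU/clock (FMA counting).

def tflops(cus, clock_ghz, alus_per_cu=64, flops_per_clock=2):
    return cus * alus_per_cu * flops_per_clock * clock_ghz / 1000.0

print(f"40 CU @ 2.23 GHz: {tflops(40, 2.23):.1f} TFLOPS")                    # ~11.4
print(f"36 CU @ 2.23 GHz (PS5): {tflops(36, 2.23):.1f} TFLOPS")              # ~10.3
print(f"40 CU @ 1.905 GHz (5700 XT boost): {tflops(40, 1.905):.2f} TFLOPS")  # ~9.75
```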
 

blckgrffn

Diamond Member
May 1, 2003
9,299
3,440
136
www.teamjuchems.com
Oh man I love being on this train. You guys are making me legit excited 😆

I am also in the camp that finds it hard to believe the base 3070 will exceed the 2080 Ti; I expect it to instead land above the 2080 Super.

Also, do we really care what a 40 CU card does at 4K? That still seems like the domain of the highest-tier cards if you want sustained 60+ FPS.

Show me it trouncing my 5700xt at 1440p and I’ll be opening my wallet.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,523
3,037
136
For 40 CUs to compete with the 2080 Ti, RDNA2 needs a ~40% IPC increase (measured crudely as fps/TFLOP) at PS5 clockspeeds. If it could do this in a 225W envelope then AMD would also hit their 50% perf/watt target, as the 2080 Ti is about 50% faster than the 5700 XT at 4K.
The RTX 2080 Ti is 50% faster than the RX 5700 XT. For the same performance you need 25% better IPC and 20% higher clockspeed. Of course, you also need enough bandwidth for it not to be a bottleneck.
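A quick check of that decomposition (the ~1.86 GHz effective 5700 XT clock behind the 20% figure is my assumption, not something stated above):

```python
# 25% IPC x 20% clock ~= the 50% gap to the 2080 Ti.
ipc_gain, clock_gain = 1.25, 2.23 / 1.86      # ~1.20
print(f"{ipc_gain * clock_gain:.2f}x")        # ~1.50
```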
 

Konan

Senior member
Jul 28, 2017
360
291
106
The RTX 2080 Ti is 50% faster than the RX 5700 XT. For the same performance you need 25% better IPC and 20% higher clockspeed. Of course, you also need enough bandwidth for it not to be a bottleneck.
Yeah, a 192-bit bus and none of that extra cache magic that will be on the higher SKUs.

If anything, everything is exaggerated. The same thing happens every couple of years, followed by "wait for the next one".
 

Avalon

Diamond Member
Jul 16, 2001
7,567
156
106
Throwing my hat in the ring here and saying Navi 21 will be at least as fast as this card:

[image: bitchin.jpg]
 

Glo.

Diamond Member
Apr 25, 2015
5,803
4,777
136
I love how things get twisted in this discussion.

I was talking about a 40 CU GPU being 10% above the RTX 2080 Super, or 35% above the RX 5700 XT.

I've just checked the performance comparison with the RTX 2080 Super in the TPU suite. If we look at the TechPowerUp charts, the RTX 2080 Super at 4K is 25% above the RX 5700 XT.
[image: TechPowerUp relative performance chart, 3840x2160]


Secondly, that's the performance level that the RTX 3070 will achieve, even according to Galax:

So let's get back to the discussion. How come a 40 CU chip cannot compete with the RTX 3070 while clocked at 2.3 GHz and having 10% higher IPC than RDNA1 GPUs?

Can anyone explain this to me? From the start, I've been trying to tell people that the RTX 3070 WILL NOT ACHIEVE RTX 2080 Ti performance levels. I don't know why people believe this, and spin the discussion so that in order to compete with the RTX 3070, a 40 CU GPU has to beat the RTX 2080 Ti.

It's absolutely ridiculous how overestimated Nvidia and their Ampere GPUs are.
 

dzoni2k2

Member
Sep 30, 2009
153
198
116
Can anyone explain this to me? From the start, I've been trying to tell people that the RTX 3070 WILL NOT ACHIEVE RTX 2080 Ti performance levels. I don't know why people believe this, and spin the discussion so that in order to compete with the RTX 3070, a 40 CU GPU has to beat the RTX 2080 Ti.

Nvidia's marketing is probably just behind Apple's in brainwashing power. Some people actually believe Nvidia more than independent benchmarks. They are still quoting ridiculous performance and perf/W claims from marketing slides that were proven to be complete horse*. It's quite mind-boggling.
 

Glo.

Diamond Member
Apr 25, 2015
5,803
4,777
136
Nvidia's marketing is probably just behind Apple's in brainwashing power. Some people actually believe Nvidia more than independent benchmarks. They are still quoting ridiculous performance and perf/W claims from marketing slides that were proven to be complete horse*. It's quite mind-boggling.
Simplest possible calculations.

RTX 3080 is 25-30% faster than RTX 2080 Ti.

The RTX 3080 has 68 SMs and massive bandwidth.

So how come, suddenly, a 44 CU GPU will achieve RTX 2080 Ti performance, considering that the RTX 3080 has 54%(!) more SMs? And it does not use GDDR6X, only GDDR6.

How will it mitigate the undeniable lack of hardware? Nvidia's magic?

So maybe that 40 CU chip has a much smaller hill to climb, despite what people want to believe?
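As a rough sanity check of that scaling argument, here is the arithmetic under naive linear scaling by SM count (my simplification; real parts usually scale sub-linearly with unit count, so a smaller die would likely land somewhat above this estimate):

```python
# Naive unit-count scaling check for the comparison above.

sm_3080 = 68
sm_small = 44                        # the 44 CU/SM part discussed above
perf_3080_vs_2080ti = 1.275          # midpoint of the quoted 25-30%

ratio = sm_3080 / sm_small           # ~1.55, i.e. roughly the "54% more SMs"
perf_small_vs_2080ti = perf_3080_vs_2080ti / ratio

print(f"3080 has {ratio - 1:.0%} more SMs")                      # ~55%
print(f"Linear scaling puts a {sm_small} SM part at "
      f"~{perf_small_vs_2080ti:.2f}x a 2080 Ti")                 # well short of 1.0x
```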
 

HurleyBird

Platinum Member
Apr 22, 2003
2,760
1,455
136
And it never will achieve its potential.

Nvidia simply just GCN'd their gaming architecture. It behaves EXACTLY like GCN in games, and EXACTLY like GCN in compute. Monstrous in compute, mediocre in games, with insane inefficiency. The performance increase in compute is not reflected in gaming, for the same exact reasons GCN never reflected its compute performance in games.

I'm not sure. GA100 is only designed for compute, but doesn't have the doubling of fp32 resources. fp32 doubling in Ampere seems to be aimed at gaming specifically.

But to play devil's advocate, it's obvious that Nvidia's implementation of fp32 doubling could be better for gaming too.

The 50% ratio between fp32 and int/fp32 hybrid resources seems arbitrary. How many games are going to utilize 50% int? One or two edge cases? Zero? And you can ask the same question about how many games will use 0% int, because if you use any int some of those extra fp32 transistors will go to waste.

No, what you want, and I wouldn't be surprised to see with Hopper, is some percentage of pure fp32 cores, some percentage of hybrid cores, and some percentage of pure int cores. Maybe something like 60%, 25%, and 15% respectively.
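To make that trade-off concrete, here is a toy packing model. The 50/50 and 60/25/15 splits come from the discussion above, but the model itself and the instruction mixes are purely illustrative assumptions, not a description of any real scheduler.

```python
# Toy model: a machine has FP-only, hybrid (FP or INT) and INT-only
# issue slots (normalized to 1.0 total). For a mix with INT fraction
# `int_frac`, find the highest issue rate at which the mix still fits.

def throughput(fp_only, hybrid, int_only, int_frac):
    lo, hi = 0.0, 2.0
    for _ in range(60):                                    # bisection on the rate
        t = (lo + hi) / 2
        fp_spill = max(t * (1 - int_frac) - fp_only, 0.0)  # FP work needing hybrids
        int_spill = max(t * int_frac - int_only, 0.0)      # INT work needing hybrids
        if fp_spill + int_spill <= hybrid:
            lo = t
        else:
            hi = t
    return lo

ampere_like = (0.5, 0.5, 0.0)       # half FP-only, half FP/INT hybrid
hypothetical = (0.6, 0.25, 0.15)    # the 60/25/15 split floated above

for int_frac in (0.0, 0.26, 0.5):   # 0.26 ~ Nvidia's 36 INT per 100 FP figure
    a = throughput(*ampere_like, int_frac)
    h = throughput(*hypothetical, int_frac)
    print(f"INT fraction {int_frac:.2f}: 50/50 -> {a:.2f}, 60/25/15 -> {h:.2f}")

# At the ~0.26 mix both configurations reach full issue rate, but the
# 60/25/15 machine does so with only a quarter of its slots paying for
# dual FP/INT capability, which is the area argument being made above.
```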
 

Saylick

Diamond Member
Sep 10, 2012
3,532
7,858
136
I'm not sure. GA100 is only designed for compute, but doesn't have the doubling of fp32 resources. fp32 doubling in Ampere seems to be aimed at gaming specifically.

But to play devil's advocate, it's obvious that Nvidia's implementation of fp32 doubling could be better for gaming too.

The 50% ratio between fp32 and int/fp32 hybrid resources seems arbitrary. How many games are going to utilize 50% int? One or two edge cases? Zero? And you can ask the same question about how many games will use 0% int, because any time you use any int some of those extra fp32 transistors are going to waste.

No, what you want, and I wouldn't be surprised to see with Hopper, is some percentage of pure fp32 cores, some percentage of hybrid cores, and some percentage of pure int cores. Maybe somewhere around 60%, 25%, and 15% respectively.
You've hit the nail on the head. The tricky part is that on a given clock cycle, it's hard to pinpoint how much INT and FP is used. Nvidia says that for every 100 FP operations there are 36 INT operations, but that's clearly an average over some timeframe of a typical workload. On some cycles the GPU doesn't need as much INT, and on others it needs more.

I'm not sure what the ideal balance of dedicated FP, dedicated INT, and FP/INT cores would be, but a logical approach is to profile the typical workload and determine the INT/FP ratio on each clock cycle. Then you plot a histogram to see what the distribution looks like, and build a GPU where the hybrid cores cover one or two standard deviations out from the mean INT/FP ratio, so that you can swing towards more INT or more FP in the vast majority of cases. A parametric study could then be conducted to determine the optimal perf-per-area balance of hybrid cores versus dedicated FP and INT cores.
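As a sketch of that sizing approach: profile the per-cycle INT share, look at its distribution, and let the hybrid pool cover roughly a two-standard-deviation swing around the mean. The trace below is synthetic (a normal distribution centred on Nvidia's 36:100 figure); a real study would of course use profiled workloads.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-cycle INT fraction samples, mean ~0.26 (36 INT : 100 FP).
int_share = np.clip(rng.normal(loc=0.26, scale=0.08, size=100_000), 0.0, 1.0)

mean, std = int_share.mean(), int_share.std()
lo, hi = max(mean - 2 * std, 0.0), min(mean + 2 * std, 1.0)

# Dedicated INT units cover the INT demand that is (almost) always there,
# dedicated FP units cover the FP demand that is (almost) always there,
# and hybrid units cover the +/- 2 sigma swing in between.
int_only = lo
fp_only = 1.0 - hi
hybrid = hi - lo

print(f"mean INT share {mean:.2f}, std {std:.2f}")
print(f"suggested split: FP {fp_only:.2f}, hybrid {hybrid:.2f}, INT {int_only:.2f}")
```

With these made-up numbers the split comes out around 58/32/10, in the same ballpark as the 60/25/15 figure mentioned earlier; the point is the method, not the exact numbers.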
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,340
5,464
136
I'm not sure. GA100 is only designed for compute, but doesn't have the doubling of fp32 resources. fp32 doubling in Ampere seems to be aimed at gaming specifically.

It's really over-provisioned in FP32 for common gaming resolutions, since they didn't really double ROPs/memory BW and other functional units to support all those extra FP32 cores.

A couple of reviewers have noted that this is the reason it does so much better at 4K (even when not CPU limited): FP32 needs increase disproportionately at higher resolutions. That's probably why Nvidia is marketing 8K with the 3090.

While it is FP32 overkill, it is still an efficient design. They took the INT32 unit and added FP32 capability, so there is significant reuse of die area, like register and cache space. It's cheaper than adding a completely separate FP32 unit.
 

Zstream

Diamond Member
Oct 24, 2005
3,395
277
136
It's odd but some people seem to want benchmarks done at:

4K for CPU tests.
Under 4K for GPU tests.

Which is precisely backwards from where the differences emerge. :confused_old:

Heck if I know; I frankly don't care about lusting over hardware anymore. I'm sure AMD will have plenty of SKUs to choose from and will compete just fine with Nvidia.
 

dzoni2k2

Member
Sep 30, 2009
153
198
116
Heck if I know; I frankly don't care about lusting over hardware anymore. I'm sure AMD will have plenty of SKUs to choose from and will compete just fine with Nvidia.

They sure set the bar low by cheaping out and skipping 7nm. Everyone at AMD is probably grinning from ear to ear right now. First Intel and the 10nm debacle, now Nvidia with Fermi v2 on 8nm. Mobile is going to be an absolute massacre for Nvidia.
 

Qwertilot

Golden Member
Nov 28, 2013
1,604
257
126
Mobile is certainly going to be rather interesting, and much more market relevant than the very top cards.

I would be very surprised if it were a massacre. NV started genuinely significantly ahead, and they did get a die shrink and they do know how important the mobile market is.

I'm fairly confident they'll get a decent performance improvement out for mobile.

AMD might well get some market share back though.