I think AMD's "fine wine" was partly a product of GCN sticking around for so long, which made it easy to maintain support for older cards without much of a work commitment. I suspect it also has a bit to do with Nvidia being willing to stop spending as much effort on its older cards, which reinforces the perception of AMD aging so much better.
It was really with Kepler (rather than Maxwell or Pascal) that Nvidia moved away from putting as much focus on compute in its gaming cards, just as AMD was launching GCN, which incorporated more compute in response to what Nvidia had done in previous generations. Nvidia's architectures, being much more gaming-focused and not held back by constraints meant to enable better compute performance, generally fared better than their AMD counterparts of the time. While some of that could be attributed to AMD growing more and more cash-starved, I don't think that's the main reason.
Now we see a similar situation, where Nvidia has created one architecture designed to combine compute and gaming, while AMD has separated them. I don't think it's too surprising that we're seeing the same result, just as AMD failed with Bulldozer when it attempted a design similar to Intel's NetBurst, with a longer pipeline and higher clock speeds, and fared just as poorly.
Yeah, GCN's FineWine, or any GPU architecture's potential to be FineWine, isn't a function of how compute-focused the architecture is; it's a product of optimization over time. The reason Kepler did so badly in the long run is the same reason AMD ditched Terascale: both of those architectures threw a ton (and I mean a ton) of execution units at the problem while relying primarily on instruction-level parallelism (ILP), extracted by a software compiler, to keep the units fed. If the compiler was unoptimized or out of date, performance fell off a cliff. AMD and Nvidia tackled this problem in different ways in the architectures that followed Terascale and Kepler.
AMD ditched the software scheduler entirely and went with something closer to Fermi, adding back a hardware scheduler and using thread-level parallelism (TLP) to keep the units fed. For compute tasks, which are what GCN and Fermi were designed to tackle, it's harder to extract ILP ahead of time via a software scheduler/compiler because compute workloads are typically heavy with dependent instructions, so it's better to use a CPU-esque approach and keep GPU utilization high by simply switching threads when one bogs down.
Nvidia, on the other hand, tackled Kepler's crummy utilization from the other side of the spectrum. Instead of trying to fix the problem on the compiler/scheduling side, they tackled it in silicon. An SMX with 4 warp schedulers presiding over a shared bank of 192 ALUs was simply hard to keep fed: a warp is 32 threads, and multiplying the number of warp schedulers by 32 gives only 128 threads per clock. Keeping all 192 ALUs constantly fed by only 4 warp schedulers was intuitively going to be an issue, because there were simply too many mouths to feed relative to the number of hands feeding them. Kepler was also an "all or nothing" ordeal in the sense that the entire SMX was one cohesive block; there was no granularity smaller than the SMX itself. If you only needed a quarter of the execution units, the entire SMX and all of its logic had to be powered on, leaving the remainder more or less idle while burning energy for no reason.

Maxwell addressed this by partitioning the SM into 4 smaller blocks, each containing only 1 warp scheduler and 32 ALUs, which meant the SMM had 128 ALUs to the Kepler SMX's 192. In theory, all else being equal, this reduction in ALUs per SM would mean a degradation in performance, and on a per-SM basis it did, but an SMM with 128 ALUs could still provide about 90% of the performance of an SMX with 192 ALUs. That's a testament to how underutilized Kepler's SMX was whenever the compiler wasn't sharp enough to work around it. Nvidia then just threw more SMMs into the GPU along with more advanced memory compression, and voila: roughly 35% IPC gains and higher clocks with Maxwell, thanks to the SMM's better energy efficiency.
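The mouths-vs-hands arithmetic above can be sketched in a few lines. This is deliberately simplified: it uses the published unit counts (4 schedulers, 32-thread warps, 192 vs 128 ALUs) but ignores Kepler's dual-issue dispatch, which is exactly the ILP the compiler had to find to close the gap.

```python
# Back-of-envelope utilization ceiling: what fraction of an SM's ALUs
# can be covered by thread-level issue alone (schedulers x warp size),
# ignoring any ILP the compiler manages to extract.

def scheduling_ceiling(schedulers: int, warp_size: int, alus: int) -> float:
    """Fraction of ALUs coverable per clock by warp issue alone."""
    return min(1.0, schedulers * warp_size / alus)

kepler_smx = scheduling_ceiling(schedulers=4, warp_size=32, alus=192)
maxwell_smm = scheduling_ceiling(schedulers=4, warp_size=32, alus=128)

print(f"Kepler SMX ceiling without ILP: {kepler_smx:.0%}")   # 67%
print(f"Maxwell SMM ceiling without ILP: {maxwell_smm:.0%}") # 100%
```

That idle third of the SMX is the slack Kepler's compiler had to fill, and the slack Maxwell's partitioning designed away.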
Going back to explaining FineWine: as history showed, GCN, while great for compute, had its problems with gaming workloads, since it executed a full wavefront (64 threads) over 4 cycles, which had detrimental effects in latency-sensitive workloads like gaming. It took years of AMD sticking with GCN for the driver optimizations to mature enough for GCN to really start to shine. Kepler, on the other hand, had fundamental architectural issues and got left in the dust once Maxwell came out. It's almost as if Kepler swung too far to the Terascale side of the spectrum and Maxwell reeled it back. From a scheduling standpoint, I don't believe Maxwell, Pascal, or any modern Nvidia architecture uses a hardware scheduler, but the underlying ratio of 1 warp scheduler per group of 32 ALUs pretty much hasn't changed since Maxwell, so in a way they've already had years' worth of "FineWining" built in, even as the architecture itself has changed. If someone knows whether any modern Nvidia architecture still uses pure software scheduling, I would love to be informed.
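The 4-cycle point comes straight from the widths involved: a GCN SIMD is 16 lanes wide, so pushing a 64-thread wavefront through it takes 64 / 16 = 4 cycles per instruction, where a Maxwell-style 32-wide partition retires a 32-thread warp's instruction every cycle. A tiny sketch of that arithmetic (pure back-of-envelope; real latency hiding also depends on occupancy and instruction mix):

```python
# Cycles to issue one instruction for a full thread group, given the
# group size and the width of the SIMD it executes on.

def cycles_per_instruction(group_size: int, simd_width: int) -> int:
    """Ceiling of group_size / simd_width."""
    return -(-group_size // simd_width)  # ceiling division

gcn = cycles_per_instruction(group_size=64, simd_width=16)      # 4 cycles
maxwell = cycles_per_instruction(group_size=32, simd_width=32)  # 1 cycle

print(f"GCN wavefront: {gcn} cycles per instruction")
print(f"Maxwell warp: {maxwell} cycle per instruction")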