[AMD_Robert] Concerning the AOTS image quality controversy


renderstate

Senior member
Apr 23, 2016
237
0
0
http://semiaccurate.com/forums/showthread.php?p=263871#post263871

I think you should read this post from Fottemberg, about FP16 on Pascal GPUs.

The post is from the 29th of May, a few days before the AMD presentation and the benchmarks. We have to wait for the results of Oxide's investigation before concluding anything.



Fottemberg has no idea what he's talking about. HLSL only supports FP16 variables as a minimum precision requirement. Basically, a shader might indicate that some variable can be represented using *AT LEAST* FP16, but it doesn't mandate that calculations be performed using FP16 math. If a GPU doesn't support FP16 at all, or has slow FP16 support, it can simply run everything using FP32 math. (Details here: https://msdn.microsoft.com/en-us/library/windows/desktop/hh968108(v=vs.85).aspx)

If AotS uses FP16 hints (because it makes things faster on GCN due to reduced register pressure), then it's far more likely that the different results between the RX 480 and the 1080 are due to the latter running computations at *higher precision* using FP32 math.
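For illustration only (none of this is from AotS or the original post; terrain_blend is a made-up toy function), here's a rough numpy sketch of how the exact same shader-style math can give visibly different results when one GPU honors a reduced-precision hint and another runs everything in FP32:

Code:
# Toy stand-in for a terrain/snow-coverage function; NOT Oxide's shader code.
# A GPU honoring a min-precision hint evaluates in ~FP16; one ignoring the
# hint evaluates the same expression in FP32, so the outputs can differ.
import numpy as np

def terrain_blend(height, snow_line, dtype):
    h = np.asarray(height, dtype=dtype)
    s = dtype(snow_line)
    return np.clip((h - s) * dtype(10.0), dtype(0.0), dtype(1.0))

heights = np.linspace(0.70, 0.80, 5)
fp32 = terrain_blend(heights, 0.7531, np.float32)
fp16 = terrain_blend(heights, 0.7531, np.float16)

print(fp32)
print(fp16.astype(np.float32))
print("max abs difference:", np.max(np.abs(fp32 - fp16.astype(np.float32))))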

Of course the fanboys of the world are already thinking about conspiracies and cheats when, as usual, they have no clue.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
So basically you have explained everything about FP16, but... you did not contradict anything Fottemberg said.

What's more, it is most likely that you have confirmed what he meant. Nothing in that spec says anything about FP16 performance on GP104, either native or simulated through FP32.

Secondly, I am not implying that this is confirmation of a "conspiracy theory" or anything. We have to wait and see what Oxide's investigation of this matter brings to the table. Fottemberg is, or could be, onto something: native FP16 performance may be slower than FP16 simulated through the FP32 cores on GP104. That is what he meant.

What I meant is that there is quite an interesting context to all this, and it may answer the question of why Pascal GPUs render the terrain incorrectly using the FP16 shaders. Is it the drivers? Did Nvidia change the code for their GPUs? We will find out after Oxide's investigation.
 

renderstate

Senior member
Apr 23, 2016
237
0
0
So basically you have explained everything about FP16, but... you did not contradict anything Fottemberg said.

What's more, it is most likely that you have confirmed what he meant. Nothing in that spec says anything about FP16 performance on GP104, either native or simulated through FP32.

First of all, I have indeed contradicted what he says, because what he says is based on his delusions about how FP16 works in DX. He hasn't even bothered checking the MS website.

Second, there is NO SIMULATED FP16 through FP32. Read the specs again. You don't need to simulate FP16 if HW doesn't support it. You just ignore it and perform calculations at full precision. FP16 is a min precision requirement in HLSL, that's all. A compiler can simply throw away all FP16 minprec hints, which is very likely what NVIDIA does. I bet those images looked better on NVIDIA because, surprise, everything was computed in full precision. Not a simulation of FP16 done with FP32, just pure FP32.
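To make the "no emulation" point concrete, here's a rough numpy sketch of the min-precision semantics as I read them from the MSDN page linked above (an illustration only, not actual compiler behaviour): honoring the hint means an intermediate may be kept at reduced precision, while ignoring it means the shader math simply runs in FP32 exactly as written.

Code:
import numpy as np

def shade(x, honor_min_precision_hint):
    # The hint only says "FP16 is enough here"; it never forces FP16 math.
    x = np.float32(x)
    if honor_min_precision_hint:
        # HW with fast FP16 may keep the intermediate at reduced precision.
        x = np.float16(x)
    # HW without FP16 support discards the hint: nothing is emulated, the
    # expression below just runs in plain FP32.
    return np.float32(x) * np.float32(0.1) + np.float32(0.9)

print(shade(1.2345678, honor_min_precision_hint=False))  # pure FP32 result
print(shade(1.2345678, honor_min_precision_hint=True))   # reduced-precision result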

Fottemberg is, or could be, onto something: native FP16 performance may be slower than FP16 simulated through the FP32 cores on GP104. That is what he meant.
He is onto nothing because he doesn't understand the specs. You can make an effort yourself and read the page I linked in the previous post.

What I meant is that there is quite an interesting context to all this, and it may answer the question of why Pascal GPUs render the terrain incorrectly using the FP16 shaders. Is it the drivers? Did Nvidia change the code for their GPUs? We will find out after Oxide's investigation.

UFOs are a more likely explanation than anything Fottemberg maliciously hypothesized without understanding how DX works.
 

Abwx

Lifer
Apr 2, 2011
11,888
4,874
136
Second, there is NO SIMULATED FP16 through FP32. Read the specs again. You don't need to simulate FP16 if HW doesn't support it. You just ignore it and perform calculations at full precision. FP16 is a min precision requirement in HLSL, that's all. A compiler can simply throw away all FP16 minprec hints, which is very likely what NVIDIA does. I bet those images looked better on NVIDIA because, surprise, everything was computed in full precision. Not a simulation of FP16 done with FP32, just pure FP32.

If a computation uses the FP16 format and is done in an FP32 execution unit, the result will still be an FP16 word, so I don't see how you can talk about full precision here, but for sure the execution will not be efficient energy-wise.

Besides, where did you get the "image looks better" on Nvidia, given that the landscape wasn't rendered as it should be, according to Hallock?
 

96Firebird

Diamond Member
Nov 8, 2010
5,743
340
126
So many words, and nothing said. As always riveting meta commentary. :thumbsdown:

The truth hurts when it slaps ya right in the face, doesn't it? :D

I'm sorry what I posted isn't what you wanted to hear; the only thing you can do about it is put me on ignore. :(
 

renderstate

Senior member
Apr 23, 2016
237
0
0
If a computation uses the FP16 format and is done in an FP32 execution unit, the result will still be an FP16 word, so I don't see how you can talk about full precision here, but for sure the execution will not be efficient energy-wise.
No, there is no FP16 format in HLSL for variables, only hints that you can reduce precision if you want/can. Is it really that hard to understand?
From the specs:

Code:
hardware can ignore the minimum precision indicators and run at full 32-bit precision. When your shader code is used on hardware that takes advantage of minimum precision, you use less memory bandwidth and as a result you also use less system power as long as your shader code doesn’t expect more precision than it specified.

There is no strict FP16 requirement and no emulation! You can simply ignore the hints on HW that doesn't support FP16.

Besides, where did you get the "image looks better" on Nvidia, given that the landscape wasn't rendered as it should be, according to Hallock?


I saw some comparisons online and I thought it looked better and more detailed on the 1080, but I understand this is a subjective matter, although it fits perfectly with how FP16 objectively works in DX.
 

Abwx

Lifer
Apr 2, 2011
11,888
4,874
136
No, there is no FP16 format in HLSL for variables, only hints that you can reduce precision if you want/can. Is it really that hard to understand?
From the specs:

Code:
hardware can ignore the minimum precision indicators and run at full 32-bit precision. When your shader code is used on hardware that takes advantage of minimum precision, you use less memory bandwidth and as a result you also use less system power as long as your shader code doesn’t expect more precision than it specified.
There is no strict FP16 requirement and no emulation! You can simply ignore the hints on HW that doesn't support FP16.

Of course we can perform operations on 16-bit formats in a 32-bit arithmetic unit; it's just that only half of the unit will be used effectively, but the output will still be 16-bit numbers. This is like saying that we can do addition of small numbers on a calculator that can handle numbers of 10x bigger magnitude...

So what is ignored is that the numbers are 16-bit and are just computed by setting bits 16 to 31 to 0; that is a 32-bit number that does not exceed the 16-bit max value.


I saw some comparisons online and I thought it looked better and more detailed on the 1080, but I understand this is a subjective matter, although it fits perfectly with how FP16 objectively works in DX.

Actually part of the pic is missing with the 1080 :

At present the GTX 1080 is incorrectly executing the terrain shaders responsible for populating the environment with the appropriate amount of snow.

The content being rendered by the RX 480--the one with greater snow coverage in the side-by-side (the left in these images)--is the correct execution of the terrain shaders.
http://forums.anandtech.com/showpost.php?p=38264374&postcount=1
 

maddie

Diamond Member
Jul 18, 2010
5,160
5,552
136
No, there is no FP16 format in HLSL for variables, only hints that you can reduce precision if you want/can. Is it really that hard to understand?
From the specs:

Code:
hardware can ignore the minimum precision indicators and run at full 32-bit precision. When your shader code is used on hardware that takes advantage of minimum precision, you use less memory bandwidth and as a result you also use less system power as long as your shader code doesn’t expect more precision than it specified.
There is no strict FP16 requirement and no emulation! You can simply ignore the hints on HW that doesn't support FP16.




I saw some comparisons online and I thought it looked better and more detailed on the 1080, but I understand this is a subjective matter, although it fits perfectly with how FP16 objectively works in DX.
If you are not seeing the terrain as the game designers intended it to look, how can you even begin to make this statement?
 

renderstate

Senior member
Apr 23, 2016
237
0
0
Of course we can perform operations on 16-bit formats in a 32-bit arithmetic unit; it's just that only half of the unit will be used effectively, but the output will still be 16-bit numbers. This is like saying that we can do addition of small numbers on a calculator that can handle numbers of 10x bigger magnitude...

So what is ignored is that the numbers are 16-bit and are just computed by setting bits 16 to 31 to 0; that is a 32-bit number that does not exceed the 16-bit max value.
First of all, that is not how floating point math works (at best that zero-padding picture applies to the mantissa, and only before any math is performed with it). Also, this is not what the spec document says, so I am not sure why you keep repeating the same wrong stuff. Let me spell it out for you one last time: *there are no FP16 data types for variables in HLSL, only FP16 HINTS that precision can be reduced if the compiler + HW want to do so*.

In DX the compiler can happily THROW AWAY the FP16 hints and pretend the developer never added them to the shader code. The specs explain it clearly! There is no approximation, emulation, cheating or hacks.
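A quick numpy check of the encoding point above (an added illustration, not from the original exchange): promoting a half-precision value to single precision re-encodes the sign/exponent/mantissa fields, rather than producing a 32-bit word with bits 16-31 simply zeroed.

Code:
import numpy as np

h = np.array([1.5], dtype=np.float16)   # FP16 value
f = h.astype(np.float32)                # same value promoted to FP32

print(format(int(h.view(np.uint16)[0]), '016b'))   # 0011111000000000 (FP16 bits)
print(format(int(f.view(np.uint32)[0]), '032b'))   # 00111111110000000000000000000000 (FP32 bits)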
 
Feb 19, 2009
10,457
10
76
If you are not seeing the terrain as the game designers intended it to look, how can you even begin to make this statement?

Lots of effects in modern games reduce the image quality in my view; too much post filtering ruins otherwise detailed textures and sharp visuals. But it's subjective. Still, there's no excuse: game rendering should be as the developers intend.

People who have too much faith in NV's specs or feature claims need to be reminded of the wrong hardware specs on the 970: missing ROPs, cache and bandwidth, along with the 3.5GB segmentation. Likewise on features, Maxwell's infamous lies about Async Compute... and NV is repeating the same rubbish about Pascal's support for Async Compute.

I mean, when you have renowned developers like id Software call out that only GCN has TRUE Async Compute, it's bad. It really highlights that NV has FAKE feature support.

It would not be a surprise if the FP16 throughput claims only apply to GP100 and not the other Pascal-lite variants such as GP104 and onwards. Note that their Pascal architecture claims were based on the P100 whitepaper released to the public, and that's GP100-based. Again, take NV's claims with a grain of salt.
 

maddie

Diamond Member
Jul 18, 2010
5,160
5,552
136
Lots of effects in modern games reduce the image quality in my view; too much post filtering ruins otherwise detailed textures and sharp visuals. But it's subjective. Still, there's no excuse: game rendering should be as the developers intend.

People who have too much faith in NV's specs or feature claims need to be reminded of the wrong hardware specs on the 970: missing ROPs, cache and bandwidth, along with the 3.5GB segmentation. Likewise on features, Maxwell's infamous lies about Async Compute... and NV is repeating the same rubbish about Pascal's support for Async Compute.

I mean, when you have renowned developers like id Software call out that only GCN has TRUE Async Compute, it's bad. It really highlights that NV has FAKE feature support.

It would not be a surprise if the FP16 throughput claims only apply to GP100 and not the other Pascal-lite variants such as GP104 and onwards. Note that their Pascal architecture claims were based on the P100 whitepaper released to the public, and that's GP100-based. Again, take NV's claims with a grain of salt.
I'm saving this page to show anyone claiming the most you will get from async compute is < 10%. 3-5 ms on a 16.67 ms frame time is an enormous advantage. That's like buying a 980 and getting a 980 Ti instead. The Pascal generation will not lose performance like Maxwell does, but it gains pretty little. Nothing like this.
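A quick back-of-the-envelope check of that claim (my arithmetic, not from the tweet quoted below), assuming a 60 Hz frame budget of ~16.67 ms:

Code:
# Rough arithmetic: how much of a 60 Hz frame budget a 3-5 ms saving represents.
frame_ms = 1000.0 / 60.0                         # ~16.67 ms per frame at 60 Hz
for saved_ms in (3.0, 5.0):
    share = saved_ms / frame_ms                  # fraction of the frame budget
    fps_after = 1000.0 / (frame_ms - saved_ms)   # fps if the saving translates directly
    print(f"{saved_ms} ms saved = {share:.0%} of the frame -> ~{fps_after:.0f} fps vs 60")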

Tiago Sousa @idSoftwareTiago @dankbaker @AndrewLauritzen @ryanshrout @Roy_techhwood @AMDRadeon Async is awesome, we gained about 3ms up to 5ms on consoles(huge for 60hz)


http://wccftech.com/async-compute-p...tting-performance-target-in-doom-on-consoles/
 

tential

Diamond Member
May 13, 2008
7,348
642
121
Lots of effects in modern games reduce the image quality in my view; too much post filtering ruins otherwise detailed textures and sharp visuals. But it's subjective. Still, there's no excuse: game rendering should be as the developers intend.

People who have too much faith in NV's specs or feature claims need to be reminded of the wrong hardware specs on the 970: missing ROPs, cache and bandwidth, along with the 3.5GB segmentation. Likewise on features, Maxwell's infamous lies about Async Compute... and NV is repeating the same rubbish about Pascal's support for Async Compute.

I mean, when you have renowned developers like id Software call out that only GCN has TRUE Async Compute, it's bad. It really highlights that NV has FAKE feature support.

It would not be a surprise if the FP16 throughput claims only apply to GP100 and not the other Pascal-lite variants such as GP104 and onwards. Note that their Pascal architecture claims were based on the P100 whitepaper released to the public, and that's GP100-based. Again, take NV's claims with a grain of salt.

No one cares when Nvidia consistently has the fastest GPUs out.
 

sirmo

Golden Member
Oct 10, 2011
1,014
391
136
Fottemberg has no idea what he's talking about. HLSL only supports FP16 variables as a minimum precision requirement. Basically, a shader might indicate that some variable can be represented using *AT LEAST* FP16, but it doesn't mandate that calculations be performed using FP16 math. If a GPU doesn't support FP16 at all, or has slow FP16 support, it can simply run everything using FP32 math. (Details here: https://msdn.microsoft.com/en-us/library/windows/desktop/hh968108(v=vs.85).aspx)

If AotS uses FP16 hints (because it makes things faster on GCN due to reduced register pressure), then it's far more likely that the different results between the RX 480 and the 1080 are due to the latter running computations at *higher precision* using FP32 math.

Of course the fanboys of the world are already thinking about conspiracies and cheats when, as usual, they have no clue.
I don't know that I agree with Fottemberg's thesis, but you're also missing the point. He's not talking about today's HLSL. He's clearly writing about the upcoming DX12 update with Shader Model 6.0.

The MS slide presentation for it clearly states that they are moving away from the min16 hints to native half-float scalars: http://techreport.com/r.x/2016_3_23_Microsoft_details_upcoming_Shader_Model_6_features/sm6_1.jpg

Whether this will have a negative impact on Pascal and Maxwell remains to be seen, however.

I tend to think, though, that this is purely a language change and that the compiler will still treat these as hints like before... but we won't know until we see the benchmarks.
 
Last edited:

dogen1

Senior member
Oct 14, 2014
739
40
91
I'm saving this page to show anyone claiming the most you will get from async compute is < 10%. 3-5 ms on a 16.67 ms frame time is an enormous advantage. That's like buying a 980 and getting a 980 Ti instead. The Pascal generation will not lose performance like Maxwell does, but it gains pretty little. Nothing like this.

I wouldn't get my hopes up that PC hardware will start getting these 30+% boosts. PC games don't get years of optimization for a single spec.
 
Last edited:
Feb 19, 2009
10,457
10
76
No one cares when Nvidia consistently has the fastest GPUs out.

Sure, no one cares, that's why NV is forced to shove fake Async Compute in their marketing PR out to the gullible masses.

You are overly pessimistic about AMD. AMD's market share is rising on obsolete stuff. AMD's share price is going up even though they have been under-performing. The turnaround is about to happen and everyone can see it coming.
 

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
I saw some comparisons online and I thought it looked better and more detailed on the 1080

Your/our visual preferences are deeply entwined in your/our DNA and psyche.

I'm leaning towards bad drivers currently. I'm also wondering how the fix affects performance in AoTS, as well as performance overall.
 

Slaughterem

Member
Mar 21, 2016
77
23
51
I don't know that I agree with Fottemberg's thesis, but you're also missing the point. He's not talking about today's HLSL. He's clearly writing about the upcoming DX12 update with Shader Model 6.0.

The MS slide presentation for it clearly states that they are moving away from the min16 hints to native half-float scalars: http://techreport.com/r.x/2016_3_23_Microsoft_details_upcoming_Shader_Model_6_features/sm6_1.jpg

Whether this will have a negative impact on Pascal and Maxwell remains to be seen, however.
And his point is that this will be in future game titles and Pascal will become outdated in 1-2 years. So everyone who buys Pascal now and in the next 8 months will have to buy Volta in 2018 because Pascal does not support native FP16 (half).
 
Last edited:

sirmo

Golden Member
Oct 10, 2011
1,014
391
136
And his point is that this will be in future game titles and Pascal will become outdated in 1-2 years. So everyone who buys Pascal now and in the next 8 months will have to buy Volta in 2018 because Pascal does not support native FP16 (half).
I do think that GCN is in a better position. Pascal and Maxwell seem to be optimized for today's/legacy workloads.

Given that AMD has managed to influence the industry with Mantle, that we're now seeing DX12, Vulkan and Metal all practically derived from Mantle, and that AMD controls game development on consoles, I do think that GCN performance will improve over time with newer titles.

And Nvidia will have no choice but to change their architecture to better match GCN.

Kind of interesting to see, because for as long as I can remember Nvidia has had the developers and AMD/ATI had to play catch-up; the tables look to be turned for the first time.
 
Last edited:

renderstate

Senior member
Apr 23, 2016
237
0
0
Still more nonsense from people desperate to see their beloved company prevail :)

Even if future versions of DX/HLSL introduce a first-class FP16 data type, current HW will continue to run in FP32 with up and down conversions where needed. No current discrete consumer-level GPU supports FP16 at double rate. The potential advantage of using FP16 is really about lowering register pressure and possibly reducing power by running computations in half precision. GCN is way more sensitive than Maxwell (and I suspect Pascal) to register pressure, and there is no reason why NVIDIA's shader compiler can't fit 2 half-precision registers into a single-precision one.

With regard to future-proof HW, I'd be more concerned about Polaris apparently having nothing new to accelerate VR applications, when NVIDIA is clearly investing a lot more in that direction. A slightly better async compute implementation won't help you much when your competitor is shading half of the vertices and almost half of the pixels when doing VR rendering. If future VR apps adopt single-pass stereo rendering and lens-matched shading it's going to be brutal...
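To illustrate the register-packing idea (a sketch of the storage trick only, not NVIDIA's or AMD's actual compiler behaviour; the function names are mine), two half-precision values fit in one 32-bit register slot, which is why FP16 can relieve register pressure even when the ALUs themselves only run at FP32 rate:

Code:
import numpy as np

def pack_half2(a, b):
    # Pack two FP16 bit patterns into one 32-bit word (one "register" worth).
    halves = np.array([a, b], dtype=np.float16).view(np.uint16)
    return int(halves[0]) | (int(halves[1]) << 16)

def unpack_half2(word):
    # Recover the two FP16 values from the packed 32-bit word.
    lo, hi = word & 0xFFFF, (word >> 16) & 0xFFFF
    return np.array([lo, hi], dtype=np.uint16).view(np.float16)

packed = pack_half2(1.5, -0.25)
print(hex(packed))           # a single 32-bit value holding both halves
print(unpack_half2(packed))  # [ 1.5  -0.25]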
 

renderstate

Senior member
Apr 23, 2016
237
0
0
BTW, on the same list of future features for HLSL I see a couple of things Pascal already supports at full speed and GCN doesn't (unless the RX 480 is still holding some surprises), like stereo system values and programmable blending (albeit via rasterizer ordered views), not to mention major DX12 features GCN still doesn't support, like conservative rasterization.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Still more nonsense from people desperate to see their beloved company prevail :)

Even if future versions of DX/HLSL introduce a first-class FP16 data type, current HW will continue to run in FP32 with up and down conversions where needed. No current discrete consumer-level GPU supports FP16 at double rate. The potential advantage of using FP16 is really about lowering register pressure and possibly reducing power by running computations in half precision. GCN is way more sensitive than Maxwell (and I suspect Pascal) to register pressure, and there is no reason why NVIDIA's shader compiler can't fit 2 half-precision registers into a single-precision one.

With regard to future-proof HW, I'd be more concerned about Polaris apparently having nothing new to accelerate VR applications, when NVIDIA is clearly investing a lot more in that direction. A slightly better async compute implementation won't help you much when your competitor is shading half of the vertices and almost half of the pixels when doing VR rendering. If future VR apps adopt single-pass stereo rendering and lens-matched shading it's going to be brutal...

Yep, Nvidia is quite good at not rendering part of the scene to improve performance as we've seen.
 

sirmo

Golden Member
Oct 10, 2011
1,014
391
136
Still more nonsense from people desperate to see their beloved company prevail :)

Even if future versions of DX/HLSL introduce a first-class FP16 data type, current HW will continue to run in FP32 with up and down conversions where needed. No current discrete consumer-level GPU supports FP16 at double rate. The potential advantage of using FP16 is really about lowering register pressure and possibly reducing power by running computations in half precision. GCN is way more sensitive than Maxwell (and I suspect Pascal) to register pressure, and there is no reason why NVIDIA's shader compiler can't fit 2 half-precision registers into a single-precision one.

With regard to future-proof HW, I'd be more concerned about Polaris apparently having nothing new to accelerate VR applications, when NVIDIA is clearly investing a lot more in that direction. A slightly better async compute implementation won't help you much when your competitor is shading half of the vertices and almost half of the pixels when doing VR rendering. If future VR apps adopt single-pass stereo rendering and lens-matched shading it's going to be brutal...
You are contradicting yourself. Packing two half-precision registers into one full-precision register would be a waste of cycles. Unless you're describing an existing hardware feature of the CUDA cores, I can hardly see this being a viable performance strategy. But in the same breath you're also boasting about Nvidia's lower register pressure, so why not use those extra registers then, instead of packing two half-float values into one?

Also, you're ignoring GCN's potential advantages in general throughput and cache efficiency when it comes to using 16-bit floats natively.

Anyway, I don't pretend to know either way how this will impact Paxwell, but there is definitely potential for performance regression in future DX12 games, a trend we've been observing for a while now. And it's not something that either one of us made up.

I can't comment on VR as we literally have no idea yet of all the changes Polaris brings. And I don't exactly trust AMD or Nvidia about the actual benefits of such features. We'll see in the actual real-world benchmarks.
 
Last edited:

renderstate

Senior member
Apr 23, 2016
237
0
0
You are contradicting yourself. Packing two half-precision registers into one full-precision register would be a waste of cycles. Unless you're describing an existing hardware feature of the CUDA cores, I can hardly see this being a viable performance strategy. But in the same breath you're also boasting about Nvidia's lower register pressure, so why not use those extra registers then, instead of packing two half-float values into one?
It's not mandatory to pack half-precision registers. All I said is that the compiler can do it just fine. In fact I suspect this is already what happens on GCN, since the CUs don't run at double speed when operating on FP16 data. If packing registers is not a good idea, then one can simply not do it.

Also, you're ignoring GCN's potential advantages in general throughput and cache efficiency when it comes to using 16-bit floats natively.
There's no throughput advantage; that's the whole point I was making. GP100 has it, but it's a tad expensive for gaming ;-)
 

beginner99

Diamond Member
Jun 2, 2009
5,318
1,763
136
You're completely remembering it wrong. SC and SC: Brood War ran like champs and were hindered by network performance in 4v4, not by the CPU. I used to LAN Brood War and play 8-player battles and it ran beautifully.

Kind of off-topic, but that's not true. I remember it perfectly. Because I was a kid back then and not in charge of the hardware, we had a Pentium 200 MHz for a pretty long time, and SC1 4v4 was a no-go. It would even start to lag in 1v1 given enough units (carriers...). Then we upgraded to a P4 and it ran just fine.