
[bitsandchips]: Pascal to not have improved Async Compute over Maxwell

Well for one thing the gains of async compute are dependent on how geometry limited you are ...

No, it doesn't. Or rather: maybe on AMD hardware.

If the Xbox One was able to see moderate gains from async compute with a ratio of 768 ALU ops per rasterized triangle, then a 1080 should see similar gains to a 290, since they both have a ratio of 1280 ALU ops per rasterized triangle ...

It doesn't work this way. Geometry performance is limited by the compute units on nVidia hardware. Every unit works on one vertex and can output 4 pixels per clock. Each rasterizer can output 16 pixels per clock. So when you look at GP104, it could output 80 pixels per clock because it has 20 compute units. But in the end it has only 64 ROPs and only four rasterizers.

So, to see any gains you need to be graphics limited and not geometry limited.
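The per-clock numbers above can be sanity-checked with a bit of arithmetic. This is a sketch using only the figures quoted in the post (the poster's claims, not official NVIDIA specs):

```python
# Figures quoted in the post (the poster's claims, not official specs):
# GP104 with 20 compute units at 4 pixels/clock each, four rasterizers
# at 16 pixels/clock each, and 64 ROPs (1 pixel/clock each).
sm_rate = 20 * 4      # 80 pixels/clock coming out of the compute units
raster_rate = 4 * 16  # 64 pixels/clock through the rasterizers
rop_rate = 64         # 64 pixels/clock through the ROPs

# Whichever stage is slowest caps the whole pipeline.
bottleneck = min(sm_rate, raster_rate, rop_rate)
print(bottleneck)  # 64 - the rasterizers/ROPs, not the compute units, limit output
```

On these numbers the compute units could feed 80 pixels/clock, but only 64 can ever be emitted, which is the post's point about being rasterizer-bound rather than compute-bound.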

The fact that the 1080 is having trouble seeing definitive gains means that Nvidia still doesn't have proper hardware support for async compute ...

What is "proper hardware support"? Pascal supports it properly. You need a different workload on nVidia hardware to benefit from it in the same way.
You can't, for example, overshoot the GPU with compute workload, because there are only so many units on the GPU. With Pascal you can hide compute workload behind graphics, but at some point the gains will vanish.
 
A quantum bit doesn't flip flop. That's the very nature of quantum superposition. Quantum mechanics is clearly not your cup of tea.

I think the one who needs to read up on quantum computing is you. Bit flips absolutely can and do happen with qubits, however a bit flip with a qubit just means that you flip the probabilities instead of the states (you can also have sign flips with qubits).
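The amplitudes-vs-states distinction can be made concrete with a toy two-amplitude model of a qubit. Plain Python, purely illustrative; `x_gate`/`z_gate` are just names for the Pauli-X (bit flip) and Pauli-Z (sign flip) operations:

```python
# A qubit state is a pair of amplitudes (alpha, beta) with
# |alpha|^2 + |beta|^2 = 1. A bit flip (Pauli-X) swaps the amplitudes,
# so the *probabilities* of measuring 0 and 1 are exchanged; a sign
# flip (Pauli-Z) negates the amplitude of |1> without changing either
# probability.
def x_gate(state):
    alpha, beta = state
    return (beta, alpha)

def z_gate(state):
    alpha, beta = state
    return (alpha, -beta)

state = (0.6, 0.8)       # P(0) = 0.36, P(1) = 0.64
flipped = x_gate(state)  # P(0) = 0.64, P(1) = 0.36 - probabilities swap
print(flipped)           # (0.8, 0.6)
print(z_gate(state))     # (0.6, -0.8) - same probabilities, flipped sign
```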
 

Don't be naive, I was replying to someone who thinks quantum bits flip flop all the time, as an analogy to my non-existent change of opinion on async compute. Do you think he knows the difference between states and amplitudes?! He wrote "quantum qbit".. c'mon 🙂
 

I don't really care about whether or not his analogy was apt, I just wanted to point out that your claim that qubits don't flip was incorrect.
 
Also I never said async compute is useless, unnecessary, or anything of that sort, quite the contrary.
And yet you still constantly question its purpose, value and effectiveness while ignoring even engine developers who tweet about how they like async compute. Reminds me of the anti-DX12 campaign and the anti-Tessellation campaign, among others, in this forum.
Its value has been proven already, quite spectacularly even, and the best implementation practices will be worked out in the coming years. Just like DX12 and Tessellation.

But who am I to try showing people that they still don't understand (or don't want to understand) async compute?!
You know, good question actually. Who are you to show such confidence in your statements?
 
Only some cards have both DMA engines enabled. Last I checked, it was Quadros and Titans, and it was disabled on Geforce, though that could have changed. It also used to be the case that one engine was dedicated to uploads and one to downloads (when 2 are enabled), though that could also have changed.

I know in CUDA you could access the 2x DMA engines on the pro cards, which gives them beastly throughput when working with large scientific datasets, as keeping the transfers going constantly prevents the shaders from idling.
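A back-of-the-envelope model shows why the second copy engine matters: with one engine, upload, compute, and download serialize per batch; with separate upload/download engines, the transfers for neighbouring batches overlap with compute. This is a toy timing sketch, not real CUDA, and the numbers are arbitrary units:

```python
# Toy pipeline timing model (arbitrary units, not real hardware numbers).
def serial_time(batches, up, compute, down):
    # One DMA engine: each batch must upload, compute, then download in turn.
    return batches * (up + compute + down)

def overlapped_time(batches, up, compute, down):
    # Two DMA engines: in steady state each batch costs only the slowest
    # stage, plus the pipeline fill (first upload) and drain (last download).
    return up + batches * max(up, compute, down) + down

print(serial_time(10, 2, 3, 2))      # 70
print(overlapped_time(10, 2, 3, 2))  # 34 - compute-bound once the pipe fills
```

For a single batch the two models coincide (there is nothing to overlap); the win grows with the number of batches kept in flight.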

There is a problem with Kepler, Maxwell and Pascal: for some reason they aren't compatible with the Multi-Engine API currently. Some folks suggest it's because NV embeds a small ARM core that acts as a hardware scheduler for CUDA only, and this conflicts with DX12/Vulkan. The reason it's so speculative is that NV does not release detailed architecture documents.
 
The true experts are the ones that keep ignoring any factual statement I make because they don't know what to answer and make claims on my credibility without having any of their own. We could discuss technical stuff but we aren't because almost no one is engaging me in a technical discussion and the ones that do tend to agree with me (no surprise here).

I am still laughing at the very same experts that yesterday wanted to teach me the meaning of the word "standard" completely ignoring I wrote "de facto standard". I must assume they skipped school altogether?! Not sure, but it's a quite sad state of affairs indeed.
 
I know in CUDA you could access the 2x DMA engines on the pro cards, which gives them beastly throughput when working with large scientific datasets, as keeping the transfers going constantly prevents the shaders from idling.



There is a problem with Kepler, Maxwell and Pascal: for some reason they aren't compatible with the Multi-Engine API currently. Some folks suggest it's because NV embeds a small ARM core that acts as a hardware scheduler for CUDA only, and this conflicts with DX12/Vulkan. The reason it's so speculative is that NV does not release detailed architecture documents.


Source?
 

I dunno about Pascal - that's yet to be seen, but it's pretty well known at this point that there's some deficiency/incompatibility/whatever with Kepler/Maxwell and the way AMD/DX12 define async compute (as they have a very similar capability in CUDA etc. from GK110 onwards).
 
Here is how I see it in perspective.

The VR gear will move with the consoles. It's the type of games suited for those platforms. And the pricing is low here. Only porn can move more gear and software, and that's probably some years out.

The games will be made for it. They need to offload the CPU. There is a need for the functionality here. Like needed, not nice to have.
There will probably be +1000% as many people with GCN-based VR gear in a year as with Maxwell/Pascal, because of the PS4 Neo and the MS Xbox.
Now who cares about the granularity of the preemption on some fancy high-end cards for PC, in some slap-on solution that is changed in 2 years anyway. CUDA with perfect preemption pixel granularity?
It's for another market, and the mass market is not going that way. Sony and Microsoft ensure that.
We have seen it over the last 3 years. The new consoles will just emphasize that development.
 

It's true it wouldn't look as good. The higher a GPU's utilization for a certain application, the less potential benefit there will be from asynchronous compute.

Not useless, but less beneficial. Nobody said it's useless on any hardware.
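That relationship can be stated as a simple ceiling: if the shaders are already busy a fraction `u` of the frame, async compute can at best back-fill the idle `1 - u`. An idealized model that ignores scheduling overhead and resource contention:

```python
# Idealized ceiling on async-compute speedup: if the shader cores are
# busy a fraction `u` of the frame, at best the idle (1 - u) fraction
# can be back-filled with compute work, so total useful work per frame
# rises from u to 1.0.
def max_async_speedup(utilization):
    return 1.0 / utilization

print(round(max_async_speedup(0.70), 2))  # 1.43 - plenty of headroom
print(round(max_async_speedup(0.95), 2))  # 1.05 - little left to gain
```

At 100% utilization the ceiling is 1.0x, i.e. no gain at all, which is exactly the "less beneficial, not useless" point above.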
 
No, it doesn't. Or rather: maybe on AMD hardware

Says who? You?

AMD was the one who practically invented async compute's original purpose and spec ...

If you don't like that fact then don't get worked up over it in the first place, since your favourite IHV is dodgy about the subject at best ...

It doesn't work this way. Geometry performance is limited by the compute units on nVidia hardware. Every unit works on one vertex and can output 4 pixels per clock. Each rasterizer can output 16 pixels per clock. So when you look at GP104, it could output 80 pixels per clock because it has 20 compute units. But in the end it has only 64 ROPs and only four rasterizers.

So, to see any gains you need to be graphics limited and not geometry limited.

Now you don't know what you're talking about anymore. How exactly is the performance of the rasterizer and setup engines dependent on the SMs?

I can't believe the amount of misinformation you're spreading; you really don't know the hardware statistics, do you?

What is "proper hardware support"? Pascal supports it properly. You need a different workload on nVidia hardware to benefit from it in the same way.
You can't, for example, overshoot the GPU with compute workload, because there are only so many units on the GPU. With Pascal you can hide compute workload behind graphics, but at some point the gains will vanish.

No, you *wish* Pascal truly supported it. Pascal might need a heavier amount of rasterizer work to stress the geometry processors enough for them to be the bottleneck, but the effect of async compute is really straightforward ...

You want to overlap work from the rasterizer/texture sampler with the shaders ...
 
Of course he's deliberately misunderstanding. Not only did I never talk about perfect utilization, I was also clearly referring to that specific application. Not to mention I wrote many times before that async compute is a *great feature*. Unfortunately it's also the most opportunistically misunderstood feature.

Jesus H.
fully utilizing.
You realize that fully and perfect are synonymous right? You are being obtuse because logically you know you said some absolutely dumb stuff.
 

He said BETTER at fully utilizing. You can't be better or worse at being perfect, it just means better.
 

Not trying to nitpick words, but you do realize there's no better than being fully at something, right?

He could've phrased it "better at utilizing" to avoid the contradiction, although that alone would sprout another entirely different discussion.
 

That's exactly what I'm saying. Better at fully utilizing doesn't make sense in a literal way. It just means closer to fully utilizing.
 
To take it a step further, utilization isn't static. "Better at fully utilizing" could mean, over time, the utilization stays at 100% more often.
 
I can't believe we have to pull out a dictionary. You cannot be better at being fully. Fully means totally or completely.
 
The original purpose of the AC feature is not to run graphics + compute on the SPs or CUDA cores.

It is really all about running Rasterizer- and DMA-dependent workloads WHILE the SP/CC is running, or vice versa. It simply could not be done on older APIs and hardware not designed for this level of Multi-Engine.

Shader utilization is the least useful aspect of Async Compute. On consoles, where they have low-SP-count GPUs, the shaders are not suffering idle times like the Fury X does, but it's here where AC has the biggest gains, and it's because the devs focus on making Rasterizer & DMA workloads run concurrently. In effect, the entire GPU is being loaded concurrently.

When the people who helped design GCN with AMD talk, Async Compute is never about improving shader utilization, since that is NOT a problem with the hardware specs on consoles and the API they use.

Heck, DICE even talks about how they do aggressive scene culling via Compute Shaders while the scene is heavily using the Rasterizer. This Multi-Engine approach is only possible with Async Compute in hardware.

http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/
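The idea in the linked Frostbite article — rejecting triangles on the compute queue before they ever reach the rasterizer — can be sketched in miniature. Plain Python stands in for a compute shader here, and only back-face culling is shown; the real pipeline also does frustum, cluster and small-triangle culling:

```python
# Miniature stand-in for a culling compute shader: reject back-facing
# triangles before they would ever reach the rasterizer.
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def backface_cull(triangles, view_dir=(0.0, 0.0, -1.0)):
    """Keep only triangles whose face normal points toward the viewer."""
    visible = []
    for v0, v1, v2 in triangles:
        normal = cross(sub(v1, v0), sub(v2, v0))
        if dot(normal, view_dir) < 0:  # normal opposes the view direction
            visible.append((v0, v1, v2))
    return visible

tris = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),  # counter-clockwise: faces the viewer
    ((0, 0, 0), (0, 1, 0), (1, 0, 0)),  # clockwise: back-facing, culled
]
print(len(backface_cull(tris)))  # 1
```

Running this kind of filter asynchronously while the rasterizer chews on the previous batch is exactly the concurrent Rasterizer-plus-Compute pattern the post describes.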
 
On consoles, where they have low-SP-count GPUs, the shaders are not suffering idle times like the Fury X does.
...
When the people who helped design GCN with AMD talk, Async Compute is never about improving shader utilization, since that is NOT a problem with the hardware specs on consoles and the API they use.

Your Dice example contradicts this...

Heck, DICE even talks about how they do aggressive scene culling via Compute Shaders while the scene is heavily using the Rasterizer.

That is by definition increasing utilization.
 