Doom to use Asynchronous Compute.


3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
Hyper-Q (with CUDA) does support concurrent graphics + compute[1], but it's not compatible with DX12. Apparently it may have something to do with resource barriers and the differences between CUDA and the DX12 equivalent.



I'm not sure how great the difference between the two is, but the conversation was referring to a time when DX12 had just been announced and the only thing we had to compare it to was Mantle. I'm not saying it was handled in the best possible way, but to be fair, there existed a feature that theoretically could have provided async compute at a time when we had no additional information about the capabilities of the hardware or the DX12 API.

[1] http://ext3h.makegames.de/DX12_Compute.html
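
For anyone who hasn't touched CUDA, here's a minimal sketch (my own illustration, assuming the standard CUDA runtime API; the kernel busyKernel is made up) of the kind of concurrency Hyper-Q enables: two independent kernels issued to separate streams, which the hardware may execute concurrently. Note that this is compute + compute; plain CUDA can't submit graphics work, which is exactly the part in dispute.

Code:
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)   // burn some ALU time
            v = v * 1.0001f + 0.0001f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent work on independent streams: with Hyper-Q these can
    // run concurrently instead of queueing behind one another.
    busyKernel<<<n / 256, 256, 0, s1>>>(a, n);
    busyKernel<<<n / 256, 256, 0, s2>>>(b, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}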

Thanks.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
Well, MS introduced fences, which Nvidia can't use: once a process is on the cores, it can't be stopped or flipped or jumped to another engine the way AMD's hardware allows. Nvidia's is a straightforward design.
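
For reference, a fence is just a cross-queue dependency: one queue signals completion, another waits on that signal before starting its own work. CUDA has no DX12 fence object as such, but a loose analogue (a sketch of mine, with made-up kernel names producer/consumer) is an event recorded on one stream that a second stream waits on:

Code:
#include <cuda_runtime.h>

__global__ void producer(float *buf) { buf[threadIdx.x] = (float)threadIdx.x; }
__global__ void consumer(float *buf) { buf[threadIdx.x] += 1.0f; }

int main() {
    float *buf;
    cudaMalloc(&buf, 256 * sizeof(float));

    cudaStream_t queueA, queueB;
    cudaStreamCreate(&queueA);
    cudaStreamCreate(&queueB);

    // The "fence": an event recorded on one stream that another
    // stream waits on before its own work may begin.
    cudaEvent_t fence;
    cudaEventCreateWithFlags(&fence, cudaEventDisableTiming);

    producer<<<1, 256, 0, queueA>>>(buf);
    cudaEventRecord(fence, queueA);         // "signal" on queue A
    cudaStreamWaitEvent(queueB, fence, 0);  // "wait" on queue B
    consumer<<<1, 256, 0, queueB>>>(buf);

    cudaDeviceSynchronize();
    cudaEventDestroy(fence);
    cudaStreamDestroy(queueA);
    cudaStreamDestroy(queueB);
    cudaFree(buf);
    return 0;
}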
 
Feb 19, 2009
10,457
10
76
Hyper-Q (with CUDA) does support concurrent graphics + compute[1], but it's not compatible with DX12. Apparently it may have something to do with resource barriers and the differences between CUDA and the DX12 equivalent.

There's no proof which suggests it is capable of concurrent graphics + compute. Hyper-Q is a CUDA implementation, designed to handle 32 compute queues in parallel running through the single-engine design of Kepler & Maxwell.

This isn't a limitation specific to DX12; it's also a limitation in Vulkan.

Once more info on Pascal is available, if it's got multiple engines, such as ACEs, we will know for sure that it's hardware capable.
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
There's no proof which suggests it is capable of concurrent graphics + compute. Hyper-Q is a CUDA implementation, designed to handle 32 compute queues in parallel running through the single-engine design of Kepler & Maxwell.

Isn't PhysX an example of it though? You may very well be right; I don't have experience programming GPUs, so I can't say firsthand whether it's true or not. I was under the impression that PhysX basically acted like async compute using CUDA, without needing the expensive context switches (because of Hyper-Q, which I believe can be configured for 1 graphics + 32 compute).

In any case, the discussion is sort of moot because of the design of the architecture. You probably won't see the kind of gains you see on the AMD side even if the expensive context switch weren't required (it likely wouldn't cause the performance reduction we've seen, it just wouldn't be as much of a positive either). I would imagine their next-gen arch will be async compute capable in hardware, but that's still speculation I suppose (although I would say you can put your money on it).
 
Feb 19, 2009
10,457
10
76
Isn't PhysX an example of it though?

No, because GPU PhysX has always had a reputation for tanking frame rates. We can see it in old games as well as recent ones. I remember my 670's frame rate dropping to a slideshow during combat in Borderlands due to GPU PhysX effects.

Parallel execution would imply a less severe performance drop. :)
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
No, because GPU PhysX has always had a reputation for tanking frame rates. We can see it in old games as well as recent ones. I remember my 670's frame rate dropping to a slideshow during combat in Borderlands due to GPU PhysX effects.

Parallel execution would imply a less severe performance drop. :)

"Tanking frame rate" doesn't disprove the point I was trying to make though. Also your example GPU wouldn't be able to do it in any case since it wouldn't have Hyper-Q (only GK110 and Maxwell2 have it I believe). The only point I was trying to make was that it avoids the costly context switch because it's using CUDA and it actually supports 1 graphics + 32 compute queues. You may very well still see a performance decrease from using GPU PhysX, I'm not arguing that.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
PhysX is NOT an example of async compute ...

CUDA does NOT give Nvidia "async compute", their hardware MUST provide the dedicated compute engines to be able to interleave commands from both the graphics and compute queue ...

End of story ...
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
PhysX is NOT an example of async compute ...

CUDA does NOT give Nvidia "async compute", their hardware MUST provide the dedicated compute engines to be able to interleave commands from both the graphics and compute queue ...

End of story ...

What is this[1] referring to then:
With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware, as you would expect.

I admit that it sounds like it's rather coarse-grained, and you would likely not have the control that you would get with DX12 async compute. Unless I am somehow misunderstanding what's happening with respect to CUDA execution on the relevant hardware...

[1] http://ext3h.makegames.de/DX12_Compute.html (from the Nvidia section)
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
What is this[1] referring to then:


I admit that it sounds like it's rather coarse-grained, and you would likely not have the control that you would get with DX12 async compute. Unless I am somehow misunderstanding what's happening with respect to CUDA execution on the relevant hardware...

[1] http://ext3h.makegames.de/DX12_Compute.html (from the Nvidia section)
You're right.

CUDA applications support asynchronous compute via Hyper-Q, and since PhysX is CUDA based, it supports asynchronous compute + graphics on Maxwell (GM20x).

Hyper-Q isn't compatible with DX12 barriers and/or fences (we're not sure which), which is why GM20x doesn't support async compute + graphics under DX12.

Hyper-Q bypasses the Command Processor in GM20x and is handled by a dedicated ARM processor on the GM20x die. This dedicated ARM processor can feed both 3D jobs and compute jobs concurrently and in parallel to GM20x.

It is more than likely that NVIDIA was caught out by a minor DX12 API spec detail.

As for context switches, they occur in two stages.
1. During the execution of workloads.
2. Within the SMMs themselves.

The first can be alleviated by the use of Hyper-Q in CUDA applications, but the second cannot, due to the shared L1 texture/compute caches within an SMM. Basically, an SMM cannot perform both compute and texture jobs at once because of the shared logic. A full flush is required to switch from one context to another within an SMM.
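
For anyone poking at this themselves: the number of hardware connections a CUDA process gets from Hyper-Q is tunable through the documented CUDA_DEVICE_MAX_CONNECTIONS environment variable, and the runtime exposes a couple of concurrency-related device properties. A quick sketch (standard runtime API; POSIX setenv assumed):

Code:
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Hyper-Q exposes up to 32 hardware connections on GK110 and later,
    // but the runtime defaults to fewer. This must be set before the
    // first CUDA call to take effect (use _putenv_s on Windows).
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // concurrentKernels only reports compute/compute concurrency; it
    // says nothing about graphics + compute, the sticking point in DX12.
    printf("%s: concurrentKernels=%d, asyncEngineCount=%d\n",
           prop.name, prop.concurrentKernels, prop.asyncEngineCount);
    return 0;
}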
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
CUDA applications support asynchronous compute via Hyper-Q, and since PhysX is CUDA based, it supports asynchronous compute + graphics on Maxwell (GM20x).

Isn't it supported on the big Kepler chips (GK110) as well? Or did they not yet support simultaneous graphics + compute via Hyper-Q?
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Isn't it supported on the big Kepler chips (GK110) as well? Or did they not yet support simultaneous graphics + compute via Hyper-Q?
On Kepler, only graphics or compute under Hyper-Q. The two could not be done together.
 
Feb 19, 2009
10,457
10
76
So NV added an ARM core to act as an enhanced Hyper-Q for Maxwell 2?

They removed the hardware scheduler and then added an ARM core to do some of that task for CUDA... hmm.

Interesting, have to read more into it.
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
On Kepler, only graphics or compute under Hyper-Q. The two could not be done together.

That's what I thought. I just couldn't find the documentation that stated it, although I seem to remember something about how Kepler was either 1 graphics or 32 compute and Maxwell was 1 graphics + 32 compute, with a comparison between the two. Of course, I can't find it now that I want it.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
So NV added an ARM core to act as an enhanced Hyper-Q for Maxwell 2?

They removed the hardware scheduler and then added an ARM core to do some of that task for CUDA... hmm.

Interesting, have to read more into it.
Good luck finding documentation on it

Folks at Beyond3D had to play around with Hyper-Q in order to deduce this.
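
For the curious, the probe is conceptually simple: issue one small fixed-duration kernel per stream and time the whole batch. If the total is about one kernel's duration, the queues overlapped; if it scales with the stream count, they serialized. A rough compute-only sketch of that idea (my own, not Beyond3D's actual code; spin is a made-up kernel):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Spin for a fixed number of GPU clock cycles so each launch has a
// predictable duration regardless of memory traffic.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int NSTREAMS = 8;
    const long long CYCLES = 1 << 22;

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    for (int i = 0; i < NSTREAMS; ++i)
        spin<<<1, 1, 0, streams[i]>>>(CYCLES);  // one tiny kernel per stream
    cudaEventRecord(t1);  // legacy default stream: completes after all streams
    cudaEventSynchronize(t1);

    // ~1 kernel's time => queues overlapped; ~NSTREAMS x => serialized.
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%d one-kernel streams took %.2f ms total\n", NSTREAMS, ms);

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}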
 
Feb 19, 2009
10,457
10
76
Good luck finding documentation on it

Folks at Beyond3D had to play around with Hyper-Q in order to deduce this.

Lol you're right, there's nothing official on the uarch that goes deep. -_- Secret sauce and all, squeezing DX12/Vulkan Async Compute support out of nothing.

If the ARM core is offloading and handling compute in parallel, potentially GPU PhysX (CUDA) can be used on Maxwell 2, with minimal perf drops compared to Kepler. In theory, yes?

I also wonder why they did it: CUDA for HPC isn't graphics based, so there would be no need for the feature there; it's useless in DX11 and unused in pure CUDA, so the only common use case is games with GPU PhysX.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
The only use I can see for this is medical-rendering-type jobs... On the consumer end, only GPU PhysX, which is all but dead.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
Lol you're right, there's nothing official on the uarch that goes deep. -_- Secret sauce and all, squeezing DX12/Vulkan Async Compute support out of nothing.

If the ARM core is offloading and handling compute in parallel, potentially GPU PhysX (CUDA) can be used on Maxwell 2, with minimal perf drops compared to Kepler. In theory, yes?

I also wonder why they did it: CUDA for HPC isn't graphics based, so there would be no need for the feature there; it's useless in DX11 and unused in pure CUDA, so the only common use case is games with GPU PhysX.
A test bed, possibly, to eventually unify some of their mobile CPUs with their GPUs? (Kinda stupid, but you never know with that company anymore.)
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Hyper-Q isn't even comparable to DX12 async compute ...

It's meant for running multiple compute queues in parallel, not running graphics and compute queues in parallel ...

If Nvidia truly had the ability to do async compute they would have already released drivers to take advantage of this model ...
 

Madpacket

Platinum Member
Nov 15, 2005
2,068
326
126
Sadly, none of this matters. When AMD defined how asynchronous compute would work with Mantle, it was game over for Nvidia, because the bits of Mantle that mattered evolved into Vulkan / DX12 (the only APIs that really matter). AMD has clearly gamed Nvidia, and it's paying off big time. Doom is just another example of this.

No game developer cares how Nvidia implemented hardware async. If their implementation doesn't conform to the standard (set by AMD) and can't be called from a standard library, it simply won't be used.

Pascal better have a correct implementation ready.
 

Piroko

Senior member
Jan 10, 2013
905
79
91
Hyper-Q bypasses the Command Processor in GM20x and is handled by a dedicated ARM processor on the GM20x die. This dedicated ARM processor can feed both 3D jobs and compute jobs concurrently and in parallel to GM20x.

It is more than likely that NVIDIA was caught out by a minor DX12 API spec detail.
Why did TruForm just pop into my brain after I read this?
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
Hyper-Q isn't even comparable to DX12 async compute ...

It's meant for running multiple compute queues in parallel, not running graphics and compute queues in parallel ...

If Nvidia truly had the ability to do async compute they would have already released drivers to take advantage of this model ...

You're probably right about the original intention of Hyper-Q being a compute-only thing. Nevertheless, it can be used in a 1 graphics + 32 compute mode on Maxwell 2. As has been previously mentioned, there's likely an incompatibility with DX12 async with respect to barriers or fences, so you're probably right that true hardware async capability may not be possible on Maxwell.

It all comes down to the tradeoffs Nvidia made with Kepler, and later Maxwell, while stuck at 28nm for as long as we have been. They chose to favor power efficiency over more advanced compute capabilities. I don't necessarily think it's a knock on Nvidia, but it does leave something to be desired in the coming DX12 era.
 
Feb 19, 2009
10,457
10
76
Sadly, none of this matters. When AMD defined how asynchronous compute would work with Mantle, it was game over for Nvidia, because the bits of Mantle that mattered evolved into Vulkan / DX12 (the only APIs that really matter). AMD has clearly gamed Nvidia, and it's paying off big time. Doom is just another example of this.

No game developer cares how Nvidia implemented hardware async. If their implementation doesn't conform to the standard (set by AMD) and can't be called from a standard library, it simply won't be used.

Pascal better have a correct implementation ready.

I admit it was a longer chess game.

But it's not working out so well for them: record losses and low market share.

Rather than "Pascal had better live up to our standards," it should be "Polaris had better knock it out of the park, and Zen had better be reality and not just hype."
 

Genx87

Lifer
Apr 8, 2002
41,091
513
126
You're probably right about the original intention of Hyper-Q being a compute-only thing. Nevertheless, it can be used in a 1 graphics + 32 compute mode on Maxwell 2. As has been previously mentioned, there's likely an incompatibility with DX12 async with respect to barriers or fences, so you're probably right that true hardware async capability may not be possible on Maxwell.

It all comes down to the tradeoffs Nvidia made with Kepler, and later Maxwell, while stuck at 28nm for as long as we have been. They chose to favor power efficiency over more advanced compute capabilities. I don't necessarily think it's a knock on Nvidia, but it does leave something to be desired in the coming DX12 era.

I suspect Pascal will have DX12-compatible hardware async. I think you're right: if Maxwell was to have hardware async, they had to make a game-time decision based on the process situation. Does Nvidia spend valuable silicon on a tweener DX11/DX12 card, or wait until the next die shrink with Pascal?
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
I suspect Pascal will have DX12-compatible hardware async. I think you're right: if Maxwell was to have hardware async, they had to make a game-time decision based on the process situation. Does Nvidia spend valuable silicon on a tweener DX11/DX12 card, or wait until the next die shrink with Pascal?

Or, if we follow nVidia's history, they purposely leave features out of cards to make them obsolete sooner, making their newer cards look better. Kepler is a perfect example of this. Meanwhile, GCN 1.0 cards are still performing very well in the latest games because AMD designed them for the future.
 

Madpacket

Platinum Member
Nov 15, 2005
2,068
326
126
I admit it was a longer chess game.

But it's not working out so well for them: record losses and low market share.

Rather than "Pascal had better live up to our standards," it should be "Polaris had better knock it out of the park, and Zen had better be reality and not just hype."

AMD has recently clawed back some market share from Nvidia in discrete graphics. This is an important shift. Polaris already looks great, and given what's been publicly demonstrated I have little doubt it will live up to expectations. The maturation of console development, with Mark Cerny's PS4 design (8 async compute units), is a big reason for the tide changing. Nvidia is starting to really lose mindshare due to their short-sighted approach with Kepler and Maxwell (take off the blinders if you don't see it), which is why I say they had better be ready with proper DX12 async support in Pascal (and they likely will be).

It takes a while for the general population or the average gamer to realize what we discuss here on the AnandTech forums, but not as long as it used to (thanks to social media, Reddit, etc.). I fully expect AMD to claw back even more market share by next quarter, as long as the trend of new releases running better on Radeon continues (even if it takes them a week or two to fix it).

As for Zen, I agree with you there; AMD has a ton to prove and needs to get it right, or they'll be in serious trouble. I think this is one of the reasons why Raja Koduri rebranded their graphics division as RTG.