Ashes of the Singularity User Benchmarks Thread

TheELF · Sep 2, 2015

Erenhardt said:
Also, do we know what is the 'compute task' we are dealing here with? It may very well be something that is dependent on the architecture/memory/cache capacity/speed/latency and have no relevance in real world performance.

Do we know if the async computing they use in ashes will have any relevance in real world performance?
Meaning any other game except for ashes.

Flapdrol1337 · Sep 2, 2015

Zstream said:
Your definition of faster is interesting.

I was just looking at the quoted numbers.

but isn't 20ms a shorter time than 6.79ms + 16.21ms, making it faster?

Keysplayr · Sep 2, 2015

AtenRa said:
I though they had 80% of Gaming market, how come they didnt have win 10 access before August 2015 ??

Show me in this thread where anybody isn't talking about dGPUs as the primary focus of the discussion. Unless I missed it, any and all charts posted in this thread are dGPUs only, which is where Nvidia DOES have 80% market share.
Not the entire Gaming market. Nobody is showing charts for APUs or IGPs or Consoles in this thread. So your eyeroll is sorely misguided.

Erenhardt · Sep 2, 2015

sontin said:
http://developer.amd.com/community/blog/2015/06/05/concurrency-in-modern-3d-graphics-apis/

Would make it hard to use more queues.

It's not that simple and you're purposefully ignoring provided explanation, deflect and derail the discussion for the sole reason of trolling.

http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1710#post_24368195

Aside from the Async stuff...

Here's what I think they did at Beyond3D:
They set the amount of threads, per kernel, to 32 (they're CUDA programmers after-all).
They've bumped the Kernel count to up to 512 (16,384 Threads total).
They're scratching their heads wondering why the results don't make sense when comparing GCN to Maxwell 2

Here's why that's not how you code for GCN

Why?:
Each CU can have 40 Kernels in flight (each made up of 64 threads to form a single Wavefront).
That's 2,560 Threads total PER CU.
An R9 290x has 44 CUs or the capacity to handle 112,640 Threads total.

If you load up GCN with Kernels made up of 32 Threads you're wasting resources. If you're not pushing GCN you're wasting compute potential. In slide number 4, it stipulates that latency is hidden by executing overlapping wavefronts. This is why GCN appears to have a high degree of latency but you can execute a ton of work on GCN without affected the latency. With Maxwell/2, latency rises up like a staircase with the more work you throw at it. I'm not sure if the folks at Beyond3D are aware of this or not.

Conclusion:

I think they geared this test towards nVIDIAs CUDA architectures and are wondering why their results don't make sense on GCN. If true... DERP! That's why I said the single Latency results don't matter. This test is only good if you're checking on Async functionality.

GCN was built for Parallelism, not serial workloads like nVIDIAs architectures. This is why you don't see GCN taking a hit with 512 Kernels.

What did Oxide do? They built two paths. One with Shaders Optimized for CUDA and the other with Shaders Optimized for GCN. On top of that GCN has Async working. Therefore it is not hard to determine why GCN performs so well in Oxide's engine. It's a better architecture if you push it and code for it. If you're only using light compute work, nVIDIAs architectures will be superior.

This means that the burden is on developers to ensure they're optimizing for both. In the past, this hasn't been the case. Going forward... I hope they do. As for GameWorks titles, don't count them being optimized for GCN. That's a given. Oxide played fair, others... might not.

Erenhardt · Sep 2, 2015

TheELF said:
Do we know if the async computing they use in ashes will have any relevance in real world performance?
Meaning any other game except for ashes.

LOL what?!

I mean, there is some logic in here, because ashes is Sci-Fi kinda themed. Nor a real world sandbox simulation type of game... :awe:

Jubassi · Sep 2, 2015

Awesome, thanks this helped alot! DX12 gain/loss vs DX11 were interesting

Silverforce11 · Sep 2, 2015

Flapdrol1337 said:
But it is a bit faster.

You can't do graphics + compute without any penalty to graphics. It'll at least use extra power, focing the clocks down.

The only thing that matters is the total performance.

You missed the latter part of my post, go here:

Credits to Nub: http://nubleh.github.io/async/#36

That charts out all the results so far. Move your mouse across and look at the times for the graphics or compute individually, then combined async compute, and also time saved. Notice all the Maxwell GPUs doesn't get much faster in async mode than the sum of the individual tasks.

There's a +/- 4ms for each kernel but over the kernals, its basically a sum of graphics & compute for Maxwell trying to operate in "Async Mode". It shouldn't be a sum.

Man it's so simple maths...

Either the app to test it is not doing it right, or Kepler & Maxwell cannot do Async Compute.

monstercameron · Sep 2, 2015

Keysplayr said:
Show me in this thread where anybody isn't talking about dGPUs as the primary focus of the discussion. Unless I missed it, any and all charts posted in this thread are dGPUs only, which is where Nvidia DOES have 80% market share.
Not the entire Gaming market. Nobody is showing charts for APUs or IGPs or Consoles in this thread. So your eyeroll is sorely misguided.

where does nvidia have 80% market share? I thought that was just shipping dgpus in the previous quarter?

VR Enthusiast · Sep 2, 2015

TheELF said:
Do we know if the async computing they use in ashes will have any relevance in real world performance?
Meaning any other game except for ashes.

Yes the Oxide developer said that 30%+ gains have been seen in consoles.

The dev Sebbbi on Beyond3D said gains above 30% aren't difficult.

Keysplayr said:
Show me in this thread where anybody isn't talking about dGPUs as the primary focus of the discussion. Unless I missed it, any and all charts posted in this thread are dGPUs only, which is where Nvidia DOES have 80% market share.
Not the entire Gaming market. Nobody is showing charts for APUs or IGPs or Consoles in this thread. So your eyeroll is sorely misguided.

Do you have a breakdown of Nvidia's sales by graphics card? I'm sure that most sales are slow pre-Maxwell cards that are sold in OEM machines from Dell and HP but still count as a discrete card. So even if Nvidia has 80% discrete market share that number means nothing if 80% of the cards they sell are not faster than APUs.

Silverforce11 · Sep 2, 2015

Oh yeah, there's a few DX12 titles coming (around early 2016), as to whether they use Async Compute or not, you can do your own research:

http://www.vcpost.com/articles/8717...ion-technologies-glass-city.htm#ixzz3kSkxnueB

http://gearnuke.com/rise-of-the-tom...breathtaking-volumetric-lighting-on-xbox-one/

http://gearnuke.com/deus-ex-mankind-divided-use-async-compute-enhance-pure-hair-simulation/

https://www.youtube.com/watch?v=D-epev7cT30

https://youtu.be/7MEgJLvoP2U?t=20m47s

I'm looking forward to Mirror's Edge and Deus Ex for sure, both apparently are due in Feb 2016.

sontin · Sep 2, 2015

Silverforce11 said:
Man it's so simple maths...

Either the app to test it is not doing it right, or Kepler & Maxwell cannot do Async Compute.

You will just ignore the fact that nVidia can launch different kernels in one compute queue in opposite to direct queue?

Have you read what Microsoft wrote about Async Compute?

TheELF · Sep 2, 2015

Erenhardt said:
LOL what?!

I mean, there is some logic in here, because ashes is Sci-Fi kinda themed. Nor a real world sandbox simulation type of game... :awe:

No, because it is a strategy game with an insane amount of units,the only way to get the most out of the parallelism of the amd cards is to use insane amounts of units so that all the ACEs get used.
The only type of game where you can do this is in strategy games like here where you have a lot of small units that follow larger units up to the units the player actually controls.

The only other way would be to raise polygon count by making more detailed models of everything and no dev is gonna go, hey lets make development cost sky rocket by doing much much more work so that the 1% of market share that has killer rigs can be happy about something.

Silverforce11 · Sep 2, 2015

It can certainly launch, but it can't finish them asynchronously in parallel. According to the program at b3d that is.

It's a simple concept.

https://forum.beyond3d.com/posts/1869561/

2 Cars are on the road, let's call them Car 1 (Compute) and Car 2 (Graphics). Both cars are trying to go from A - > B.

The time it takes for Car 1 to travel the journey is 1 hour. The time it takes for Car 2 to travel the journey is 2 hours.

The question is, how long does it take for both Cars to reach destination B?

1. Both Cars can travel on the road together, simultaneously, starting at the same time: 2 hours.
2. Only ONE Car can be on the road at once, so Car 1 goes first (order doesn't matter), finishes, then Car 2 starts. Thus, both Cars reach their destination in: 3 hours.

Minor variations aside, that should be the expected behavior, correct? #1 would therefore be Async Mode, and #2 is not.

There's the other potential issue of software emulation (Which Carfax's Oxide link brought up), CPU usage spiking for NV GPUs in "Async Mode" and GPU usage drops to 0%.

So there MIGHT be a slight Async Compute going on, a minor acceleration via CPU processing, but averaging out over all the kernels, the gains are miniscule and the resulting time is still ~the sum of compute & graphics only.

This goes back to what NV say about their uarch.
https://developer.nvidia.com/sites/...works/vr/GameWorks_VR_2015_Final_handouts.pdf (p31)

All our GPUs for the last several years do context switches at draw call boundaries. So when the GPU wants to switch contexts, it has to wait for the current draw call to finish first. So, even with timewarp being on a high-priority context, it’s possible for it to get stuck behind a longrunning draw call on a normal context. For instance, if your game submits a single draw call
that happens to take 5 ms, then async timewarp might get stuck behind it, potentially causing it to miss vsync and cause a visible hitch.

Why can't NV GPUs processing Async Timewarp (Async Compute feature to reduce latency!) in parallel? Why does it have to wait for a current draw call in the pipe to finish first?

TheELF · Sep 2, 2015

VR Enthusiast said:
Yes the Oxide developer said that 30%+ gains have been seen in consoles.

The dev Sebbbi on Beyond3D said gains above 30% aren't difficult.

Consoles are very different from desktops as far as cpu to gpu ratio goes.
Getting 30% more FPS out of a very weak desktop CPU with a very big GPU,yes I can totally see that happening with well written dx12 games.

sontin · Sep 2, 2015

It doesnt need to get executed in parallel because this is not required by the API.

However it gets executed by using a graphics and compute queue. And this is the only thing which matters with Asynchronous Compute.

TheELF · Sep 2, 2015

Silverforce11 said:
It can certainly launch, but it can't finish them asynchronously in parallel. According to the program at b3d that is.

It's a simple concept.

Yeah now imagine (dx11 numbers) that the nvidia cars are twice as fast as the amd cars...
instead of 1+2=2 due to async now you have 0.5+1 =1.5 even if async doesn't work at all.

Silverforce11 · Sep 2, 2015

sontin said:
It doesnt need to get executed in parallel because this is not required by the API.

However it gets executed by using a graphics and compute queue. And this is the only thing which matters with Asynchronous Compute.

If it's not executed in parallel, it means its executed in serial, which matches the data where the graphics + compute task takes a time to completion that is ~sum of the tasks separately/serial.

The entire point of Async Compute is to execute compute tasks in parallel so that they do not stall graphics rendering.

^^ This is serial execution, compute will delay graphics and graphics will delay compute. This is what NV is referring to in their GameWorks VR PDF, Async Timewarp can get stuck in the pipe, waiting for the current draw call to finish before its executed.

^^ This is the point of Async Compute. To break down tasks that were serial and do it in parallel leading to a nice performance gains.

Silverforce11 · Sep 2, 2015

TheELF said:
Yeah now imagine (dx11 numbers) that the nvidia cars are twice as fast as the amd cars...
instead of 1+2=2 due to async now you have 0.5+1 =1.5 even if async doesn't work at all.

Sure if you want to go there and use it as a benchmark, you go for it champ!

Cos 50ms latency for compute from that program for GCN is 20 fps. Therefore *any compute usage* at all, even non async, will mean no game can exceed 20 fps. Since I know for a FACT that DirectCompute DX11 games I've played runs faster than 20fps (and Ashes benchmark), it means that program is not to be used as a benchmark for "performance".

It's purpose is to test for presence/absence of Async Compute capabilities.

sontin · Sep 2, 2015

Silverforce11 said:
If it's not executed in parallel, it means its executed in serial, which matches the data where the graphics + compute task takes a time to completion that is ~sum of the tasks separately/serial.

It doesnt matter. What matters is the time the workload needs to get executed.

The entire point of Async Compute is to execute compute tasks in parallel so that they do not stall graphics rendering.

^^ This is the point of Async Compute. To break down tasks that were serial and do it in parallel leading to a nice performance gains.

So, you havent read what Microsoft wrote? :|
The purpose of Aync Compute is the knowledge of different engine types of the GPU and to submit different work within one queue. How these engines work is up to the hardware and is not definied by Microsoft.

Silverforce11 · Sep 2, 2015

It's good to see you agree Maxwell can't do Async Compute, that it cannot process graphics & compute in parallel.

At least its faster than GCN in that program, right?

sontin · Sep 2, 2015

nVidia supports Asynchronous Compute. Otherwise it wouldnt be able to use a compute queue instead of a graphics queue.

Pls, start with reading what Microsoft definies as "Asynchronous Compute":
https://msdn.microsoft.com/en-us/library/windows/desktop/dn899217(v=vs.85).aspx

Silverforce11 · Sep 2, 2015

Yup, it "supports" Async Compute but it can't do it in parallel*, which is why Oxide disabled that feature at the request of NV.

What's the beneficial purpose of Async Compute if you CANNOT execute graphics + compute in parallel?

If you can't do it simultaneously, just stick to DX11's serial execution..

* According to b3d program results, we still don't actually know if its valid.

This is MS's document on: Async Compute
https://msdn.microsoft.com/en-us/li...spx#asynchronous_compute_and_graphics_example

Asynchronous compute and graphics example

This next example allows graphics to render asynchronously from the compute queue. There is still a fixed amount of buffered data between the two stages, however now graphics work proceeds independently and uses the most up-to-date result of the compute stage as known on the CPU when the graphics work is queued. This would be useful if the graphics work was being updated by another source, for example user input. There must be multiple command lists to allow the ComputeGraphicsLatency frames of graphics work to be in flight at a time, and the function UpdateGraphicsCommandList represents updating the command list to include the most recent input data and read from the compute data from the appropriate buffer.

The compute queue must still wait for the graphics queue to finish with the pipe buffers, but a third fence (pGraphicsComputeFence) is introduced so that the progress of graphics reading compute work versus graphics progress in general can be tracked. This reflects the fact that now consecutive graphics frames could read from the same compute result or could skip a compute result. A more efficient but slightly more complicated design would use just the single graphics fence and store a mapping to the compute frames used by each graphics frame.
-
-
Although resource state is shared across all Compute and 3D queues, it is not generally permitted to write to the resource simultaneously on different queues. (Simultaneously here means unsynchronized, although simultaneous execution is not possible on some hardware.)

sontin · Sep 2, 2015

Silverforce11 said:
Yup, it "supports" Async Compute but it can't do it in parallel*, which is why Oxide disabled that feature at the request of NV.

What's the beneficial purpose of Async Compute if you CANNOT execute graphics + compute in parallel?

If you can't do it simultaneously, just stick to DX11's serial execution..

So, asynchronous computing of compute workload is not "Asynchronous Computing"? Or using the copy and graphics engine? Or let the graphics queue wait for a result from the compute queue?

Asynchronous Compute is more than doing graphics and compute at the same time.

Zstream · Sep 2, 2015

sontin said:
nVidia supports Asynchronous Compute. Otherwise it wouldnt be able to use a compute queue instead of a graphics queue.

Pls, start with reading what Microsoft definies as "Asynchronous Compute":
https://msdn.microsoft.com/en-us/library/windows/desktop/dn899217(v=vs.85).aspx

Don't be obtuse. Yes we all know that most recent graphics card can do asynchronous compute but that's not the point. It's called parallelism and Nvidia cannot do this. Who gives a hoot if Nvidia can't, just complain and figure out why. I means, if it can't who cares, nothing you're going to do about it but expose the lack of functionally. It's better to know than defend.

dogen1 · Sep 2, 2015

TheELF said:
Consoles are very different from desktops as far as cpu to gpu ratio goes.
Getting 30% more FPS out of a very weak desktop CPU with a very big GPU,yes I can totally see that happening with well written dx12 games.

Not sure how the CPU is related at all. This is about using idle gpu resources.

TheELF said:
Yeah now imagine (dx11 numbers) that the nvidia cars are twice as fast as the amd cars...
instead of 1+2=2 due to async now you have 0.5+1 =1.5 even if async doesn't work at all.

Clearly something is wrong with the program though. If AMD cards had a 20ms minimum processing time for every compute task they would never exceed 50 frames per second in games.

Ashes of the Singularity User Benchmarks Thread

Diamond Member

Golden Member

Elite Member

Diamond Member

Diamond Member

Member

Lifer

Diamond Member

Member

Lifer

Diamond Member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Lifer

Lifer

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Senior member