Ashes of the Singularity User Benchmarks Thread

Page 26

tg2708

Senior member
May 23, 2013
687
20
81
I think you're misunderstanding. Asynchronous compute lets you use compute cores that are idle because they're waiting on other parts of the pipeline to finish. Asynchronous compute is a good thing, obviously, but a perfect system would not have idle time that you needed to specifically target.

I don't think he's actually trying to advertise Intel hardware though, I think it's more of an observation on all current GPUs.

And intel does actually make some pretty good GPUs now, they're just heavily limited by power and what they can actually cram in the die.

Help me understand this a bit more clearly: if Nvidia GPUs don't suffer from idle time in DX12, why isn't their performance greater than AMD's? Are Nvidia GPUs receiving so much information to calculate and/or decouple that they can no longer keep up, and that's why we're seeing the performance deficiencies?
 

Magee_MC

Senior member
Jan 18, 2010
217
13
81
Help me understand this a bit more clearly: if Nvidia GPUs don't suffer from idle time in DX12, why isn't their performance greater than AMD's? Are Nvidia GPUs receiving so much information to calculate and/or decouple that they can no longer keep up, and that's why we're seeing the performance deficiencies?

From what I understand, NV GPUs have less idle time than AMD GPUs. This is why NV shines in DX11: they have a more efficient pathway and are using much more of their hardware. However, while NV GPUs have less idle time, there is still some that async compute could use to increase GPU efficiency.

NV's current implementation of async compute carries such a performance hit that not only does it not increase performance, it actually hinders it. This may be why NV's AOTS benchmarks in DX12 are lower than their DX11 numbers.

AMD, on the other hand, has more idle time and performs async compute more efficiently because of the ACEs, so using AC increases performance significantly. Which may be why GCN performance in DX12 improves so much over DX11 in this bench.
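
For anyone who isn't sure what "async compute" actually means at the API level, here's a minimal D3D12 sketch (my own illustration, not anything from Oxide or a vendor): you create a separate compute queue next to the normal direct/graphics queue and submit work to both. Whether the GPU actually overlaps the two streams is entirely up to the hardware and driver, which is exactly what's being argued about here.

[CODE]
// Minimal sketch: create a graphics (direct) queue and a separate compute queue.
// Assumes an existing ID3D12Device* 'device'; error handling omitted for brevity.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct Queues {
    ComPtr<ID3D12CommandQueue> graphics;
    ComPtr<ID3D12CommandQueue> compute;
};

Queues CreateQueues(ID3D12Device* device)
{
    Queues q;

    // The "direct" queue accepts graphics, compute and copy work.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&q.graphics));

    // A separate compute-only queue; work submitted here *may* be scheduled
    // alongside graphics work if the hardware supports concurrent execution.
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&q.compute));

    return q;
}
[/CODE]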
 

dogen1

Senior member
Oct 14, 2014
739
40
91
From what I understand, NV GPUs have less idle time than AMD GPUs. This is why NV shines in DX11: they have a more efficient pathway and are using much more of their hardware. However, while NV GPUs have less idle time, there is still some that async compute could use to increase GPU efficiency.

NV's current implementation of async compute carries such a performance hit that not only does it not increase performance, it actually hinders it. This may be why NV's AOTS benchmarks in DX12 are lower than their DX11 numbers.

AMD, on the other hand, has more idle time and performs async compute more efficiently because of the ACEs, so using AC increases performance significantly. Which may be why GCN performance in DX12 improves so much over DX11 in this bench.

Are you sure that's not just a lot of conjecture?

Oxide said they only use a modest amount of asynchronous compute. Most of AMD's performance increase over DX11 is likely related to driver/API overhead.
 
Last edited:

Magee_MC

Senior member
Jan 18, 2010
217
13
81
Are you sure that's not just a lot of conjecture?

Oxide said they only use a modest amount of asynchronous compute. Most of AMD's performance increase over DX11 is likely related to driver/API overhead.

No, I'm not sure, and it is conjecture on my part. That's why I said it may be the reason for the AMD/NV differences in DX11/12. You're probably right that some of it is due to AMD's larger driver/API overhead in DX11. How much it actually is, we have no way of telling unless or until we can bench AMD with and without async compute in DX12. However, it does seem to fit the data that has been presented on the differences between the architectures.
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
From what I understand, NV GPUs have less idle time than AMD GPUs. This is why NV shines in DX11: they have a more efficient pathway and are using much more of their hardware. However, while NV GPUs have less idle time, there is still some that async compute could use to increase GPU efficiency.

NV's current implementation of async compute carries such a performance hit that not only does it not increase performance, it actually hinders it. This may be why NV's AOTS benchmarks in DX12 are lower than their DX11 numbers.

AMD, on the other hand, has more idle time and performs async compute more efficiently because of the ACEs, so using AC increases performance significantly. Which may be why GCN performance in DX12 improves so much over DX11 in this bench.

I am not sure your theory holds, because it runs counter to exactly what GCN was designed for -- minimizing idle time. Are you assuming that because some units in GCN are underutilized under the older DX11 API, AMD's GPUs are more idle as a result? Sure, if the API isn't exposing the full benefits of the GCN architecture, the ACE engines will sit idle, but with DX12 that could change -- at least that's what Oxide is telling us, because they got a free performance boost utilizing those ACEs. It's also not as simple as keeping the GPU occupied, since some tasks should have higher priority than others, but if you prioritize a task there is a big performance hit from context-switching overhead.


"In short, here's the thing, everybody expected NVIDIA Maxwell architecture to have full DX12 support, as it now turns out, that is not the case. AMD offers support on their Fury and Hawaii/Grenada/Tonga (GCN 1.2) architecture for DX12 asynchronous compute shaders. The rather startling news is that Nvidia's Maxwell architecture, and yeah that would be the entire 900 range does not support it."
~ Guru3D

I think AMD's DX11 performance has long been traced to the draw-call handling of the DX11 API and their driver, rather than to a lack of utilization of the ACE engines in DX11 games. This draw-call bottleneck is fixed with DX12, but AC is something totally different: it's about using the existing GPU resources more effectively (i.e., the way we moved from typical shaders to DirectCompute shaders). So rather than addressing a specific bottleneck, which was the case with draw calls in the DX11 API, DX12 and other lower-level APIs let the programmer speed up certain tasks by taking advantage of specific hardware features, like the asynchronous shaders/compute engines already present in the hardware.

I think the context is totally different. When AMD presented GCN to the world, there was no DX12, Mantle, or Vulkan. Eric Demers and the architects behind GCN explained why they built GCN with ACE engines and so on.

Very good AT article on GCN architecture:

"Now GCN is not an out-of-order architecture; within a wavefront the instructions must still be executed in order, so you can’t jump through a pixel shader program for example and execute different parts of it at once. However the CU and SIMDs can select a different wavefront to work on; this can be another wavefront spawned by the same task (e.g. a different group of pixels/values) or it can be a wavefront from a different task entirely.

Meanwhile on the compute side, AMD’s new Asynchronous Compute Engines serve as the command processors for compute operations on GCN. The principal purpose of ACEs will be to accept work and to dispatch it off to the CUs for processing. As GCN is designed to concurrently work on several tasks, there can be multiple ACEs on a GPU, with the ACEs deciding on resource allocation, context switching, and task priority.

One effect of having the ACEs is that GCN has a limited ability to execute tasks out of order. As we mentioned previously GCN is an in-order architecture, and the instruction stream on a wavefront cannot be reordered. However the ACEs can prioritize and reprioritize tasks, allowing tasks to be completed in a different order than they’re received. This allows GCN to free up the resources those tasks were using as early as possible rather than having the task consuming resources for an extended period of time in a nearly-finished state. This is not significantly different from how modern in-order CPUs (Atom, ARM A8, etc) handle multi-tasking."

AMD always meant GCN to be a compute monster; that's why they ditched the old VLIW design, created the ACE+CU architecture, and expanded the ACEs even further from the HD 7970 to the R9 290X. DX12 doesn't look like it has anything to do with this architecture, because the architecture came first and was designed to be forward-looking for future software. It actually looks more like GCN was so far ahead of its time when it came to DirectCompute/asynchronous compute that it took until DX12 (or a specific low-level PS4 API) to actually expose the benefits of the architecture. That's my theory. That's why we're seeing console developers squeeze more performance out of the PS4 and XB1 using AC in games like Uncharted 4, while these benefits of GCN have largely been ignored on the PC.

We did see glimpses of the potential when DirectCompute was used for global illumination in Dirt Showdown, or for SSAA/shadows in games like Hitman: Absolution and Sleeping Dogs, but those were limited use cases.

It's interesting how NV owners defended Fermi for poor compute (deny, deny, deny), and then it was repeated with Kepler. In hindsight, both of those NV architectures bombed at compute, and now we are starting to see cracks in Maxwell, which if true would make it the third generation in a row where NV crippled compute in its architecture. If ACEs/DirectCompute become a key factor in DX12 performance, then bye bye Fermi, Kepler and low-end Maxwell cards. Hard to say for now though, because it'll be a while before we see many DX12 games, by which point many gamers will have upgraded to 16nm HBM2 GPUs.
 
Last edited:

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I am not sure your theory holds, because it runs counter to exactly what GCN was designed for -- minimizing idle time. Are you assuming that because some units in GCN are underutilized under the older DX11 API, AMD's GPUs are more idle as a result? Sure, if the API isn't exposing the full benefits of the GCN architecture, the ACE engines will sit idle, but with DX12 that could change -- at least that's what Oxide is telling us, because they got a free performance boost utilizing those ACEs.

I think you need to look up what ACEs actually are. ACEs don't do any processing; they merely queue compute work for the idle shaders, and can do so in parallel with rendering. So an architecture like GCN, with more idle resources than Maxwell, should theoretically benefit more from async compute shaders.
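
To make "queue compute work in parallel with rendering" concrete, here's a rough D3D12 sketch of one frame (all names are placeholders, error handling omitted): independent graphics and compute work go to their respective queues, and the graphics queue only waits on a fence before the pass that actually consumes the compute output. The wait happens on the GPU timeline, so the CPU never blocks.

[CODE]
// Sketch only: overlap compute with independent graphics work, then make the
// graphics queue wait on a fence before the pass that consumes the results.
// 'gBufferList', 'asyncComputeList' and 'lightingList' are assumed to be
// pre-recorded, closed command lists; 'fence' is created once at startup.
#include <d3d12.h>

void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12Fence*        fence,
                 UINT64              fenceValue,
                 ID3D12CommandList*  gBufferList,       // graphics work with no compute dependency
                 ID3D12CommandList*  asyncComputeList,  // e.g. particle or lighting compute
                 ID3D12CommandList*  lightingList)      // graphics pass that reads the compute output
{
    // These two submissions are independent, so a GPU with spare compute
    // capacity can run them concurrently.
    graphicsQueue->ExecuteCommandLists(1, &gBufferList);
    computeQueue->ExecuteCommandLists(1, &asyncComputeList);
    computeQueue->Signal(fence, fenceValue);

    // GPU-side wait: the graphics queue only stalls here, and only if the
    // compute work hasn't finished yet.
    graphicsQueue->Wait(fence, fenceValue);
    graphicsQueue->ExecuteCommandLists(1, &lightingList);
}
[/CODE]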

And GCN is undoubtedly a less efficient architecture than Maxwell. Fury X has more transistors than the GTX 980 Ti, but is slower and uses more energy even with the benefit of liquid cooling. Same for the 390X and GTX 980.

It's interesting how NV owners defended Fermi for poor compute, deny, deny, deny, and then this was repeated with Kepler.

Fermi was the FIRST GPGPU architecture. NVidia only crippled the GTX 580 when it came to double-precision workloads; Fermi overall was very good at compute. GK104 was really gutted for compute performance, but GK110 was again fairly good, though not as good as GCN 1.1.

At any rate, it's not good policy to design GPUs with future workloads in mind, as you waste die space. Using this tactic, AMD has lost every single time against NVidia when it comes to power/performance.

By the time compute became a big factor for games, NVidia had introduced Maxwell, which has very strong compute performance. And now with ACS, by the time it becomes a big factor, NVidia will have Pascal, which should at the very least equal, if not exceed, GCN 1.2 in that regard.
 

stuff_me_good

Senior member
Nov 2, 2013
206
35
91
I think you're misunderstanding. Asynchronous compute lets you use compute cores that are idle because they're waiting on other parts of the pipeline to finish. Asynchronous compute is a good thing, obviously, but a perfect system would not have idle time that you needed to specifically target.

I don't think he's actually trying to advertise Intel hardware though, I think it's more of an observation on all current GPUs.

And intel does actually make some pretty good GPUs now, they're just heavily limited by power and what they can actually cram in the die.
Funny, because this sounds just like HT for graphics cards. Funny that Intel says that in a perfect system you wouldn't need it, but then why don't they show AMD and Nvidia how to design that perfect system? :rolleyes: Must be because they can't even design a mediocre GPU core compared to the competition.

Besides, why do they still use HT in their CPUs, when a "perfect" system doesn't need it? :rolleyes:
 

dogen1

Senior member
Oct 14, 2014
739
40
91
Funny, because this sounds just like HT for graphics cards. Funny that Intel says that in a perfect system you wouldn't need it, but then why don't they show AMD and Nvidia how to design that perfect system? :rolleyes: Must be because they can't even design a mediocre GPU core compared to the competition.

Are you serious?

I just addressed this exact statement.


Must be because they can't even design a mediocre GPU core compared to the competition.

And intel GPUs are actually quite good. You're basing that on literally nothing, because it's not even true.
 

selni

Senior member
Oct 24, 2013
249
0
41
From what I understand, NV GPUs have less idle time than AMD GPUs. This is why NV shines in DX11: they have a more efficient pathway and are using much more of their hardware. However, while NV GPUs have less idle time, there is still some that async compute could use to increase GPU efficiency.

NV's current implementation of async compute carries such a performance hit that not only does it not increase performance, it actually hinders it. This may be why NV's AOTS benchmarks in DX12 are lower than their DX11 numbers.

AMD, on the other hand, has more idle time and performs async compute more efficiently because of the ACEs, so using AC increases performance significantly. Which may be why GCN performance in DX12 improves so much over DX11 in this bench.

Didn't Oxide say they disabled async compute on Nvidia hardware because "attempting to use it was an unmitigated disaster in terms of performance and conformance"? DX11 beating DX12 on Nvidia hardware still hasn't been satisfactorily explained (the "because DX11 is old and has had tons of time to be optimized" line amounts to saying "NV's DX11 driver is better than what we could do by hand with a low-level API", unless something else is going on).
 

VR Enthusiast

Member
Jul 5, 2015
133
1
0
Didn't Oxide say they disabled async compute on Nvidia hardware because "attempting to use it was an unmitigated disaster in terms of performance and conformance"? DX11 beating DX12 on Nvidia hardware still hasn't been satisfactorily explained (the "because DX11 is old and has had tons of time to be optimized" line amounts to saying "NV's DX11 driver is better than what we could do by hand with a low-level API", unless something else is going on).

Maybe what the Oxide guy said has something to do with it.

http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1200#post_24356995

Oxide: From our perspective, one of the surprising things about the results is just how good Nvidia's DX11 perf is. But that's a very recent development, with huge CPU perf improvements over the last month.

You have to wonder how they suddenly found this performance just before the benchmark launched. AMD should probably take a closer look at the game to see if there's any more cheating there.
 
Feb 19, 2009
10,457
10
76
I think you're misunderstanding. Asynchronous compute lets you use compute cores that are idle because they're waiting on other parts of the pipeline to finish. Asynchronous compute is a good thing, obviously, but a perfect system would not have idle time that you needed to specifically target.

No, this is not true; it's a myth. The purpose of AC is not simply to get higher uptime on the shaders.

The purpose of AC is to use the same shaders that are doing graphics work to do compute work in parallel, because the calculations the two kinds of tasks employ differ and thus can be done simultaneously.

[Image: Async_DX11_575px.png]


To maximize shader uptime, the command processor and uarch layout matter much more than the 8 ACEs. That's the bottleneck we see with Fury X at 1080p, where the shaders finish their graphics workload much faster than they can be fed. The problem is minimized at 4K, hence the higher relative performance we see there.

In a sense, AC adds efficiency, but it does not solve the problem of keeping the shaders fed.

That problem will be solved for GCN in DX12 thanks to multi-threaded rendering; we will see a big uplift in GCN 1080p performance in DX12 games even without AC usage.

In the context of NV's uarch, their DX11 driver is already multi-threaded, so they most likely have much higher uptime on their shaders already; their performance at low resolution is outstanding.
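
Since "multi-threaded rendering" keeps coming up next to async compute, it's worth separating the two. Below is a small sketch of the DX12 side of it (placeholder names, the actual draw recording elided): each worker thread gets its own allocator and command list, records its share of the frame, and the main thread submits everything in one call. This is the part that attacks the DX11 single-feed/draw-call problem, independent of whether any ACEs ever get used.

[CODE]
// Sketch: record command lists on several CPU threads, then submit them together.
// This illustrates DX12 multi-threaded command recording, not async compute.
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>
using Microsoft::WRL::ComPtr;

struct RecordedChunk {
    ComPtr<ID3D12CommandAllocator>    allocator;  // must stay alive until the GPU is done
    ComPtr<ID3D12GraphicsCommandList> list;
};

void RecordChunk(ID3D12Device* device, RecordedChunk& chunk)
{
    // Allocators and open command lists are single-threaded, so each worker owns its pair.
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                   IID_PPV_ARGS(&chunk.allocator));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                              chunk.allocator.Get(), nullptr, IID_PPV_ARGS(&chunk.list));
    // ... record this thread's share of the draw calls here ...
    chunk.list->Close();
}

void BuildAndSubmitFrame(ID3D12Device* device, ID3D12CommandQueue* directQueue)
{
    const int kWorkers = 4;
    std::vector<RecordedChunk> chunks(kWorkers);

    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i)
        workers.emplace_back(RecordChunk, device, std::ref(chunks[i]));
    for (auto& w : workers)
        w.join();

    std::vector<ID3D12CommandList*> lists;
    for (auto& c : chunks)
        lists.push_back(c.list.Get());
    directQueue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());
}
[/CODE]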

I'm seeing a trend: prior to this, async compute was great. Now that Maxwell can't do it, some individuals here are starting to downplay the feature. Well, good luck, because it's one of the major features allowing devs to extract peak performance out of console hardware.
 
Last edited:


tential

Diamond Member
May 13, 2008
7,348
642
121
AMD always meant GCN to be a compute monster; that's why they ditched the old VLIW design, created the ACE+CU architecture, and expanded the ACEs even further from the HD 7970 to the R9 290X.

Sorry, AMD just makes me lol as a company...

"You aren't using ACE's I see.... well why don't I double down on that and see how you like it then?!"
 

railven

Diamond Member
Mar 25, 2010
6,604
561
126
Sorry, AMD just makes me lol as a company...

"You aren't using ACE's I see.... well why don't I double down on that and see how you like it then?!"

In 2011 when HD 7970 was paper launched, no one thought they'd be posting in a thread over at ATF in 2015 where GCN was beating Nvidia's card.

The millions lost in sales, the countless complaints about launching with immature drivers, raising the price from $360 to $550 for the top card.

It was all for this day, when AMD is the top dog! Wait this game isn't even out yet, okay, in 2016 I will repost this.
 

tential

Diamond Member
May 13, 2008
7,348
642
121
In 2011 when HD 7970 was paper launched, no one thought they'd be posting in a thread over at ATF in 2015 where GCN was beating Nvidia's card.

The millions lost in sales, the countless complaints about launching with immature drivers, raising the price from $360 to $550 for the top card.

It was all for this day, when AMD is the top dog! Wait this game isn't even out yet, okay, in 2016 I will repost this.

You mad my HD7950 is about to wreck your 980Ti in this game....
I'm not... since you have a 980Ti you can play on right now.

Pretty sweet that I lost an auction for an R9 290 under $200 because I forgot my eBay password, since they FORCED me to use special characters. Thanks eBay, now I have the option of being always signed in, or having to request a new password each time because it deems my password too "unsafe".
Two R9 290s for under $200 that I couldn't get; that's lame.
 
Feb 19, 2009
10,457
10
76
Did you guys read this article?
http://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php?print=1

There's a few more with Sony & MS as well.

I have a theory:

GCN was originally designed with consoles as its focus; with the help/request of Sony, AMD made compute a focus of the uarch. It's great for a close-to-metal API that can fully expose the entire hardware to extract performance, but it means GCN was crippled under DX11 on the PC, since its major strengths, parallel pipelines and async compute, cannot be exposed by DX11.

This led to the development of Mantle; it was really a necessity for AMD, since they had developed a uarch that needs a console-like API. Mantle was said to be similar to the Xbone API, which was co-developed by AMD/MS, so all they needed was a port of that to the PC. It forced MS into action, and so they moved to develop DX12, with similar roots in the Xbone API but made for more uarchs, which meant more work and more time to market.

This is the likely explanation for why Vulkan is so similar to DX12: they share the same roots in the Mantle/Xbone API.
 
Feb 19, 2009
10,457
10
76
To me, what it does indicate is that the bottleneck is in the shader engine, because the sum of graphics+compute is still there, yet it is able to execute up to 31 queues.
But somehow it is not able to deal with both types of inputs at the same time.

The guys at B3D are theorizing that small batch counts, like in this program, run faster on NV.

GCN has no performance penalty for huge batch counts but it doesn't have a performance bonus for small batch counts.

It's possible, but hard to say without knowing more about the uarch and the program.

But it's a nice coincidence since in the GameWorks VR programming guide, NV recommends developers to split up large batches into many smaller ones to reduce latency of Async Timewarp on their hardware.

Either way, I'll let those guys do the interpretation.
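
For what it's worth, here's the trivial form of what "splitting large batches" means in that GameWorks VR advice (the numbers are made up, this is just an illustration): the same total work submitted as many small dispatches instead of one big one. More dispatch boundaries give the hardware more points where higher-priority work can slot in, at the cost of extra submission overhead.

[CODE]
// Illustration only: the same 4096 thread groups submitted as one big dispatch
// vs. 64 smaller ones. More dispatch boundaries give the GPU more opportunities
// to preempt or interleave other work between them. (A real shader would also
// need a per-dispatch offset, e.g. via root constants, so each small dispatch
// works on its own slice of the data.)
#include <d3d12.h>

void DispatchOneBigBatch(ID3D12GraphicsCommandList* cl)
{
    cl->Dispatch(4096, 1, 1);
}

void DispatchManySmallBatches(ID3D12GraphicsCommandList* cl)
{
    for (UINT i = 0; i < 64; ++i)
        cl->Dispatch(64, 1, 1);
}
[/CODE]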
 

Shivansps

Diamond Member
Sep 11, 2013
3,918
1,570
136
That little program is a bit confusing:

390X

1. 52.28ms
Graphics only: 27.55ms (60.89G pixels/s)
Graphics + compute:
1. 53.07ms (31.62G pixels/s)

GTX960
1. 11.21ms
Graphics only: 41.80ms (40.14G pixels/s)
Graphics + compute:
1. 50.54ms (33.19G pixels/s)

Async shaders or not, to me it looks like the GTX 960 is executing the same task in the same amount of time, at least to the extent of what that program can do.
 

Osjur

Member
Sep 21, 2013
92
19
81
Yes, when compared to the other camp, Nvidia seems to be faster with small batches, and AMD just has more horsepower for bigger chunks: the compute load in ms doesn't increase even when going to 128 parallel kernels, whereas Nvidia slows down after going over 32 kernels.

But still, if the Ashes benchmark isn't a valid benchmark, then this little program is even less of one. It just tests whether the GPU can do graphics + compute at the same time.
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
6,750
12,479
136
That little program is a bit confusing:

390X

1. 52.28ms
Graphics only: 27.55ms (60.89G pixels/s)
Graphics + compute:
1. 53.07ms (31.62G pixels/s)

GTX960
1. 11.21ms
Graphics only: 41.80ms (40.14G pixels/s)
Graphics + compute:
1. 50.54ms (33.19G pixels/s)

Async shaders or not, to me it looks like the GTX 960 is executing the same task in the same amount of time, at least to the extent of what that program can do.

That's the key: it's not designed to be a performance comparison, only to test whether asynchronous compute is functional or not. The person who created the test is looking into it, as there are several things that could be causing the disparity, but he didn't write it with any goal other than testing functionality. FWIW, another poster chimed in and said that in his experience GCN had faster dispatch when going to compute than Maxwell. Also, the creator is using CPU timestamps, which could give erroneous results (vs GPU timestamps), meaning a bit more investigation is needed, but the initial results seem to confirm Oxide's statements.
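
On the CPU vs GPU timestamp point, this is roughly what GPU-side timing looks like in D3D12 (buffer creation and the fence wait are elided, names are placeholders): the timestamps are written on the GPU timeline around the work being measured, so CPU-side submission and driver overhead don't pollute the numbers the way a CPU timer can.

[CODE]
// Sketch: bracket the measured GPU work with timestamp queries and resolve them
// into a readback buffer. 'readback' is assumed to be a buffer in a READBACK heap
// with room for two UINT64 values.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void TimeOnGpu(ID3D12Device* device,
               ID3D12CommandQueue* queue,
               ID3D12GraphicsCommandList* cmdList,
               ID3D12Resource* readback)
{
    D3D12_QUERY_HEAP_DESC qhDesc = {};
    qhDesc.Type  = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
    qhDesc.Count = 2;
    ComPtr<ID3D12QueryHeap> queryHeap;
    device->CreateQueryHeap(&qhDesc, IID_PPV_ARGS(&queryHeap));

    cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0);  // start stamp
    // ... the Dispatch()/Draw() calls being measured go here ...
    cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 1);  // end stamp
    cmdList->ResolveQueryData(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP,
                              0, 2, readback, 0);
    cmdList->Close();

    ID3D12CommandList* lists[] = { cmdList };
    queue->ExecuteCommandLists(1, lists);
    // ... wait on a fence, then Map() 'readback' to read the two UINT64 ticks ...

    UINT64 ticksPerSecond = 0;
    queue->GetTimestampFrequency(&ticksPerSecond);
    // elapsedMs = (endTicks - startTicks) * 1000.0 / ticksPerSecond;
}
[/CODE]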
 

Tapoer

Member
May 10, 2015
64
3
36
Async compute is more a patch for a somewhat broken system than a feature.
Here an Intel employee explains what an ideal pipeline should be:

And let's remember, an ideal architecture would not require additional parallelism to reach full throughput, so while the API is nice to have, seeing "no speedup" from async compute is not a bad thing if it's because the architecture had no issues keeping the relevant units busy without the additional help :) It is quite analogous to CPU architectures that require higher degrees of multi-threading to run at full throughput vs. ones with higher IPCs.
https://forum.beyond3d.com/threads/dx12-performance-thread.57188/page-5#post-1868411

I agree with what some have said.
AC seems to do what SMT does, and SMT exists because it's hard to reach 100% load on a big core with only one thread.

That Intel employee could have said the same thing about SMT/Hyper-Threading.
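
Since the SMT comparison keeps coming up, here's a tiny CPU-only toy (entirely my own analogy, nothing to do with any GPU or API) that shows the underlying idea: overlapping a latency-bound task with a compute-bound one costs roughly the longer of the two instead of the sum. It also shows why a design that is already near full utilization gains little from the overlap.

[CODE]
// Rough CPU-side analogy (not GPU code): one task that mostly waits on memory and
// one that keeps the ALUs busy. Run back to back they cost t1 + t2; overlapped on
// two hardware threads they cost roughly max(t1, t2), because one thread's stalls
// are filled with the other's work.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

volatile double sink;  // keeps the optimizer from discarding the work

void latencyBound(const std::vector<int>& next)  // pointer chasing across 64 MB
{
    int i = 0;
    for (int step = 0; step < 10'000'000; ++step) i = next[i];
    sink = i;
}

void computeBound()                              // dependent floating-point math
{
    double x = 1.0;
    for (int i = 0; i < 100'000'000; ++i) x = x * 1.0000001 + 1e-9;
    sink = x;
}

int main()
{
    std::vector<int> next(1 << 24);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937{42});

    auto time = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    double serial = time([&] { latencyBound(next); computeBound(); });
    double overlapped = time([&] {
        std::thread t(latencyBound, std::cref(next));
        computeBound();
        t.join();
    });
    std::printf("serial: %.2fs  overlapped: %.2fs\n", serial, overlapped);
}
[/CODE]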
 

desprado

Golden Member
Jul 16, 2013
1,645
0
0
Even Project Cars didn't market Nvidia this much.
Now I think any doubts people had about Oxide should be cleared up: their main priority is only AMD.