Ashes of the Singularity User Benchmarks Thread

antihelten · Sep 8, 2015

Carfax83 said:
Are you serious?

Software does this ALL OF THE TIME. Why do you think CPUs have branch prediction units? Rendering is an ordered process, so scheduling ahead of time is absolutely feasible and does not require a rendering engine to be psychic.

Games have been doing this for years. If a game is linear, which most games are, then scheduling ahead of time is easy. Even with the latest open world games like the Witcher 3, it can still be done effectively and in such a way, that NO LOADING screens are necessary..

You completely underestimate the ingenuity of developers in overcoming obstacles..

You're conflating things (again). Branch prediction is for conditional branches in an instruction pipeline, this has nothing to do with the kind of scheduling we're talking about here (which takes place before the job has even entered the graphics/compute pipeline)

And no it is not currently done in games, what you're thinking of is streaming of data (texture/mesh/map data etc.) which again is a very different issue from the kind of scheduling we're talking about here (we're talking about scheduling of tasks within the individual frame pipeline).

Headfoot · Sep 8, 2015

ITT: Hybrid software and hardware based scheduling of GPU tasks = ok. However, hybrid software and hardware decoding and encoding of HEVC h.265 = not okay as of the day gm206 was released (i.e. Haswell and newer iGPU based decode not good enough apparently).

I also distinctly remember saying it is possible there might be a way to implement dx12_1 Raster Ordered Views as a shader kernel/GPGPU software approach on GCN (or Intel's older architectures perhaps even, though I'm not familiar enough to say) and I was immediately shot down "You can't emulate feature level 12_1 that doesn't count."

Bottom line, it seems the opinion getting pushed hard in a 1000 post thread is that it's ok if/when nvidia uses a hybrid software/hardware/gpgpu emulated approach, but its not ok if/when AMD does. Typical.

Ultimately this all seems rather premature. What matters is what performs best at each price point, with all factors coming together in final, shipped games. 2 or 3 dx12 native games should be enough to start seeing a trend, and 6-9 games should be rather conclusive. Unlike some posters, I don't care what company is doing GPGPU/software emulation vs hardware implementation and I don't care which feature it is that's being emulated as long as its fast enough.

AnandThenMan · Sep 8, 2015

Carfax83 said:
For God's sake man, look at the entire context of the discussion rather than focusing in on a couple of sentences. You're worse than a journalist.

I AM looking at the entire context you're bringing up a point that doesn't need to be repeated, honestly that reminds of nonsense journalism. But you keep doing it anyway...

I'm not comparing console hardware to a PC. I'm making a point that despite the handicaps that the PC platform has compared to the consoles (ie high overhead APIs, NUMA, discrete components), the hardware is so powerful that it can overcome them with ease..

Why are you making this point? It is about as obvious as it gets the hardware is much more powerful so it can make up for wasted processor cycles. But if the hardware was so powerful that there were no other concerns we would have never seen Mantle/DX12/Vulkan.

sontin · Sep 8, 2015

Headfoot said:
I also distinctly remember saying it is possible there might be a way to implement dx12_1 Raster Ordered Views as a shader kernel/GPGPU software approach on GCN (or Intel's older architectures perhaps even, though I'm not familiar enough to say) and I was immediately shot down "You can't emulate feature level 12_1 that doesn't count."

You can emulate everything with a software renderer. There is a reason why dedicated hardware and even dedicated units are used to do things.
Neither CR nor ROV is practicable to use with shaders or an emulation through memory ordering. Would it be usefull developers had made games with it.

Asynchronous Compute will do nothing if the [sic:AMD] hardware hasnt support for it. Emulation ROV will slow down the GPU and make this graphics feature literally useless.

Headfoot · Sep 8, 2015

sontin said:
You can emulate everything with a software renderer. There is a reason why dedicated hardware and even dedicated units are used to do things.
Neither CR nor ROV is practicable to use with shaders or an emulation through memory ordering. Would it be usefull developers had made games with it.

Asynchronous Compute will do nothing if the AMD hasnt support for it. Emulation ROV will slow down the GPU and make this graphics feature literally useless.

But its ok to emulate fine grained hardware thread scheduling for asynchronous compute purposes in software right?

sontin · Sep 8, 2015

There is no emulation. There are three command queue types and the driver is responsible to link these to the right engine.

Headfoot · Sep 8, 2015

so it's ok that software controls the command queues, incurring all the same penalties a CPU based software approach entails (latency and limited width of pcie bus, use of cpu cycles, etc.) instead of in-GPU threading and command hardware?

sontin · Sep 8, 2015

There is little thing known as driver between the application and the GPU with Direct3D.

TheELF · Sep 8, 2015

No there is an API (d3d/dx 12) AND a driver between the application and the GPU.

Headfoot · Sep 8, 2015

sontin said:
There is little thing known as driver between the application and the GPU with Direct3D.

Its funny that you staunchly avoid answering the original question and instead reply with a condescending one-liner. The question is: why is it not OK to software-implement ROVs and other dx 12_1 features, but it is perfectly ok to software-implement certain aspects of thread scheduling?

I can only assume you refuse to answer the question because the answer isn't one you don't like.

It's simple. Either acceptably performant software implementations of tasks that *could* be accelerated by hardware is OK, or it is not. The answer does not depend on the company that does it, only on whether the feature is acceptably fast enough in software.

Please answer this additional question with a yes or no. Theoretically, if all dx 12_1 features could be implemented on a card in a moderately fast manner via shader kernel, GPGPU, or other software, would you consider that card dx 12_1 compliant? For purposes of this question, assume a hardware implementation would be 30% faster, but that games would still have playable framerates on the software implementation. Further assume that the driver would tell applications that the feature is available in hardware, and it would silently run the software implementation at the driver level without the application knowing. Would that card be dx 12_1 compliant: yes or no.

Carfax83 · Sep 8, 2015

antihelten said:
You're conflating things (again). Branch prediction is for conditional branches in an instruction pipeline, this has nothing to do with the kind of scheduling we're talking about here (which takes place before the job has even entered the graphics/compute pipeline).

You keep goalpost shifting. First it was about latency, and now it's about scheduling. You're insinuating that the CPU, the master of OoO processing, can't even schedule a workload ahead in anticipation of using it, despite the clear evidence that it does this all of the time.

No, the CPU isn't psychic, but the programmers who develop the game know what to expect and program trigger points to let the CPU know what data to load in advance..

Even so, OoO execution processes data based on availability, and so the CPU can basically circumvent the order in which the instructions are received if it needs to do so to prevent any stalling.

Out of order execution

Do you play any games at all? Developers have been doing this for years.

Carfax83 · Sep 8, 2015

AnandThenMan said:
Why are you making this point? It is about as obvious as it gets the hardware is much more powerful so it can make up for wasted processor cycles. But if the hardware was so powerful that there were no other concerns we would have never seen Mantle/DX12/Vulkan.

This is a flawed argument, because it doesn't take into account the wasted potential of the PC platform solely due to the high API overhead and lack of parallelism.

Yes, the raw performance of the PC has made up for a lot of the shortcomings of DirectX, but that still leaves a lot of untapped performance on the table just waiting to be exploited.

So this isn't about getting more performance, such as with faster hardware persay. This is about RECLAIMING performance that is already there, but cannot be used.

Which is exactly what DX12/Vulkan is about.

Carfax83 · Sep 8, 2015

Headfoot said:
The question is: why is it not OK to software-implement ROVs and other dx 12_1 features, but it is perfectly ok to software-implement certain aspects of thread scheduling?

Because the former IS EMULATION, whilst the latter is not..

The CPU taking on aspects of instruction scheduling is a natural workload for the CPU, and one which it is designed to do. If the CPU was actually processing these compute shaders itself, then you would have a point.

For whatever reason, AMD has decided to use hardware ACEs with basic OoO logic to schedule the compute workloads rather than using the CPU. This doesn't mean their approach is better though.. The CPU can do everything the ACEs can do and then some, as the OoO logic in a CPU is orders of magnitude more advanced...

Maybe AMD just didn't want to mess around with creating more driver overhead

Their driver department probably lacks the funding and manpower to pull it off properly..

monstercameron · Sep 8, 2015

Carfax83 said:
Because the former IS EMULATION, whilst the latter is not..

The CPU taking on aspects of instruction scheduling is a natural workload for the CPU, and one which it is designed to do. If the CPU was actually processing these compute shaders itself, then you would have a point.

For whatever reason, AMD has decided to use hardware ACEs with basic OoO logic to schedule the compute workloads rather than using the CPU. This doesn't mean their approach is better though.. The CPU can do everything the ACEs can do and then some, as the OoO logic in a CPU is orders of magnitude more advanced...

Maybe AMD just didn't want to mess around with creating more driver overhead Their driver department probably lacks the funding and manpower to pull it off properly..

I dont know about all that but I think it is safe to assume that the aces have fixed function hardware to handle all the scheduling needs while having less latency, using less power and less area than leaving all that work for the cpu.

antihelten · Sep 8, 2015

Carfax83 said:
You keep goalpost shifting. First it was about latency, and now it's about scheduling.

Oh FFS, this discussion is about latency in relation to scheduling, how bloody hard is that to understand?

Carfax83 said:
You're insinuating that the CPU, the master of OoO processing, can't even schedule a workload ahead in anticipation of using it, despite the clear evidence that it does this all of the time.

I've never said nor insinuated this, please stop it with the strawmen.

I said that if the CPU is used for scheduling tasks for the GPU (not for itself as you said), then it will likely incur a latency penalty, relative to the GPU doing it itself..

And you do understand that OoO is a technique to help with poor scheduling, and not a sign of superior scheduling capabilities. OoO basically allows the CPU to pull instructions from the instruction queue out of order and thus ignore the actual ordering/scheduling of tasks. And of course this has absolutely zero relevance for the discussion at hand, seeing as the instructions that has to be scheduled here are graphics and compute tasks which are not processed on the CPU, but on the GPU, which is not OoO.

Carfax83 said:
No, the CPU isn't psychic, but the programmers who develop the game know what to expect and program trigger points to let the CPU know what data to load in advance..

Even so, OoO execution processes data based on availability, and so the CPU can basically circumvent the order in which the instructions are received if it needs to do so to prevent any stalling.

Out of order execution

Do you play any games at all? Developers have been doing this for years.

Again with the false equivalences.

This has nothing to do with loading data ahead of time or OoO (the GPU, you know the thing that actually has to process the tasks, isn't OoO), it has to do with scheduling of compute/graphics/copying tasks within a GPU pipeline, which once again for the umptenth time, has very different timing dependencies compared to precaching of data.

And what in the world does playing games have to do with anything? Playing a game will not give you a proper understanding of the underlying programming, although I suppose if that is where you get all your knowledge from, it would explain quite a lot.

Carfax83 · Sep 8, 2015

monstercameron said:
I dont know about all that but I think it is safe to assume that the aces have fixed function hardware to handle all the scheduling needs while having less latency, using less power and less area than leaving all that work for the cpu.

Well that's a possibility, but unlikely. The ACEs have basic OoO logic and OoO consumes quite a bit of energy. Also, the 8 ACEs also take up die space which could be used for other things.

Speculation is useless though, as neither of us are AMD engineers so we'll just have to trust that AMD made the right decision for their particular architecture, just like NVidia made the right decision for theirs..

Carfax83 · Sep 8, 2015

antihelten said:
I've never said nor insinuated this, please stop it with the strawmen.

I like you how conveniently develop amnesia. Let me refresh your memory. Your own words:

You cannot load up your scheduling ahead of time, unless you have somehow come up with a way to make you rendering engine psychic (which is not entirely impossible actually, but it does require that your pipeline is extremely regular, which is rarely the case).

Source

I said that if the CPU is used for scheduling tasks for the GPU (not for itself as you said), then it will likely incur a latency penalty, relative to the GPU doing it itself..

And as I told you before, the CPU has a wide array of technology to combat latency, like large caches, SMT, OoO execution etcetera.. You act as though latency cannot be mitigated.

And of course this has absolutely zero relevance for the discussion at hand, seeing as the instructions that has to be scheduled here are graphics and compute tasks which are not processed on the CPU, but on the GPU, which is not OoO.

Oh does it now? Perhaps if you had kept up with the discussion I wouldn't have to be repeating myself constantly. The ACEs use OoO processing to check for completion, so OoO does have relevance to this discussion.

Check page 40 on this PDF.

Also OoO execution reduces latency as well by helping to prevent cache misses. The OoO execution the ACEs use are limited to checking for completion. A CPU wouldn't be limited to such, and could arrange the order of the instructions based on availability of resources.

This has nothing to do with loading data ahead of time or OoO (the GPU, you know the thing that actually has to process the tasks, isn't OoO), it has to do with scheduling of compute/graphics/copying tasks within a GPU pipeline, which once again for the umptenth time, has very different timing dependencies compared to precaching of data.

I never said the GPU is out of order, I said the ACEs use out of order to check for the command task completion.

And what in the world does playing games have to do with anything? Playing a game will not give you a proper understanding of the underlying programming, although I suppose if that is where you get all your knowledge from, it would explain quite a lot.

It matters, because a lot of the stuff you are saying contradicts what you find in games.

antihelten · Sep 8, 2015

Carfax83 said:
I like you how conveniently develop amnesia. Let me refresh your memory. Your own words:

You might want to read your own post again, you were talking about the CPU scheduling tasks for the CPU, I'm talking about the CPU scheduling tasks for the GPU (you know the only thing that's actually relevant in this case, since the GPU is the one doing the actual processing)

So no I did not say what you claimed.

Carfax83 said:
And as I told you before, the CPU has a wide array of technology to combat latency, like large caches, SMT, OoO execution etcetera.. You act as though latency cannot be mitigated.

And none of those things are relevant in the scenario we're discussing here. Again we're not talking about the CPU scheduling tasks for itself (in which case stuff like OoO would absolutely play a role), where talking about the CPU scheduling tasks for the GPU.

In this case the only real way to mitigate the communications latency between the CPU and the GPU, is to basically avoid (or minimize) the communication altogether, which is of course why you would prefer to handle the scheduling locally on the GPU.

Carfax83 said:
Oh does it now? Perhaps if you had kept up with the discussion I wouldn't have to be repeating myself constantly. The ACEs use OoO processing to check for completion, so OoO does have relevance to this discussion.

Check page 40 on this PDF.

Also OoO execution reduces latency as well by helping to prevent cache misses. The OoO execution the ACEs use are limited to checking for completion. A CPU wouldn't be limited to such, and could arrange the order of the instructions based on availability of resources.

Eerm, nobody's talking about AMD doing scheduling on the CPU, we're talking about Nvidia doing scheduling on the CPU, so the nature of AMD's ACEs are quite irrelevant, in this regard.

Carfax83 said:
It matters, because a lot of the stuff you are saying contradicts what you find in games.

In that case you shouldn't have any problems finding me an example of a game which handles the scheduling of asynchronized compute and graphics tasks on the CPU, I'll wait.

Carfax83 · Sep 9, 2015

antihelten said:
You might want to read your own post again, you were talking about the CPU scheduling tasks for the CPU, I'm talking about the CPU scheduling tasks for the GPU (you know the only thing that's actually relevant in this case, since the GPU is the one doing the actual processing)

So no I did not say what you claimed.

The context of my post was obvious. The GPU obviously cannot schedule instructions by itself, and this entire thread is about GPU performance in DX12, so how on Earth you thought I was talking about non 3D software is astonishing given that we have been discussing games the entire time.

And none of those things are relevant in the scenario we're discussing here. Again we're not talking about the CPU scheduling tasks for itself (in which case stuff like OoO would absolutely play a role), where talking about the CPU scheduling tasks for the GPU.

Well this post certainly explains a lot of why I am having such difficulty with you. You don't really seem to understand the fundamental basics of computing.

The CPU is the component which receives instructions from the program, and then sends them to the GPU. As such, ALL of the CPUs attributes matter when it comes to matters of performance in 3D games.

If they were irrelevant as you suggest, then they would not affect performance. But the benchmarks don't reflect this at all. Cache size, SMT, on die memory controllers, OoO execution, IPC etcetera ALL affect the performance of 3D games because as I said, it's the CPU that is ultimately feeding the instructions to the GPU.

In this case the only real way to mitigate the communications latency between the CPU and the GPU, is to basically avoid (or minimize) the communication altogether, which is of course why you would prefer to handle the scheduling locally on the GPU.

Latency is going to occur no matter what you do. The only thing that matters is how you deal with it.

AMD GPUs still have to contend with latency just like NVidia GPUs, since they both receive instructions from the CPU.

Eerm, nobody's talking about AMD doing scheduling on the CPU, we're talking about Nvidia doing scheduling on the CPU, so the nature of AMD's ACEs are quite irrelevant, in this regard.

You seem to think a lot of things are irrelevant. Might want to take a look at what this thread is about, before you dismiss AMD's ACEs are being irrelevant.

In that case you shouldn't have any problems finding me an example of a game which handles the scheduling of asynchronized compute and graphics tasks on the CPU, I'll wait.

If NVidia is to be believed, then they are programming that capability into their drivers right now and it will be available soon, hopefully before the first DX12 game ships this year..

But CPUs have been handling asynchronous compute scheduling for years now (in CUDA and OpenCL for instance). The only thing that's new is doing asynchronous compute in concurrence with rendering.

antihelten · Sep 9, 2015

Carfax83 said:
The context of my post was obvious. The GPU obviously cannot schedule instructions by itself, and this entire thread is about GPU performance in DX12, so how on Earth you thought I was talking about non 3D software is astonishing given that we have been discussing games the entire time.

If a GPU possesses the necessary hardware then it can in fact do the relevant scheduling itself. This is exactly what the ACEs does (remember we're specifically talking about scheduling of compute and graphics tasks in relation to async, not all scheduling)

And you're making it really hard to understand just what on earth you're talking about when you suddenly start talking about OoO on a CPU, when where talking about instructions that has to be executed on the GPU (which obviously makes the OoO capabilities of the CPU irrelevant)

Carfax83 said:
Well this post certainly explains a lot of why I am having such difficulty with you. You don't really seem to understand the fundamental basics of computing.

The CPU is the component which receives instructions from the program, and then sends them to the GPU. As such, ALL of the CPUs attributes matter when it comes to matters of performance in 3D games.

If they were irrelevant as you suggest, then they would not affect performance. But the benchmarks don't reflect this at all. Cache size, SMT, on die memory controllers, OoO execution, IPC etcetera ALL affect the performance of 3D games because as I said, it's the CPU that is ultimately feeding the instructions to the GPU.

Sigh, again with the strawman attacks, I see. I never said the capabilities of the CPU were irrelevant in general. I said that the capabilities of the CPU that you mentioned (OoO), were irrelevant when were talking about instructions that has to be executed on the GPU, and the OoO execution of the CPU will make no difference to the inherent latency of communication between the CPU and the GPU (across the PCIe bus).

You do understand that OoO on the CPU only affect instructions executed on the CPU right?

Carfax83 said:
Latency is going to occur no matter what you do. The only thing that matters is how you deal with it.

Latency is indeed going to occur no matter what, but all latency is not created equally. By having to do scheduling of async compute/graphics tasks on the CPU you introduce an extra layer of communication and thus an extra layer of latency.

Carfax83 said:
AMD GPUs still have to contend with latency just like NVidia GPUs, since they both receive instructions from the CPU.

The latency of scheduling done by the on die ACEs will almost certainly be massively shorter than the software scheduling done by Nvidia, which has to pass back and forth across the PCIe bus.

Carfax83 said:
You seem to think a lot of things are irrelevant. Might want to take a look at what this thread is about, before you dismiss AMD's ACEs are being irrelevant.

Again we're not discussing everything this thread is about. I specifically responded to you about Maxwell doing software scheduling. If you want to move the goalposts from that, then it's not really my problem.

Carfax83 said:
If NVidia is to be believed, then they are programming that capability into their drivers right now and it will be available soon, hopefully before the first DX12 game ships this year..

But CPUs have been handling asynchronous compute scheduling for years now (in CUDA and OpenCL for instance). The only thing that's new is doing asynchronous compute in concurrence with rendering.

So in other words you don't have any examples of what you claim.

Also you do understand that async compute in concurrence with rendering is the whole point of integrating it into DX12, right?

ThatBuzzkiller · Sep 9, 2015

CPU is scheduling for the GPU ?! What ?! How much more whack can that claim get ?

GPUs obviously have their own damn scheduler and ISA otherwise branching, latency, and compiling would be intractable problems ...

The only thing that a CPU does for a GPU is dispatching kernels, draw calls, and resource streaming ...

This whole noise only started because of a claim that Maxwell couldn't support concurrent execution with a graphics queue and a compute queue ...

Exhibit #1:

Do you see "dedicated" compute engines on this diagram of Maxwell ? I guess not ...

Exhibit #2:

Do you see "dedicated" compute engines with GCN ? The answer is obviously yes ...

In conclusion, Nvidia DOES NOT have hardware support to handle graphics queue in conjunction with a compute queue but AMD DOES. End of story ...

I rest my case about this whole issue of "asynchronous compute" ...

Any trolls that try to claim that Nvidia's "Hyper-Q" technology is at all comparable to AMD's "ACE" are just that, ignorant ...

Erenhardt · Sep 9, 2015

Carfax83 said:
You make it so easy to nullify your arguments.

So? Why didn't you nuliffy my point then?

First off, the PC version is using VHQ settings in the graph, which means that the draw distance is longer and there are more particle effects.

No [gówno] Sherlock. Lets compare a PS4 with 7870 level of graphics to a rig with Asus Ares 2 - Two cards that are two times more powerfull each. And for the sake of argument run games in ubersampling... to compare the CPUs inside those rigs.:thumbsdown:

Secondly, the average frame rate is still much higher than it is on the PS4 game which is locked at 60 FPS.

You just tried to make a point, bet then refuted it yourself. OK

Last but not least, you are using the campaign to illustrate your point, which is intellectually dishonest.

Multiplayer is way more intensive than the SP campaign, and that's what BF4 players play.. DigitalFoundry did a great YouTube video which shows how the PS4 performs during multiplayer:

Battlefield 4 Final Code: PS4 multiplayer frame rate tests

The frame rate plummets to the low 40s fairly often on some maps..

Do NOT pretend that a PC with 2x OCed HD7970s paired with Phenom II x6, a freaken dinosaur of a CPU is GPU bound at 1080p...
And... I feel like you didn't finish your last point. If phenom x6 drops to 50 in the same scene where ps4 keeps 60, how does multiplayer is going to change that if we know that PC system has infinite graphics processing power compared to ps4?

It is quite simple, PS4 has better API that can fully utilize its resources, even better than mantle.

Why do you think mantle boosted performance by ~10 in GPU bottlenecked situations?
http://www.anandtech.com/show/7728/battlefield-4-mantle-preview

Better api is better.... as long as the hardware is there

Profanity isn't allowed in the technical forums.
-- stahlhart

showb1z · Sep 9, 2015

This thread is getting crazy. All the info we have now appears to show AMD has an advantage in async compute, why is this so hard to accept for some people. Deal with it and wait for benchmarks of released games to know how that translates in real performance.
It's as if AMD fanboys would start claiming Nvidia has no advantage in tesselation.

Zstream · Sep 9, 2015

showb1z said:
This thread is getting crazy. All the info we have now appears to show AMD has an advantage in async compute, why is this so hard to accept for some people. Deal with it and wait for benchmarks of released games to know how that translates in real performance.
It's as if AMD fanboys would start claiming Nvidia has no advantage in tesselation.

Yeah, I don't get it either. I guess people are young and emotional.

Spjut · Sep 9, 2015

showb1z said:
This thread is getting crazy. All the info we have now appears to show AMD has an advantage in async compute, why is this so hard to accept for some people. Deal with it and wait for benchmarks of released games to know how that translates in real performance.
It's as if AMD fanboys would start claiming Nvidia has no advantage in tesselation.

There's the argument that since GCN is in both the PS4 and Xbox One, GCN on PC will get a piggyback ride

Being a Kepler owner myself, I confess I do think it's discomforting to see AMD's closest equivalents suddenly pull ahead by a fair margin in this benchmark. I of course don't feel spite towards AMD, just a bit disappointment that Nvidia's architecture won't benefit as much from DX12.

Ashes of the Singularity User Benchmarks Thread

Golden Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Golden Member

Golden Member

Diamond Member

Senior member

Diamond Member

Senior member