(Discussion) Futuremark 3DMark Time Spy Directx 12 Benchmark

Azix · Jul 15, 2016

To keep the other thread to results.

My view of the benchmark currently is that their thinking behind it was flawed. Keeping to the lowest common denominator is not what games will do. I think we overestimate the difficulty involved in implementing features that some GPUs do not support (writing different code paths). It may take a lot of work, but its work that does not need to be done over and over for each game. The game engine is the core of the game and, if I am understanding it correctly, once that has proper support every game built on it should (maybe with some tweaking).

I think the better approach would be to target the same visuals while exploiting, as much as is reasonable, every dx12 feature.

This is probably a question of what a dedicated benchmark should represent. On one hand it could say this GPU is capable of pulling off identical visuals faster than this other GPU while exploiting features the other GPU might not have. On the other hand it could just say this GPU is faster than this other GPU using the same tools to produce the same visuals, even though it can't do it faster if the other GPU was using more efficient tools that its capable of using.

IMO since the software can check what the GPU supports and proceed appropriately, I think the previous case is more representative.

Some background: the dx12 benchmark uses fl11_0 and what looks like less advanced asynchronous compute. Probably to put all hardware on the same field. Limited to Kepler level dx12 support.

There are also weird things in the technical guide that I don't touch on because I am not familiar. Eg. Their use of Ray tracing and the effects they chose for both graphics tests. Did not sound like what games are likely to use heavily, but those things might be suitable to show hardware capabilities

Erenhardt · Jul 15, 2016

Could you give a short TL DR background on the issue? I just noticed Timy Spy released and don't know what your talking about

Azix · Jul 15, 2016

Erenhardt said:
Could you give a short TL DR background on the issue? I just noticed Timy Spy released and don't know what your talking about

on what principles should a dedicated benchmark be built. The two I see is identical visuals with different features used or identical visuals with the same. Others can suggest different

railven · Jul 15, 2016

YEah going to need a little more info. I haven't bought the benchmark or ran through the results thread.

What little I saw last night before going to do something else - I'm assuming there is a different code path for Async Computer for Nvidia versus AMD?

I assume this based on the results for that particular test where it showed NV doing relatively well.

Is that the issue? Possible two code paths one?

And I highly disagree with this part from the OP:

Keeping to the lowest common denominator is not what games will do.

As someone who buys a lot of non-AAA titles, usually by Japanese studios, you are kidding yourself if you they they will be pushing the boundaries. I'm on part 3 of the Dark Souls trilogy and trust me - that game is not going to push any boundaries (hell sad to see the horrendous pop-up in Dark Souls 3 versus the previous 2).

Majority of PC games will be port jobs that aren't going to push our hardware to limits. Ultimately this is what pushed me more to buying an Nvidia card. I found more issues with these non-AAA games on my Radeons than I have found on running a GeForce.

I don't even think these games support DX11 fully. And there are so many of them on Steam.

On the subject of a benchmark, that I would argue is different. But most benchmarks are outliers that push hardware to limits majority of the software commonly used will not.

Red Hawk · Jul 15, 2016

The point of dedicated benchmarks like 3DMark, as I see it, is at its most basic to provide a repeatable, consistent test environment. This test environment can be used to do different things:

--To compare practical performance between different graphics cards. This has limited real-world usefulness though, as game performance will vary depending on the engine and things like 3DMark and Unigine aren't used in actual games. It's most useful to compare between graphics cards of the same architecture.

--To test theoretical performance. This is most useful as a tool for developers to test when they've made driver or hardware changes that should improve theoretical performance of some aspect of the chip.

--To test efficacy of different methods of rendering, like with the Star Swarm benchmark comparing different renderers.

--To test that your hardware setup is performing properly. This is a big benefit of 3DMark keeping track of user scores according to their hardware. If you score significantly lower than the average for users with the same hardware as you, that lets you know for sure that something is hampering performance on your setup.

railven · Jul 15, 2016

Red Hawk said:
The point of dedicated benchmarks like 3DMark, as I see it, is at its most basic to provide a repeatable, consistent test environment. This test environment can be used to do different things:

--To compare practical performance between different graphics cards. This has limited real-world usefulness though, as game performance will vary depending on the engine and things like 3DMark and Unigine aren't used in actual games. It's most useful to compare between graphics cards of the same architecture.

--To test theoretical performance. This is most useful as a tool for developers to test when they've made driver or hardware changes that should improve theoretical performance of some aspect of the chip.

--To test efficacy of different methods of rendering, like with the Star Swarm benchmark comparing different renderers.

--To test that your hardware setup is performing properly. This is a big benefit of 3DMark keeping track of user scores according to their hardware. If you score significantly lower than the average for users with the same hardware as you, that lets you know for sure that something is hampering performance on your setup.

I think this is why I even continue to have benchmark software. I didn't run 3DMark with my Radeon 7970(s) to beat GeForce users. I ran them to beat other 7970 users. Haha and most importantly, see if I got golden cards or lemons. (My first 7970 was a golden sample. I regret selling it off.

)

Flapdrol1337 · Jul 15, 2016

Azix said:
Some background: the dx12 benchmark uses fl11_0 and what looks like less advanced asynchronous compute. Probably to put all hardware on the same field. Limited to Kepler level dx12 support.

If it were limited to kepler level dx12 support there wouldn't even be async compute.

Silverforce11 · Jul 15, 2016

Flapdrol1337 said:
If it were limited to kepler level dx12 support there wouldn't even be async compute.

Async Compute isn't part of FL in DX12, it's just a feature by itself, GPUs do not have to support it actually to be labeled "DX12 compliant".

The basic driver overhead & multi-thread rendering applies to all DX12 compatible GPU.

Flapdrol1337 · Jul 15, 2016

Silverforce11 said:
Async Compute isn't part of FL in DX12, it's just a feature by itself, GPUs do not have to support it actually to be labeled "DX12 compliant".

The basic driver overhead & multi-thread rendering applies to all DX12 compatible GPU.

Aren't all features kind of like that?

If this benchmark is limited to kepler features why is the 380x wrecking the 280x?

Silverforce11 · Jul 15, 2016

Flapdrol1337 said:
Aren't all features kind of like that?

If this benchmark is limited to kepler features why is the 380x wrecking the 280x?

Do you have to ask?

3DMark already presented a table of their usage scenario. Tessellation went to the moon.

Tonga kills Tahiti in Tessellation performance. In games that are heavy on that, 380X > 280X. Far Cry 4, Fallout 4, etc.

Tahiti also has much less maximum queue support compared to Tonga so if there's an excessive usage of queues (ie, lots of small jobs), it will pull ahead of Tahiti.

Headfoot · Jul 15, 2016

I think the FL_11 approach is fine -- provided they are planning on a real FL_12 test too. I would expect this is just the first couple of tests. I bet there will be a VR focused one, as well as a FL_12 one coming in the future too.

Azix · Jul 15, 2016

Red Hawk said:
The point of dedicated benchmarks like 3DMark, as I see it, is at its most basic to provide a repeatable, consistent test environment. This test environment can be used to do different things:

--To compare practical performance between different graphics cards. This has limited real-world usefulness though, as game performance will vary depending on the engine and things like 3DMark and Unigine aren't used in actual games. It's most useful to compare between graphics cards of the same architecture.

--To test theoretical performance. This is most useful as a tool for developers to test when they've made driver or hardware changes that should improve theoretical performance of some aspect of the chip.

--To test efficacy of different methods of rendering, like with the Star Swarm benchmark comparing different renderers.

--To test that your hardware setup is performing properly. This is a big benefit of 3DMark keeping track of user scores according to their hardware. If you score significantly lower than the average for users with the same hardware as you, that lets you know for sure that something is hampering performance on your setup.

I realize there are finer uses for a benchmark. The main concern is that most gamers are going to use it to assume relative performance with a certain API when in reality it might be wrong. One card being about the same as another when in most games it would fall behind. It seems for the majority that's a problem.

I think the other more relevant use for 3dmark is making sure your hardware is working as it should and possibly stress testing.

yes its most useful within the same arch if comparing hardware. hardware/driver engineers will likely be using their own tools and probably a selection of games.

Headfoot said:
I think the FL_11 approach is fine -- provided they are planning on a real FL_12 test too. I would expect this is just the first couple of tests. I bet there will be a VR focused one, as well as a FL_12 one coming in the future too.

maybe. There is a demo of the VR one they are working on. Hopefully they do more on dx12 just out of curiosity of what it would be like. This one just seems stale. It probably took a lot of work to get where they are with it so I wouldn't expect a new one soon.

What I think the situation should be is including benchmarks in a majority of games. A dedicated benchmark has its uses but ultimately well made benchmarks in games will be better.

Maybe futuremark should start offering to make benchmarks for games or create tools to do so.

Hitman928 · Jul 15, 2016

Discussion going on over at overclock.net about async in 3dmark for those interested. Mostly driven by Mahigan so maybe we can get him to discuss it over here as well for those interested.

http://www.overclock.net/t/1605899/...me-spy-directx-12-benchmark/130#post_25348245

Mahigan said:
Here is what Pascal does...

The first feature nVIDA introduced is improved Dynamic Load Balancing. Basically.. the entire GPU resources can be dynamically assigned based on priority level access. So an Async Compute + Graphics task may be granted a higher priority access to the available GPU resources. Say the Graphics task is done processing... well a new task can almost immediately be assigned to the freed up GPU resources. So you have less wasted GPU idle time than on Maxwell. Using Dynamic load balancing and improved pre-emption you can improve upon the execution and processing of Asynchronous Compute + Graphics tasks when compared to Maxwell. That being said... this is not the same as Asynchronous Shading (AMD Term) or the Microsoft term "Asynchronous Compute + Graphics". Why? Pascal can’t execute both the Compute and Graphics tasks in parallel without having to rely on serial execution and leveraging Pascal’s new pre-emption capabilities. So in essence... this is not the same thing AMD’s GCN does. The GCN architecture has Asynchronous Compute Engines (ACE’s for short) which allow for the execution of multiple kernels concurrently and in parallel without requiring pre-emption.

What is pre-emption? It basically means ending a task which is currently executing in order to execute another task at a higher priority level. Doing so requires a full flush of the currently occupied GPC within the Pascal GPU. This flush occurs very quickly with Pascal (contrary to Maxwell). So a GPC can be emptied quickly and begin processing a higher priority workload (Graphics or Compute task). An adjacent GPC can also do the same and process the task specified by the Game code to be processed in parallel (Graphics or Compute task). So you have TWO GPCs being fully occupied just to execute a single Asynchronous Compute + Graphics request. There are not many GPCs so I think you can guess what happens when the Asynchronous Compute + Graphics workload becomes elevated. A Delay or latency is introduced. We see this when running AotS under the crazy preset on Pascal. Anything above 1080p and you lose performance with Async Compute turned on.

Both of these features together allow for Pascal to process very light Asynchronous Compute + Graphics workloads without having actual Asynchronous Compute + Graphics hardware on hand.

Talking about async in 3dmark.

Mahigan said:
AI keeps the CPU busy.

This affects AMD under DX11 for other reasons but for nVIDIA it affects their drivers ability to schedule tasks. If you have less CPU load then you have more CPU time in order to efficiently utilize nVIDIAs software scheduling features.

I never said AI had anything to do with Async-compute. I said that AI affects the CPU and that the CPU is what feeds the GPU.

As for Complex AI affecting both AMD and nVIDIA in the same manner... Not under DX12/Vulkan. Because of ACEs and the hardware scheduler found on GCN coupled with the parallel nature of GCN (several ACEs) which work in conjunction with the Multi-Threaded rendering capabilities of the DX12 and Vulkan APIs.

For 3DMark... we do not have any of that pesky AI to worry about. In fact we do not even know if Asynchronous Compute + Graphics is being used or just Asynchronous Compute. If it is the latter then this would explain why both a GTX 970 and GTX 980 Ti can run the code properly without a performance loss. They both should be seeing a performance loss if this test actually is using Asynchronous Compute + Graphics. Something tells me it is not due to this.

Talking about GCN1 and small performance increase from async.

Mahigan said:
Well once the ACEs execute a compute job... the job is assigned a priority level and is placed into the Ultra Threaded Dispatch Processor. This processor handles assigning tasks to available compute resources (CUs).

If there are no available resources then the tasks pretty much fall asleep. This of course introduces latency. If the GPU is already overloaded without using Async Compute then turning on Async Compute will not help.

Kind of the same reason why Pascal has issues with AotS when using the Crazy preset. There just arent any idling resources for the Dynamic Load Balancing to work its magic.

This is also why Async Compute only offers tiny performance improvements on the XBox One and PS4.

An example that counters this is the FuryX under Doom. That GPU has a TON of idling resources and Doom makes ample use of them resulting in some pretty impressive gains.

Vega will be another GPU which will likely benefit a lot from Vulkan and DX12 due to the more-than-likely healthy amount of compute resources on tap.

More in the linked thread.

Bacon1 · Jul 15, 2016

Dang that is a lot of info, I miss his posts here.

Silverforce11 · Jul 15, 2016

Mahigan is almost correct. It's actually better than he said. Due to support of fine-grained preemption, NV has added extra cache and feedback mechanisms to ensure each SMX can halt it's task and switch to a new task, then resume. This ability gives each SMX independence from the main GPC.

In simple terms, Pascal's each SMX can now work on graphics or compute context individually. Whereas on Maxwell it was global (all SMX on graphics OR compute), a limitation which meant it took a performance hit when you try to do graphics + compute.

With their dynamic load balancing, they can schedule compute to idling SMX and graphics to the others. Under light parallel compute usage and lower resolution (where not all the SMX are required for peak rendering performance), this approach can actually result in a performance uplift.

However, when all SMXs are busy (4K demanding game), I expect there to be less performance (or none, if heavy Async is used) gains.

This is still not at GCN's level of granularity, where AMD's SMX (Compute Unit, blocks of shaders) can operate graphics + compute at the same time, ie, true async shaders.

It actually all makes sense now, as we've seen benches with MS's Async Compute example code, under light usage, Pascal gains a few %, but it drops hard as AC usage increases.

In real gameplay terms, 1080 and 1440p, Pascal has some idling SMX which can be put to use via Async Compute, as long as it's not too much compute, there will be a performance gain.

Arachnotronic · Jul 15, 2016

Silverforce11 said:
Mahigan is almost correct. It's actually better than he said. Due to support of fine-grained preemption, NV has added extra cache and feedback mechanisms to ensure each SMX can halt it's task and switch to a new task, then resume. This ability gives each SMX independence from the main GPC.

In simple terms, Pascal's each SMX can now work on graphics or compute context individually. Whereas on Maxwell it was global (all SMX on graphics OR compute), a limitation which meant it took a performance hit when you try to do graphics + compute.

With their dynamic load balancing, they can schedule compute to idling SMX and graphics to the others. Under light parallel compute usage and lower resolution (where not all the SMX are required for peak rendering performance), this approach can actually result in a performance uplift.

However, when all SMXs are busy (4K demanding game), I expect there to be less performance (or none, if heavy Async is used) gains.

This is still not at GCN's level of granularity, where AMD's SMX (Compute Unit, blocks of shaders) can operate graphics + compute at the same time, ie, true async shaders.

It actually all makes sense now, as we've seen benches with MS's Async Compute example code, under light usage, Pascal gains a few %, but it drops hard as AC usage increases.

In real gameplay terms, 1080 and 1440p, Pascal has some idling SMX which can be put to use via Async Compute, as long as it's not too much compute, there will be a performance gain.

This is not correct. Pascal's dynamic load balancing actually happens within the SMs. In other words, within a given SM, shaders can be re-allocated from compute to graphics or vice versa.

Silverforce11 · Jul 15, 2016

Arachnotronic said:
This is not correct. Pascal's dynamic load balancing actually happens within the SMs. In other words, within a given SM, shaders can be re-allocated from compute to graphics or vice versa.

It's been awhile since I've read that whitepaper. Do you know which page it's on?

Arachnotronic · Jul 15, 2016

It's not explicitly in the white paper. I asked NVIDIA directly and that's what they told me.

Silverforce11 · Jul 15, 2016

Arachnotronic said:
It's not explicitly in the white paper. I asked NVIDIA directly and that's what they told me.

You asked them and they told you...

Such an important technical information that they did not publish anywhere else! What the hell. It benefits NV to actually PUBLISH those bits of technical info.

In their whitepaper, the implications are SMX level granularity for preemption because the SMX share cache that are used to store currently executed queues in their halted state, to switch to a new task and later resuming via the cache. Per shader granularity implies each shader has distinct cache, which AFAIK does not exist on the Pascal architecture. It's SMX level cache.

Arachnotronic · Jul 15, 2016

Silverforce11 said:
You asked them and they told you...

Such an important technical information that they did not publish anywhere else! What the hell. It benefits NV to actually PUBLISH those bits of technical info.

In their whitepaper, the implications are SMX level granularity for preemption because the SMX share cache that are used to store currently executed queues in their halted state, to switch to a new task and later resuming via the cache. Per shader granularity implies each shader has distinct cache, which AFAIK does not exist on the Pascal architecture. It's SMX level cache.

I read through the whitepaper again, and I don't see them really explicitly say that. What the paper says is that Maxwell previously divvied up the machine upfront into parts that work on compute and parts that work on graphics. When one of those finishes (say, graphics workload is done), the part of the GPU that just went idle can now be filled with compute work if there's still compute left to be done.

The way I understood it is that in Maxwell, the CUDA cores are dedicated to compute/graphics upfront. If one of these workloads finishes before the other, then the CUDA cores within the SM sit there twiddling their thumbs.

With Pascal, it seems that those CUDA cores that would previously need to sit there doing nothing can now be filled with work.

Anyway, it's good to see that in Time Spy Pascal sees gains with asynchronous compute. I don't want resources in my GPU to sit there idle if they can be used.

AnandThenMan · Jul 15, 2016

Arachnotronic said:
It's not explicitly in the white paper. I asked NVIDIA directly and that's what they told me.

Who from Nvidia told you this and what exactly did they say.

Bacon1 · Jul 15, 2016

Arachnotronic said:
It's not explicitly in the white paper. I asked NVIDIA directly and that's what they told me.

Source?

ThatBuzzkiller · Jul 16, 2016

Silverforce11 said:
Async Compute isn't part of FL in DX12, it's just a feature by itself, GPUs do not have to support it actually to be labeled "DX12 compliant".

The basic driver overhead & multi-thread rendering applies to all DX12 compatible GPU.

That's not true ...

You do need to be feature level 12.0 compliant to use async compute on DX12 since it requires changes in the runtime to support async compute ...

Note that a GPU that doesn't support feature level 12.0 or above will automatically default to the Direct3D 11.3 runtime which will restrict exposing the GPUs further capabilities such as async compute or resource binding ...

Silverforce11 · Jul 16, 2016

ThatBuzzkiller said:
That's not true ...

You do need to be feature level 12.0 compliant to use async compute on DX12 since it requires changes in the runtime to support async compute ...

Note that a GPU that doesn't support feature level 12.0 or above will automatically default to the Direct3D 11.3 runtime which will restrict exposing the GPUs further capabilities such as async compute or resource binding ...

I meant GPUs don't need to support Async Compute to be labeled DX12 compatible. Example, Fermi, Kepler, Maxwell etc.

It's an optional feature.

Red Hawk · Jul 16, 2016

ThatBuzzkiller said:
That's not true ...

You do need to be feature level 12.0 compliant to use async compute on DX12 since it requires changes in the runtime to support async compute ...

Note that a GPU that doesn't support feature level 12.0 or above will automatically default to the Direct3D 11.3 runtime which will restrict exposing the GPUs further capabilities such as async compute or resource binding ...

My understanding is that the lower level nature of Direct3D 12/Vulkan exposes a GPU's asynchronous compute capabilities for programmers to use, thus making D3D12/Vulkan support a prerequisite for asynchronous compute, but technically asynchronous compute is still not an official feature of those APIs.

On the subject of asynchronous compute and Time Spy, for kicks I switched my brother's Radeon 260X into my PC so I could test it (Time Spy refuses to run on his PC for whatever reason). I had seen results from that Overclock3D thread with a 7870 showing no gain from asynchronous compute, and Mahigan explained that's probably because at that level and below all of the GPU's compute power is probably being used already. When a GPU can't provide the resources for async compute, it can even cause internal lag and slow down rendering, as seen with Maxwell chips trying to use a-compute at crazy detail levels in AOTS vs running with a-compute turned off. My tests showed that to pretty much be the case with a 260X -- there was no benefit from turning a-compute on, and in fact the scores with a-compute on were slightly lower than the scores with it off.

Point is, asynchronous compute, at least its implementation in Time Spy, does not help low-end AMD GPUs like the 260X.

(Discussion) Futuremark 3DMark Time Spy Directx 12 Benchmark

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Lifer

Golden Member

Lifer

Diamond Member

Golden Member

Diamond Member

Diamond Member

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer

Diamond Member

Diamond Member

Golden Member

Lifer

Diamond Member