computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

kondziowy · Feb 26, 2016

Didn't AMD replaced half of ACE engines in Fiji/Tonga to a newer versions compared to GCN 1.1? Would be nice to see power draw tests with more GCN 1.0/1.1 or Tonga cards. Are there any?

Dygaza · Feb 26, 2016

sontin said:
You can reduce the CPU workload and time:

https://developer.nvidia.com/transitioning-opengl-vulkan

nVidia cards dont take advantages from DX12 so the CPU overload should be much lower. But this doesnt happen on a GTX980TI... More proof that the rendering path is not optimized for nVidia's hardware.

More proof that you can't read. Their cpu power usage goes down.

980TI From 69W to ~55W (average of both async + non async)
Fury X from 64W to ~55W (average of both async + non async)

So 980Ti actually benefits more than Fury X. Also keep in mind that 980Ti actually still has software scheduler running on cpu. And yes, FX is a bit faster so it uses cpu a bit more.

Silverforce11 · Feb 26, 2016

Dygaza said:
More proof that you can't read. Their cpu power usage goes down.

980TI From 69W to ~55W (average of both async + non async)
Fury X from 64W to ~55W (average of both async + non async)

So 980Ti actually benefits more than Fury X. Also keep in mind that 980Ti actually still has software scheduler running on cpu. And yes, FX is a bit faster so it uses cpu a bit more.

The results for the Fury X makes a lot of sense, CPU power usage drops, GPU power usage rises, performance goes up a lot vs DX11 overall.

The thing that doesn't make sense, 980Ti DX11 vs DX12, power usage on GPU goes up without a corresponding performance up. Though, CPU power drops, as expected.

The 390X situation is the most bizarre, but I think it's that MSI model, uncapped bios, it goes nuts in Furmark too. What this suggests strongly is there needs to be a TDP limit in the bios, or DX12 could be the "Power Virus" for some cards.

This applies to all cards without a proper TDP limit in the bios, as seen here:

sontin · Feb 26, 2016

Dygaza said:
More proof that you can't read. Their cpu power usage goes down.

980TI From 69W to ~55W (average of both async + non async)
Fury X from 64W to ~55W (average of both async + non async)

So 980Ti actually benefits more than Fury X. Also keep in mind that 980Ti actually still has software scheduler running on cpu. And yes, FX is a bit faster so it uses cpu a bit more.

The CPU power consumption goes slighty down with the low level API. A true implementation of DX12 would reduce it way more because there is no advantages on nVidia hardware. DX12 allows for nearly 9x more draw calls. And in the detail graph the GTX980TI is way over 60W with DX12. The average number for this card is wrong.

BTW: the GTX980TI has no software scheduler running on the CPU. Stop it, pls. This fanfiction is annoying. Scheduling happens on the GPU. Read the Anandtech article from the GTX680 launch.

Dygaza · Feb 26, 2016

sontin said:
The CPU power consumption goes slighty down with the low level API. A true implementation of DX12 would reduce it way more because there is no advantages on nVidia hardware. DX12 allows for nearly 9x more draw calls. And in the detail graph the GTX980TI is way over 60W with DX12. The average number for this card is wrong.

You're actually, right , it looks like they have graphed their dx11 power consumption instead (even it says fastest method). Looking at 980 numbers kinda prove it.

DX12 allows for nearly 9x more drawcalls (yes in 3dmark API, where cpu doesn't really do anything else, while in real game your cpu budget is consumed by every other factor aswell (physics, AI , etc). Remember, at highest scenes there are ~35000 drawcalls / frame, while in 3dmark api you are pushing a lot more (I get 730000 drawcalls / frame). Gives you small idea how little cpu budget in games are used for drawcalls compared to test that is only about drawcall throughput.

Bottom line, your expectatations of how huge cpu power savings under dx12 should be , are way too optimistic. Sounds like you want to save so much energy, that you could charge your car's battery with all those savings.

Krteq · Feb 26, 2016

sontin said:
BTW: the GTX980TI has no software scheduler running on the CPU. Stop it, pls. This fanfiction is annoying. Scheduling happens on the GPU. Read the Anandtech article from the GTX680 launch.

Nope, nV itself tells the opposite - see Kepler and Maxwell whitepapers. Majority of scheduling tasks are performed in driver.

//And it's already explained in those AnandTech articles you are referring to

Shivansps · Feb 26, 2016

Silverforce11 said:
When has pclab ever been regularly consistent vs the rest of the tech sites?

This is why they are considered a joke site, similar to ABT.

[H] is doing all they can to be added to that list with their forum tirade against AMD too. :/ Not impressed at all.

But Anandtech also got an extrange result, and in some sites im seeing 10% gains on 1080P and 20% on 4K, how on earth you can better use of idle time where there is less idle time avalible?

Dygaza · Feb 26, 2016

Also when comparing test from different sites and their results. Make sure they are ran with newest (or same) drivers. For example Anand didn't have newest drivers in test, when some other sites did. And those drivers did have optimations for this benchmark version. And they did improve performance. How? Possibly with better Async support?

Mercennarius · Feb 26, 2016

So I ran the Ashes benchmark back in late December and was getting about 39FPS on average and it would show that I was 98-99% GPU bound on my stock 390X at 1080P Extreme settings. Running the updated benchmark yesterday and the new latest AMD drivers I now get 52FPS on Extreme settings at 1080P and it shows that I am just 40-50% GPU bound now. So it appears the CPU is getting worked much more in the most recent benchmark. These tests were all in DX12 and I have two X5690s at stock clocks for my cpu FWIW.

Azix · Feb 26, 2016

Silverforce11 said:
Well, AT got a 20% perf gain from Fury X with Async on vs off at 4K.

Toms got less. But it's very close in terms of % perf gained and % power use increase. Within margin of error.

Now that 390X at Toms, with no TDP limit, it looks as if its mining coins! lol

that MSI card has been using huge amounts of power since launch. Its the most demanding of the 390x cards I think. its just abusing the power delivery.

it was very close to the total system power consumption for other 390x cards like nitro.

sapphire 390x system consumption

http://hexus.net/tech/reviews/graphics/84194-sapphire-radeon-r9-390x-tri-x/?page=12

MSI 390x

https://www.techpowerup.com/reviews/MSI/R9_390X_Gaming/28.html

http://www.tomshardware.com/reviews/amd-radeon-r9-390x-r9-380-r7-370,4178-9.html

Mahigan · Feb 26, 2016

NVIDIA are more CPU bound than AMD under DX12. All those CPU threads are now batching work (command buffer) and NVIDIA's scheduler is static. So NVIDIA's driver is taking up more CPU time than AMDs.

DX12 does allow for multi-threaded rendering but if your hardware is using static scheduling, you're adding extra work for the CPU.

Glo. · Feb 26, 2016

Mahigan said:
NVIDIA are more CPU bound than AMD under DX12. All those CPU threads are now batching work (command buffer) and NVIDIA's scheduler is static. So NVIDIA's driver is taking up more CPU time than AMDs.

DX12 does allow for multi-threaded rendering but if your hardware is using static scheduling, you're adding extra work for the CPU.

Why is Nvidia hardware more CPU bound under DX12? Because there is no Hardware Scheduling.

It is simple as it can be.

Bacon1 · Feb 26, 2016

Mahigan said:
NVIDIA are more CPU bound than AMD under DX12. All those CPU threads are now batching work (command buffer) and NVIDIA's scheduler is static. So NVIDIA's driver is taking up more CPU time than AMDs.

DX12 does allow for multi-threaded rendering but if your hardware is using static scheduling, you're adding extra work for the CPU.

Yep, and just to clarify, you mean both are less bound than DX11, AMD is even more efficient than Nvidia due to additional hardware features.

Mahigan · Feb 26, 2016

crisium said:
Same with his bizarre Rise of the Tomb Raider Conclusion, as I posted about here. Criticizing only the Fiji cards for 4GB and then going to recommend the 4GB 980 over the cheaper, faster, and 8GB 390X.

Even when he tries to be fair, it's little illogical leaks like this that get through the cracks because he cannot help his predisposition. Some of his Nano shenanigans made me think I was on the Huffington Post.

Why did they say, between the R9 390x and GTX 980, that which ever one is cheaper should be the one you buy when everyone know the R9 390x is cheaper than the GTX 980? The R9 390x is not only cheaper but it performs better. It's a given that he shouldn't be recommending the GTX 980 at all. Basically the only NVIDIA card that should get a recommendation is well none. Neither the GTX 980 Ti or the Titan-x achieved a 30FPS minimum at 4K. Titan is at 20, which means they need to knock down the settings and the GTX 980 Ti is at 21. FuryX nets 28. At 1440p the FuryX nets a minimum of 35 FPS vs the GTX 980 Tis 29. The GTX 980 Ti has but a 0.5 FPS average advantage.

Clearly... No NVIDIA card should have been recommended. 0.

In Crossfire, they only tested the Fury, 390x, GTX 980, GTX 980 Ti. There is no FuryX. So they can't claim a VRAM issue without hard data. Where did they get that the VRAM is the culprit? I see 0 VRAM tests.

What a joke.

airfathaaaaa · Feb 26, 2016

https://semiaccurate.com/2016/02/24/looking-at-directx-12-performance-in-ashes-of-the-singularity/

Mahigan · Feb 26, 2016

As for NVIDIA's GTX 980 Ti under Ashes of the Singularity...

Its DX12 performance is near its DX11 performance. Which indicates that the GTX 980 Ti wasn't being held back under DX11. It is performing at its best, though drivers might boost it a bit.

Sontin is under the impression that a GTX 980 Ti is a faster GPU than a FuryX because of GCNs API overhead under DX11.

When that API overhead is lifted, the FuryX matches the GTX 980 Ti. When Async compute is added, the FuryX surpasses the GTX 980 Ti.

Nothing surprising here.

sontin · Feb 26, 2016

Krteq said:
Nope, nV itself tells the opposite - see Kepler and Maxwell whitepapers. Majority of scheduling tasks are performed in driver.

//And it's already explained in those AnandTech articles you are referring to

No, most of work happend on the GPU. What they moved back was scheduling of instructions into warps:

More importantly, the scheduling functions have been redesigned with a focus on power efficiency. For example: Both Kepler and Fermi schedulers contain similar hardware units to handle scheduling functions,
including,(a) register scoreboarding for long latency operations (texture and load), (b) inter-warp scheduling decisions (e.g.,pick the best warp to go next among eligible candidates), and (c) thread block level scheduling (e.g.,the GigaThread engine)

For Kepler, we realized that since this information is deterministic (the math pipeline latencies are not variable), it is possible for the compiler to determine up front when instructions will be ready to issue,
and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the pre-determined latency
information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

http://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf

flash-gordon · Feb 26, 2016

What I see is that nVidia designed their cards and drivers to perform exactly where it could under DX11 knowing the limitations of the API. They just don't needed 30k draw calls because no one focused on that.

GCN is an amazing arch, proved to be superior in a lot of cases, and it's changing games development. But betting on this hardware scheduler so soon under DX11 cost them a lot of performance (and market share), even if at the end Fiji will probably be faster than 980Ti.

I bet this advantage won't be here on the next gen, just like bandwidth problems are gone. It will be pure shader and ROPs brute force.

Mahigan · Feb 26, 2016

sontin said:
No, most of work happend on the GPU. What they moved back was scheduling of instructions into warps:

http://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf

You misunderstand. That's talking about scheduling from within an SMM. What schedules work to the SMM? If If you look at a Maxwell block shot, you'll notice a PCI Express interface, then the Gigathread Engine (a large Queue) and then several SMMs. So my question to you is, where's the Scheduler?

Now look at this Polaris Block

radeon-technologies-group-technology-summit-polaris-presentation-11.jpg

Notice a scheduler at the top?

And yes, each GCN compute unit also has a scheduler:

Glo. · Feb 26, 2016

Mahigan said:
You misunderstand. That's talking about scheduling from within an SMM. What schedules work to the SMM? If If you look at a Maxwell block shot, you'll notice a PCI Express interface, then the Gigathread Engine (a large Queue) and then several SMMs. So my question to you is, where's the Scheduler?

MegaThread Engine relies on drivers - eg. Software Scheduling.

There is no hardware scheduler on hardware level of Nvidia GPUs.

sontin · Feb 26, 2016

AMD describes something as "scheduler". It means they have a "scheduler" and nVidia not. You cant mean this serious, or?

Here are picture of Fermi, Kepler and Maxwell:

www.anandtech.com/show/2977/nvidia-s-geforce-gtx-480-and-gtx-470-6-months-late-was-it-worth-the-wait-/3

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2

http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/3

And this is the Gigathread Engine:

GigaThread
Thread Scheduler
One of the most important technologies of the Fermi architecture is its two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its execution units.

http://www.nvidia.com/content/pdf/f...dia_fermi_compute_architecture_whitepaper.pdf

Mahigan · Feb 26, 2016

I am serious, work is scheduled to the Gigathread engine by the NVIDIA driver. The NVIDIA driver re-orders grids, performs shader swaps etc at the driver level and then schedules the work to the Gigathread engine which holds the work in queue. AWS (Asynchronous Warp Schedulers) within each SMM grab work from the Gigathread engine and schedule it for execution by the various units within the SMM.

GCN fetches batches and schedules the work for execution by either the Command Processor or the ACE's. These units will execute the work to various units within the GPU. Work coming into a CU is then scheduled for execution by the SIMDs.

AMD GCN is a hardware scheduling architecture.
Maxwell is a Hybrid.

Fermi had a scheduler:

Glo. · Feb 26, 2016

In other words: AMD GCN can adapt itself to application without the interference of driver.

Nvidia hardware MUST use driver to have proper scheduling, because without it - MegaThread Engine has no clue what to do with the application.

sontin · Feb 26, 2016

Mahigan said:
I am serious, work is scheduled to the Gigathread engine by the NVIDIA driver. The NVIDIA driver re-orders grids, performs shader swaps etc at the driver level and then schedules the work to the Gigathread engine which holds the work in queue. AWS (Asynchronous Warp Schedulers) within each SMM grab work from the Gigathread engine and schedule it for execution by the various units within the SMM.

The GigaThread doesnt hold something in a queue. It is a pool and it schedules work from this pool to free compute units/cluster.

AMD GCN is a hardware scheduling architecture.
Maxwell is a Hybrid.

Maxwell is not hybrid. Stop making things up. Scheduling of Warps happens on the GPU through scheduling units.

Fermi had a scheduler:

Kepler and Maxwell, too. :\

Mahigan · Feb 26, 2016

sontin said:
The GigaThread doesnt hold something in a queue. It is a pool and it schedules work from this pool to free compute units/cluster.

Maxwell is not hybrid. Stop making things up. Scheduling of Warps happens on the GPU through scheduling units.

Kepler and Maxwell, too. :\

The Gigathread engine is not a processor. If it were, it would be called the "Gigathread Processor". It does not execute tasks, it waits for an available SMM to signal it for work.

GF114, owing to its heritage as a compute GPU, had a rather complex scheduler. Fermi GPUs not only did basic scheduling in hardware such as register scoreboarding (keeping track of warps waiting on memory accesses and other long latency operations) and choosing the next warp from the pool to execute, but Fermi was also responsible for scheduling instructions within the warps themselves. While hardware scheduling of this nature is not difficult, it is relatively expensive on both a power and area efficiency basis as it requires implementing a complex hardware block to do dependency checking and prevent other types of data hazards. And since GK104 was to have 32 of these complex hardware schedulers, the scheduling system was reevaluated based on area and power efficiency, and eventually stripped down.

The end result is an interesting one, if only because by conventional standards its going in reverse. With GK104 NVIDIA is going*back*to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the codes efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Keplers math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermis complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIAs compiler. In essence its a return to static scheduling.

Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN. At the same time however when it comes to graphics workloads even complex shader programs are simple relative to complex compute applications, so its not at all clear that this will have a significant impact on graphics performance, and indeed if it did have a significant impact on graphics performance we cant imagine NVIDIA would go this way.

Source: http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

Like I said, NVIDIAs driver re-orders grids (order of instructions in a warp). That's static scheduling. AMD, on the other hand, does this process in hardware.

computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

Senior member

Member

Lifer

Diamond Member

Member

Golden Member

Diamond Member

Member

Senior member

Golden Member

Senior member

Diamond Member

Diamond Member

Senior member

Senior member

Senior member

Diamond Member

Member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Senior member