ComputerBase: Ashes of the Singularity Beta 1 DirectX 12 Benchmarks

Feb 19, 2009
10,457
10
76
Oh, it's German. (I haven't read MS's official DX12 guide before.)

And reading that, it sounded so similar...

https://www.amd.com/Documents/Mantle-Programming-Guide-and-API-Reference.pdf

p. 15:

Modern GPUs have a number of different engines capable of executing in parallel ― graphics, compute, direct memory access (DMA) engine, as well as various multimedia engines. The basic building block for GPU work is a command buffer containing rendering, compute, and other commands targeting one of the GPU engines. Command buffers are generated by drivers and added to an execution queue representing one of the GPU engines, as shown in Figure 2. When the GPU is ready, it picks the next available command buffer from the queue and executes it. Mantle provides a thin abstraction of this execution model.

And the diagram/model is saying the same thing. No wonder Johan and the Epic guys were saying DX12's programming guide felt like "déjà vu," haha.
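Purely as an illustration of that execution model, here's a minimal DX12 sketch of the same flow the Mantle guide describes: record a command buffer (command list), then hand it to a queue that fronts one of the GPU engines. This isn't from either programming guide; the function name and structure are mine, and error handling is omitted.

Code:
// Record a command buffer and submit it to a GPU engine queue (DX12 flavour
// of the model quoted above). DIRECT = graphics engine; COMPUTE and COPY
// queues front the compute and DMA engines.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void SubmitOneCommandBuffer()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // A queue representing the graphics ("direct") engine.
    D3D12_COMMAND_QUEUE_DESC qd = {};
    qd.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&qd, IID_PPV_ARGS(&queue));

    // The "command buffer". One DX12 difference: the application records it,
    // rather than the driver generating it behind the scenes.
    ComPtr<ID3D12CommandAllocator> allocator;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocator));
    ComPtr<ID3D12GraphicsCommandList> cmdList;
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, allocator.Get(), nullptr,
                              IID_PPV_ARGS(&cmdList));
    // ... record draw / dispatch / copy commands here ...
    cmdList->Close();

    // Add it to the engine's execution queue; the GPU picks it up when ready.
    ID3D12CommandList* lists[] = { cmdList.Get() };
    queue->ExecuteCommandLists(1, lists);
}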
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Nonsense. nVidia never had the intention to use 20nm for their GPUs.

So Volta was the 20nm chip, or what? o_O

None of that was announced for Maxwell. You even posted the roadmaps, which show that HBM, NVLink and other features will be introduced with the 2016 architecture. :\

"Gamers are concerned"? Wow. AMD cards don't even support FL12_1, and yet nVidia users are concerned. :D

You have no clue what nVidia will improve or introduce, and yet you are making all these assumptions from nothing.

Fiji was supposed to be much better than GM200, and yet it gets beaten all over the place. Your postings don't make any sense. You sound like a guy from the marketing department of AMD. nVidia never said one word about Maxwell. The introduction was a huge surprise because nobody ever thought of the possibility of such a huge improvement on the same node.

Maybe you should stop writing about things you don't know instead of hyping AMD like there will be no tomorrow.
OK, I know that you dabble in apologetics, but everyone knows both AMD and Nvidia were working on 20nm parts. Maxwell v2 + Pascal - HBM2 was supposed to be Nvidia's 20nm part.

http://www.tweaktown.com/news/41666...pus-with-tsmc-making-its-16nm-gpus/index.html

"Where x is concerned" is a figure of speech; in that particular line it means "in the interest of gamers." You appear to have misinterpreted the phrase as "concerned gamers" (worried gamers).

And the rest of what you have written can be ignored.

Good day.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106

Mahigan

Senior member
Aug 22, 2015
573
0
0
One could hack in software support for a feature, like async compute for example, but when you go to use it, the hardware will fall flat on its face. AMD could probably do the same with the 12_1 part of DX12; it won't be very useful when one goes to use it. You can query the hardware and it will return yes to supporting something this way, but it won't be very useful (read: async compute on Nvidia hardware).
Exactly. While I don't doubt Nvidia's support for 12_1, it is true that "the nvidia driver exposed Async Compute as being present," but when Oxide went to use it, it was "an unmitigated disaster." Oh, and I didn't say what's in quotation marks, Oxide did.
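For anyone wondering what "query the hardware" actually looks like in DX12: there is a caps query for the FL12_1 features (conservative rasterization, ROVs), but there is no caps bit for async compute at all; creating a compute queue succeeds on any DX12 device regardless of what the silicon can overlap. A minimal sketch (my own, not from Oxide or the DX12 docs):

Code:
// Ask the driver what it claims to support. Note there is no
// "AsyncComputeSupported" field anywhere in the options struct.
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
using Microsoft::WRL::ComPtr;

void QueryCaps(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &opts, sizeof(opts));

    std::printf("Conservative rasterization tier: %d\n", (int)opts.ConservativeRasterizationTier);
    std::printf("ROVs supported: %d\n", (int)opts.ROVsSupported);

    // The only "async compute" knob the API gives you is an extra queue, and
    // creating one succeeds on every DX12 device; how it executes is up to
    // the hardware and driver.
    D3D12_COMMAND_QUEUE_DESC qd = {};
    qd.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    HRESULT hr = device->CreateCommandQueue(&qd, IID_PPV_ARGS(&computeQueue));
    std::printf("Compute queue created: %s\n", SUCCEEDED(hr) ? "yes" : "no");
}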
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
I am sure NVidia worked on 20nm parts and rejected them. But your link is nothing but a rumour.

The only solid info on 20nm scrapping and skipping is from AMD in their financials.
Nvidia created two SKUs from one SKU. What was to be a transition from Kepler to Maxwell to Volta became a transition from Kepler to Maxwell to Pascal to Volta.

What was to be the full Maxwell was split into two SKUs.

Nvidia has already announced the big changes coming with Pascal: on the compute side, FP16 at twice the rate of FP32 and FP64 at half or one-third the rate of FP32; NVLink for unified memory access in datacentres; and 3D stacked memory.

FP32 performance per clock and per core was increased from Kepler to Maxwell, so we've already gotten those increases. In that respect, the next increases will come with Volta.

Of course we'll get higher FP32 performance and a beefier front end with Pascal. This will be achieved by adding more units in each segment of the pipeline. We may see cache increases in order to deal with the additional units.

Big Pascal (GP100) may be up to two times better than Big Maxwell in certain gaming situations as a result (and faster in data-centre situations by a factor larger than 2x).

The wild card is Polaris. We know that it is up to 2.5x more powerful per watt than GCN3 (Fiji). That was not achieved by a simple die shrink (16nm FinFET alone only yields around 60%), so we know the difference also comes from the new units introduced with Polaris: a new command processor, CUs, geometry processors, caches, memory controller, etc.

So the question is, how big will big Polaris be? This will determine which architecture wins out.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
And with DX12, the question of how new uarchs will be supported is also one of the dark horses. It could be GCN 1.0/1.1 vs GCN 1.2 with Mantle in BF4 all over again. Or even worse.

There is no inherent reason that Mantle couldn't have worked well with GCN 1.2 but AMD simply didn't see any reason to put scarce resources into a deprecated API.

http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/12

With the Vulkan project having inherited and extended Mantle, Mantle’s external development is at an end for AMD. AMD has already told us in the past that they are essentially taking it back inside, and will be using it as a platform for testing future API developments. Externally then AMD has now thrown all of their weight behind Vulkan and DirectX 12, telling developers that future games should use those APIs and not Mantle.

[...]

The situation then is that in discussing the performance results of the R9 Fury X with Mantle, AMD has confirmed that while they are not outright dropping Mantle support, they have ceased all further Mantle optimization. Of particular note, the Mantle driver has not been optimized at all for GCN 1.2, which includes not just R9 Fury X, but R9 285, R9 380, and the Carrizo APU as well. Mantle titles will probably still work on these products – and for the record we can’t get Civilization: Beyond Earth to play nicely with the R9 285 via Mantle – but performance is another matter. Mantle is essentially deprecated at this point, and while AMD isn’t going out of their way to break backwards compatibility they aren’t going to put resources into helping it either. The experiment that is Mantle has come to an end.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
OK, I know that you dabble in apologetics, but everyone knows both AMD and Nvidia were working on 20nm parts. Maxwell v2 + Pascal - HBM2 was supposed to be Nvidia's 20nm part.

http://www.tweaktown.com/news/41666...pus-with-tsmc-making-its-16nm-gpus/index.html

This article is from December 2014, after Maxwell v2 was launched and nVidia had announced FinFET for Pascal. Obviously nVidia would skip 20nm for their GPUs... :\
Where is the information that nVidia announced Maxwell would be using 20nm?

"Where x is concerned" is a figure of speech; in that particular line it means "in the interest of gamers." You appear to have misinterpreted the phrase as "concerned gamers" (worried gamers).

The line is very clear to me. nVidia gamers would be concerned about Pascal because of your fan fiction. Yet AMD users don't care about missing features like CR, ROV...

You are creating stories upon stories to hype AMD, and you can't even back them up, like this:
it is true that "the nvidia driver exposed Async Compute as being present"

Tell us how we can ask the driver about it. With your insider information it should be easy to link to an article or a program.
 
Feb 19, 2009
10,457
10
76
Tell us how we can ask the driver about it. With your insider information it should be easy to link to an article or a program.

Oxide said it. Back then they also said Maxwell has issues with Async Compute, but folks like yourself buried your heads in the sand, refusing to see it.

Turns out they aren't capable of that function in DX12, so Oxide was right: NV had no business exposing that function in their drivers.

But since you believe you know best, even against veteran developers who are at the forefront of DX12...
 
Feb 19, 2009
10,457
10
76
"Proven to be correct"?

I compiled the GPUOpen Async Compute program and it runs flawlessly on my GTX980TI. Shouldn't be possible...

Do you even know what you are doing? That isn't a null test; i.e., it's not a case of the program being unable to run if you lack async compute acceleration.

https://github.com/GPUOpen-LibrariesAndSDKs/nBodyD3D12/tree/master/Samples/D3D12nBodyGravity

The NoAsync version will schedule all work onto the graphics queue and use fewer synchronization primitives.

The Async version will schedule all the simulation load onto the compute queue and all graphics onto the graphics queue, with synchronization between the queues.

To enable/disable asynchronous compute, change AsynchronousComputeEnabled in D3D12nBodyGravity.h.

The expected speed-up from the asynchronous version is around 10-15%.
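For context, the queue arrangement those notes describe boils down to something like the sketch below: simulation work goes on a compute queue, rendering on the graphics queue, with a GPU-side fence so the draw doesn't read particle data before the dispatch has produced it. This is my own simplification, not code from the sample; the names are made up and the command lists are assumed to be recorded already.

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void SubmitFrame(ID3D12Device* device,
                 ID3D12CommandQueue* gfxQueue,      // D3D12_COMMAND_LIST_TYPE_DIRECT
                 ID3D12CommandQueue* computeQueue,  // D3D12_COMMAND_LIST_TYPE_COMPUTE
                 ID3D12CommandList* simCmdList,     // pre-recorded simulation dispatches
                 ID3D12CommandList* drawCmdList,    // pre-recorded rendering
                 UINT64 frame)                      // monotonically increasing frame index (>= 1)
{
    static ComPtr<ID3D12Fence> fence;
    if (!fence)
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Kick the simulation on the compute queue and signal when it completes.
    computeQueue->ExecuteCommandLists(1, &simCmdList);
    computeQueue->Signal(fence.Get(), frame);

    // The graphics queue waits on the GPU (not the CPU) for that signal, then draws.
    gfxQueue->Wait(fence.Get(), frame);
    gfxQueue->ExecuteCommandLists(1, &drawCmdList);

    // Hardware that can overlap the queues may run this frame's compute under
    // the previous frame's graphics; hardware that can't simply serializes it.
}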

Edit: As a reminder, hardware not capable of parallel graphics + compute execution will just run the queues in serial. It doesn't crash; it just doesn't receive the performance benefit of that feature. Per info extracted from Fable Legends and Microsoft's GPUView (http://wccftech.com/asynchronous-compute-investigated-in-fable-legends-dx12-benchmark/2/).

Titan X is likely allowing the benchmark's request for async compute to go through, but those workloads are instead placed directly into the 3D render queue.

It would be disastrous if it couldn't fall back to serial operation; that would imply a hard crash scenario.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
Trust me, compiling is not so hard. :D
There is no performance or visual difference on my GTX980TI.
 

nvgpu

Senior member
Sep 12, 2014
629
202
81
https://forum.beyond3d.com/threads/intel-gen9-skylake.57204/page-6#post-1869935

From an API point of view, async compute is a way to provide an implementation with more potential parallelism to exploit. It is pretty analogous to SMT/hyper-threading: the API (multiple threads) are obviously supported on all hardware and depending on the workload and architecture it can increase performance in some cases where the different threads are using different hardware resources. However there is some inherent overhead to multithreading and an architecture that can get high performance with fewer threads (i.e. high IPC) is always preferable from a performance perspective.

When someone says that an architecture does or doesn't support "async compute/shaders" it is already an ambiguous statement (particularly for the latter). All DX12 implementations must support the API (i.e. there is no caps bit for "async compute", because such a thing doesn't really even make sense), although how they implement it under the hood may differ. This is the same as with many other features in the API.

From an architecture point of view, a more well-formed question is "can a given implementation ever be running 3D and compute workloads simultaneously, and at what granularity in hardware?" Gen9 cannot run 3D and compute simultaneously, as we've referenced in our slides. However what that means in practice is entirely workload dependent, and anyone asking the first question should also be asking questions about "how much execution unit idle time is there in workload X/Y/Z", "what is the granularity and overhead of preemption", etc. All of these things - most of all the workload - are relevant when determining how efficiently a given situation maps to a given architecture.

Without that context you're effectively in the space of making claims like "8 cores are always better than 4 cores" (regardless of architecture) because they can run 8 things simultaneously. Hopefully folks on this site understand why that's not particularly useful.

... and if anyone starts talking about numbers of hardware queues and ACEs and whatever else you can pretty safely ignore that as marketing/fanboy nonsense that is just adding more confusion rather than useful information.
https://forum.beyond3d.com/threads/intel-gen9-skylake.57204/page-7#post-1869983

Right so the bit people get confused with is that "I want multiple semantically async queues for convenience/middleware in the API" does *not* imply you need some sort of independent hardware queue resources to handle this, or even that they are an advantage. I hate to beat a dead horse here but it really is similar to multithreading and SMT... you don't need one hardware thread per software thread that you want to run - the OS *schedules* the software threads onto the available hardware resources and while there are advantages to hardware-based scheduling at the finer granularity, you're on thin ice arguing that you need any more than 2-3 hardware-backed "queues" here.

Absolutely, and that's another point that people miss here. GPUs are *heavily* pipe-lined and already run many things at the same time. Every GPU I know of for quite a while can run many simultaneous and unique compute kernels at once. You do not need async compute "queues" to expose that - pipelining + appropriate barrier APIs already do that just fine and without adding heavy weight synchronization primitives that multiple queues typically require. Most DX11 drivers already make use of parallel hardware engines under the hood since they need to track dependencies anyways... in fact it would be sort of surprising if AMD was not taking advantage of "async compute" in DX11 as it is certainly quite possible with the API and extensions that they have.

Now obviously I'm all for making the API more explicit like they have in DX12. But don't confuse that with mapping one-to-one with some hardware feature on some GPU. That's simply a misunderstanding of how this all works.

Yes, the scheduling is non-trivial and not really something an application can do well either, but GCN tends to leave a lot of units idle from what I can tell, and thus it needs this sort of mechanism the most. I fully expect applications to tweak themselves for GCN/consoles and then basically have that all undone by the next architectures from each IHV that have different characteristics. If GCN wasn't in the consoles I wouldn't really expect ISVs to care about this very much. Suffice it to say I'm not convinced that it's a magical panacea of portable performance that has just been hiding and waiting for DX12 to expose it.
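To make the "barriers on a single queue" point above concrete, here's a minimal sketch (mine, not Lauritzen's) of two dependent compute dispatches recorded on one queue. The ordering is expressed with a UAV barrier rather than a second queue plus a fence; independent dispatches on the same queue are still free to overlap. Root signature, PSO and UAV bindings are assumed to be set already, and the resource name is made up.

Code:
#include <d3d12.h>

void RecordDependentDispatches(ID3D12GraphicsCommandList* cmdList,
                               ID3D12Resource* particleBuffer) // UAV written by pass A, read by pass B
{
    // Pass A: e.g. integrate particle positions.
    cmdList->Dispatch(64, 1, 1);

    // UAV barrier: all writes to particleBuffer from pass A must complete and
    // be visible before pass B reads the buffer.
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
    barrier.UAV.pResource = particleBuffer;
    cmdList->ResourceBarrier(1, &barrier);

    // Pass B: e.g. build render data from the updated particles.
    cmdList->Dispatch(64, 1, 1);
}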
Don't ever bother arguing with these ADFs; they are clueless as hell and don't know anything except how to slander and spread FUD about things they don't understand.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
Thus making the point that it isn't running async.

Whereas AMD, which is running async, is seeing a 10-15% increase.

But how can he be <5% faster then?

Sure, AMD's cards are 10-15% faster with async, but look at what extremes they had to go to to get there.
The main differences are:

The CPU side threading has been removed.

The sample has slightly higher graphics load by using larger particles (increasing the load on the blend units) and using a noise function to modify the color of each particle (increasing the compute load).

The sample is now queuing up to 4 frames, instead of 2, making sure that the GPU is always filled with work.

The number of particles and the block size has been slightly increased to increase the amount of computation. It uses now 16384 particles and work groups of 256 threads each, resulting in 64 fully filled invocations. The original sample uses 10000 particles and work groups of 128 each.
Realistic for a gaming scenario????
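For what it's worth, the work-group numbers in that quoted list work out cleanly: 16384 particles at 256 threads per group is exactly 64 full groups, whereas the original 10000 at 128 per group needs 79 groups with the last one only partly filled. The Dispatch shape below is just my illustration of that arithmetic, not code from either sample.

Code:
#include <cstdio>

int main()
{
    // Numbers from the quoted notes (modified sample vs. original).
    const unsigned newParticles = 16384, newGroupSize = 256; // shader would declare [numthreads(256,1,1)]
    const unsigned oldParticles = 10000, oldGroupSize = 128; // shader would declare [numthreads(128,1,1)]

    // Groups dispatched = ceil(particles / threads per group).
    const unsigned newGroups = (newParticles + newGroupSize - 1) / newGroupSize; // 64, all full
    const unsigned oldGroups = (oldParticles + oldGroupSize - 1) / oldGroupSize; // 79, last one partly empty

    std::printf("modified sample: Dispatch(%u, 1, 1)\n", newGroups);
    std::printf("original sample: Dispatch(%u, 1, 1)\n", oldGroups);
    return 0;
}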
 

Dygaza

Member
Oct 16, 2015
176
34
101
Just curious, could someone post a precompiled test so those of us who have no idea about compiling could test it?
 

Paul98

Diamond Member
Jan 31, 2010
3,732
199
106
But how can he be <5% faster then?

Sure, AMD's cards are 10-15% faster with async, but look at what extremes they had to go to to get there.

Realistic for a gaming scenario????

I think you missed the part where he said "No difference at all."

What "Extremes" are these?

This is a very basic test showing async graphics + compute; don't read into it any further than that. They added the async functionality and increased the graphics and compute load so that the difference between async and noAsync is obvious.
 

Hitman928

Diamond Member
Apr 15, 2012
6,644
12,252
136
I think you missed the part where he said "No difference at all."

What "Extremes" are these?

This is a very basic test showing async graphics + compute; don't read into it any further than that. They added the async functionality and increased the graphics and compute load so that the difference between async and noAsync is obvious.

From the notes:

This sample shows how to take advantage of multiple queues and has not been tuned to maximize the overlap. In particular, the rendering is unchanged and rather cheap, which limits the amount of overlap where both the graphics queue and the compute queue can be busy. The expected speed-up from the asynchronous version is around 10-15%.

So 10-15% is the maximum you'd see from this test; it isn't tuned to squeeze the most out of the hardware, it's more of a proof of concept / working example.