computerbase: Ashes of the Singularity Beta 1 DirectX 12 Benchmarks


3DVagabond

Lifer
Aug 10, 2009
This is the key fact.

I suppose that it might be possible to design a shader unit that would be fully utilized with a certain set of instructions, but this is impossible for general purpose use.

The people criticizing asynchronous shaders by AMD are blindly ignoring this. You will always have less than 100% utilization over time. There will always be some free hardware resources available. Low overhead asynchronous operations allow the efficient use of such resources.

People know this. It's just rhetoric when they try and claim otherwise.
 

3DVagabond

Lifer
Aug 10, 2009
That's exactly what I said, as well as the AMD white paper:
GCN can't use all the available shaders with DX11/one queue while nvidia can, and that's why GCN gains a lot from async while nvidia doesn't.

While nVidia uses its shaders (CUs) for more operations, instead of having dedicated hardware like AMD does, it still doesn't use all available shaders. Some functions can't start until others are completed, and the shaders assigned to those waiting functions sit idle.
 

3DVagabond

Lifer
Aug 10, 2009
Ohhhhh, so AMD was charging you guys for "fixed function hardware" that was doing exactly nothing for the customer for years now...
What is this "fixed function hardware" being called? Any site that explains it?

Is this a rhetorical question?
 

Mahigan

Senior member
Aug 22, 2015
That's exactly what I said, as well as the AMD white paper:
GCN can't use all the available shaders with DX11/one queue while nvidia can, and that's why GCN gains a lot from async while nvidia doesn't.

NVIDIA also doesn't fully use its compute capabilities. This is why CUDA introduced Hyper-Q. See here: http://www.nvidia.com/content/kepler-compute-architecture/images/emeai/large-video-hyper-q-2-en.jpg

This is why even NVIDIA could gain from supporting concurrent execution of asynchronous compute + graphics.

Where NVIDIA have little to gain is from the CPU multi-threaded command buffer features of DX12. They already support most of these features under DX11.
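To make that concrete, here is a minimal D3D12 sketch (my own illustration, not something from this thread) of what "concurrent async compute + graphics" asks of an application: create a second, compute-only queue next to the usual direct queue. Whether the GPU actually overlaps work from the two queues is then up to the hardware and driver.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: one direct (graphics) queue plus one compute-only queue.
// Error handling omitted; the device is assumed to be created elsewhere.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));
}
```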
 

Spjut

Senior member
Apr 9, 2011
Where NVIDIA have little to gain is from the CPU multi-threaded command buffer features of DX12. They already support most of these features under DX11.

Is it really true Nvidia has very little to gain from that?

This test is one year old now, but the Star Swarm demo shows a huge difference between running in DX11 and DX12 on a GTX 980
http://www.anandtech.com/show/8962/the-directx-12-performance-preview-amd-nvidia-star-swarm/4


Perhaps I'm just confusing the different CPU related features in DX11 and DX12, but Nvidia released optimized drivers for Star Swarm DX11 and it still can't come close to its DX12 performance
 

PhonakV30

Senior member
Oct 26, 2009
I really hope AMD convinces Square Enix to bring heavy async compute to Hitman and show them (nvidia users) the true story behind async compute on Nvidia cards. Mahigan provides valid information, and I see strong signs of a lack of async compute in Nvidia cards.
 

Mahigan

Senior member
Aug 22, 2015
Is it really true Nvidia has very little to gain from that?

This test is one year old now, but the Star Swarm demo shows a huge difference between running in DX11 and DX12 on a GTX 980
http://www.anandtech.com/show/8962/the-directx-12-performance-preview-amd-nvidia-star-swarm/4


Perhaps I'm just confusing the different CPU related features in DX11 and DX12, but Nvidia released optimized drivers for Star Swarm DX11 and it still can't come close to its DX12 performance
Star Swarm is a draw call test. It tests two primary things:

1- The GPU's command processor
2- The CPU (multi-threading)

The test is fairly simple and the rendered tasks are quite simple.

The test doesn't test the compute capabilities of a GPU. It consists of rendering lots of simple 3D objects.
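For a rough picture of what that kind of workload looks like (a hypothetical sketch, not Oxide's actual code), think of a frame that issues thousands of tiny draws: the shading per object is trivial, so the CPU-side submission path and the GPU's command processor become the limit.

```cpp
#include <d3d12.h>

// Sketch of a draw-call-bound frame: many near-identical, cheap draws.
// The pipeline state, root signature and vertex/index buffers are assumed
// to be bound already; per-object constants would be set inside the loop.
void RecordManySmallDraws(ID3D12GraphicsCommandList* cmdList, unsigned objectCount)
{
    for (unsigned i = 0; i < objectCount; ++i)          // e.g. tens of thousands per frame
    {
        // set per-object root constants / CBV here
        cmdList->DrawIndexedInstanced(36, 1, 0, 0, 0);  // a tiny mesh per object
    }
}
```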

What you will notice under DX11 is that the GTX 980 has a 3x advantage over the R9 290X. There's another test, the 3DMark API Overhead test, where a GTX 980 has a 2x advantage over an R9 290X. Therefore we can conclude that under DX11 a GTX 980 can execute 2-3 times more draw calls than an R9 290X.

When you move to actual DX12 benchmarks, based on game engines, you don't see the GTX 980 pull ahead of the R9 290x like you see under Star Swarm.

Why?

Because NVIDIA's current architectures on the market can't render more draw calls and compute jobs than they already handle with the 2-3x advantage they enjoy under DX11.

Meaning that just because an API can unlock X number of draw calls, feeding the GPU's command processor far more efficiently, doesn't mean you'll end up with more FPS if your GPU can't render more than it was already rendering on the older API.

Pascal might enjoy a boost, but current NVIDIA architectures don't, as Ashes of the Singularity (GPU bound) and Fable Legends highlight. There might be cases where a boost could be observed, but those would be DX12 titles with little to no compute work (which I don't see happening) and which don't bottleneck other segments of the GPU pipeline (geometry processors, texture units, etc.).
 

3DVagabond

Lifer
Aug 10, 2009
I really hope AMD convinces Square Enix to bring heavy async compute to Hitman and show them (nvidia users) the true story behind async compute on Nvidia cards. Mahigan provides valid information, and I see strong signs of a lack of async compute in Nvidia cards.

It'll be portrayed as a Gameworks move. They shouldn't have to do that. If their uarch is better for the API, which it's supposed to be, then they won't have to do anything for it to show. I'm more concerned with nVidia blocking the use.
 

Mahigan

Senior member
Aug 22, 2015
No, it's just fanfiction. He doesn't know anything and he is just creating false claims.

nvidia will benefit from multi-threaded rendering because DX11 or OpenGL doesn't allow for more than two threads at best. With Vulkan or DX12 you can use as many as you want. The API is designed to fill command queues from different threads. Anybody will benefit from it.
Explain this...

[benchmark charts: GCN vs. NVIDIA GPUs at increasing resolutions]


Why do all the GCN GPUs gain over their respective NVIDIA counterparts as the resolution increases?

Simple, we move from being CPU bound to GPU bound.

Why is GCN CPU bound at lower resolutions?

Hint: Has to do with Star Swarm and what I've been talking about.
 

Paul98

Diamond Member
Jan 31, 2010
Expect AMD to benefit from DX12 quite a bit more than NVidia in the near future, thanks to better CPU usage and async compute compared with DX11.

Should be good for the health of PC gaming.
 

Erenhardt

Diamond Member
Dec 1, 2012
Expect AMD to benefit from DX12 quite a bit more than NVidia in the near future, thanks to better CPU usage and async compute compared with DX11.

Should be good for the health of PC gaming.

Actually, not entirely true. It will be yet another boost injection for old GCN hardware (7970 FTW!). It means these old GPUs will serve their owners a little longer than their already retired competition. Which means amd will miss another sale compared to nvidia.


Someone who bought gtx680 already upgraded to 780 and later to 970 and will upgrade again to pascal for dx12.

Someone who bought 7970 a couple of years ago will be running dx12 games through 2016 on it
 

Glo.

Diamond Member
Apr 25, 2015
Actually, not entirely true. It will be yet another boost injection for old GCN hardware (7970 FTW!). It means these old GPUs will serve their owners a little longer than their already retired competition. Which means amd will miss another sale compared to nvidia.


Someone who bought gtx680 already upgraded to 780 and later to 970 and will upgrade again to pascal for dx12.

Someone who bought 7970 a couple of years ago will be running dx12 games through 2016 on it

Unless Polaris is absolutely amazing and turns the tables a bit.
 

sontin

Diamond Member
Sep 12, 2011
Explain this...

Why do all the GCN GPUs gain over their respective NVIDIA counterparts as the resolution increases?

Simple, we move from being CPU bound to GPU bound.

Why is GCN CPU bound at lower resolutions?

Hint: Has to do with Star Swarm and what I've been talking about.

Because their DX11 is worse than nVidia's, and nVidia's architecture isn't geometry bound like AMD's.

And yet his so-called fanfiction is backed by quality sources and yours has the backing that some guy on the internet thought of it. Interesting.

"quality sources"? You mean this insider guy from oxide? Haha.

If you think nVidia won't benefit from multi-threaded rendering, then explain why this demo (Vulkan Threaded Rendering) scales over all 8 threads of my CPU:
https://developer.nvidia.com/vulkan-android#samples
 

Paul98

Diamond Member
Jan 31, 2010
Because their DX11 is worse than nVidia's, and nVidia's architecture isn't geometry bound like AMD's.



"quality sources"? You mean this insider guy from oxide? Haha.

If you think nVidia won't benefit from multi-threaded rendering, then explain why this demo (Vulkan Threaded Rendering) scales over all 8 threads of my CPU:
https://developer.nvidia.com/vulkan-android#samples

If NV has a smaller CPU bottleneck than AMD, then they have less potential gain.
 

sontin

Diamond Member
Sep 12, 2011
nVidia can push between 1.5x and 2x more draw calls with DX11. But this is just for a single core and maybe two cores. When a developer uses Vulkan/DX12 with real multi-threaded rendering for more draw calls, nVidia will fly, too.
 

Azix

Golden Member
Apr 18, 2014
Actually, not entirely true. It will be yet another boost injection for old GCN hardware (7970 FTW!). It means these old GPUs will serve their owners a little longer than their already retired competition. Which means amd will miss another sale compared to nvidia.


Someone who bought gtx680 already upgraded to 780 and later to 970 and will upgrade again to pascal for dx12.

Someone who bought 7970 a couple of years ago will be running dx12 games through 2016 on it

or sensible consumers will look at the AMD cards and realize their money will go further if they buy AMD next time around.
 

AtenRa

Lifer
Feb 2, 2009
nVidia can push between 1.5x and 2x more draw calls with DX11. But this is just for a single core and maybe two cores. When a developer uses Vulkan/DX12 with real multi-threaded rendering for more draw calls, nVidia will fly, too.

Without Async Compute those extra cores will go to waste. You will only get the higher draw call benefit from DX12, and only that. No Async Compute, no parallel rendering inside the GPU.
 

sontin

Diamond Member
Sep 12, 2011
Without Async Compute those extra cores will go to waste. You will only get the higher draw call benefit from DX12, and only that. No Async Compute, no parallel rendering inside the GPU.

Great example of fanfiction. :thumbsup:
Async Compute has nothing to do with multi-threaded rendering.
Async Compute is the execution of different queues concurrently on the GPU.
Multi-threaded rendering is the ability of the API to fill one or more command queues with command buffers from different threads.
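A short D3D12-flavoured sketch of that distinction (illustrative only; queue, allocator and command list creation are assumed to happen elsewhere): several CPU threads record their own command lists, which all end up on the same graphics queue. That is multi-threaded rendering. Async compute would instead mean feeding a second, compute-type queue, as in the earlier queue-creation sketch.

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>

// Multi-threaded rendering: N threads record N command lists in parallel,
// then everything is executed on a single (graphics) queue.
void RecordAndSubmitMultiThreaded(ID3D12CommandQueue* graphicsQueue,
                                  std::vector<ID3D12GraphicsCommandList*>& lists)
{
    std::vector<std::thread> workers;
    for (ID3D12GraphicsCommandList* cl : lists)
        workers.emplace_back([cl] {
            // ... record this thread's draw commands here ...
            cl->Close();
        });
    for (std::thread& t : workers)
        t.join();

    // One queue receives all the work, no matter how many CPU threads recorded it.
    std::vector<ID3D12CommandList*> raw(lists.begin(), lists.end());
    graphicsQueue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
}
```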
 

TheELF

Diamond Member
Dec 22, 2012
Explain this...

Why do all the GCN GPUs gain over their respective NVIDIA counterparts as the resolution increases?

Simple, we move from being CPU bound to GPU bound.

Why is GCN CPU bound at lower resolutions?

Hint: Has to do with Star Swarm and what I've been talking about.
GCN is always GPU limited...
Let's take the Fury X vs the Titan X as an example: the Fury X has 4,096 shading units and the Titan X has 3,072, but the Titan X does 96 GPixel/s while the Fury X does 67.2 GPixel/s.
To make it simpler: Nvidia has faster but fewer "cores", AMD has more but slower "cores".

No matter what you do you can't make the cores go faster (except with OC, but that's irrelevant to this conversation),
but you can overtax the GPU, by running at 4K for example. You don't need speed anymore, since even with the X's you only run at ~30 FPS. It's the same thing as with GameWorks: you can't gain speed by using insane levels of tessellation, but you sure can make the other team look even slower than your own cards.
And it's the same with whatever the Ashes benchmark is doing: it's not making the game faster, it just renders more stuff.
 

AtenRa

Lifer
Feb 2, 2009
Great example of fanfiction. :thumbsup:
Async Compute has nothing to do with multi-threaded rendering.
Async Compute is the execution of different queues concurrently on the GPU.
Multi-threaded rendering is the ability of the API to fill one or more command queues with command buffers from different threads.

I'm not talking about graphics only, but Compute and Copy, when I'm talking about parallel rendering.

Without Async Compute, the multiple cores will only help you parallelize the graphics rendering. So you will gain some performance, but not what you would get if you could also run Compute and Copy in the same cycle.

So Maxwell may utilize 8 or 12 or even 32 CPU threads, but those are utilized for a single job at a time; it will be Graphics OR Compute OR Copy, not all three at the same time, because it doesn't have Async Compute.

That is why you can see your GPU scale with all your CPU cores, but the GPU is only working on one job at a time.
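For completeness, here is what the copy side of that looks like in D3D12 (a hypothetical sketch; the resources and a copy-type command list/allocator are assumed to be created elsewhere): transfer work can be recorded into its own COPY-type command list and submitted to its own copy queue, separate from the direct and compute queues.

```cpp
#include <d3d12.h>

// Sketch: pure transfer work on a dedicated copy queue.
// 'copyList' is assumed to have been created with D3D12_COMMAND_LIST_TYPE_COPY.
void QueueUpload(ID3D12CommandQueue* copyQueue,
                 ID3D12GraphicsCommandList* copyList,
                 ID3D12Resource* dst, ID3D12Resource* src, UINT64 numBytes)
{
    copyList->CopyBufferRegion(dst, 0, src, 0, numBytes);  // transfer only
    copyList->Close();

    ID3D12CommandList* lists[] = { copyList };
    copyQueue->ExecuteCommandLists(1, lists);
}
```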
 

sontin

Diamond Member
Sep 12, 2011
Without Async Compute, the multiple cores will only help you parallelize the graphics rendering. So you will gain some performance, but not what you would get if you could also run Compute and Copy in the same cycle.

So Maxwell may utilize 8 or 12 or even 32 CPU threads, but those are utilized for a single job at a time; it will be Graphics OR Compute OR Copy, not all three at the same time, because it doesn't have Async Compute.

That is why you can see your GPU scale with all your CPU cores, but the GPU is only working on one job at a time.

GPUs can only execute what gets sent to them. And before a developer sends anything to the GPU, the command buffer needs to be filled with commands.

Multi-threaded rendering is part of the pre-GPU-execution process.
 

AtenRa

Lifer
Feb 2, 2009
GPUs can only execute what gets sent to them. And before a developer sends anything to the GPU, the command buffer needs to be filled with commands.

Multi-threaded rendering is part of the pre-GPU-execution process.

Yes, and with Async Compute you can send BOTH Graphics AND Compute AND Copy to the GPU at the same time, whereas without Async Compute you will only send one of them at a time.

But you need lots of CPU threads in order to do that. That is why you will not get that much benefit from multi-core CPUs (multi-threaded rendering) on Maxwell: at the end of the day the GPU will only render one job at a time, either Graphics OR Compute OR Copy.
 

Mahigan

Senior member
Aug 22, 2015
GCN is always GPU limited...
Let's take the Fury X vs the Titan X as an example: the Fury X has 4,096 shading units and the Titan X has 3,072, but the Titan X does 96 GPixel/s while the Fury X does 67.2 GPixel/s.
To make it simpler: Nvidia has faster but fewer "cores", AMD has more but slower "cores".

No matter what you do you can't make the cores go faster (except with OC, but that's irrelevant to this conversation),
but you can overtax the GPU, by running at 4K for example. You don't need speed anymore, since even with the X's you only run at ~30 FPS. It's the same thing as with GameWorks: you can't gain speed by using insane levels of tessellation, but you sure can make the other team look even slower than your own cards.
And it's the same with whatever the Ashes benchmark is doing: it's not making the game faster, it just renders more stuff.
GCN is not always GPU limited.

Theoretically speaking, GCN cards have a lower fill rate than GM200 cards, but games do not highlight this. You can see it as the resolution goes up in DX11 games: with more pixels on the screen, GCN cards catch up to their respective competitors.

Take the Tech Report's synthetic tests here for example:
http://techreport.com/review/28513/amd-radeon-r9-fury-x-graphics-card-reviewed/4

[chart: pixel fill rate synthetic test]
The FuryX achieves 64 GPixels/s out of a theoretical 67 GPixels/s. A GTX 980 Ti achieves 85 GPixels/s out of a theoretical 95 GPixels/s and the Titan X achieves 95 GPixels/s out of a theoretical 103 GPixels/s.

Theoretically, GCN is at a disadvantage, which should translate into GM200 cards pulling away from GCN cards as the resolution of a game rises, but we don't see this happening at all; we actually see the opposite (meaning that pixel throughput is not a bottleneck for GCN vs GM200).

As for tessellation, it also makes use of the pixel pipelines. Each tessellation unit (geometry processor) has dedicated hardware for vertex fetch, tessellation, and coordinate transformations. These operate with the raster engines, which transform newly tessellated triangles into a fine stream of pixels for shading. In other words, as the resolution rises you need more pixel throughput, because you end up with more triangles.

If tessellation was the main bottleneck, GM200 would pull away from GCN as the resolution rises. But we don't see that happening in a majority of DX11 titles.

Take Rise of the Tomb Raider for example:
[chart: Rise of the Tomb Raider, 1600x900]
At 1600x900, look at how the GCN cards fare relative to the GM200 cards. First, the R9 390X at 79.4 FPS vs a GTX 980 at 85.3 FPS: a 6 FPS lead for the GTX 980. Now let's compare a FuryX at 87.6 FPS vs a GTX 980 Ti at 105.5 FPS: a near 18 FPS lead for the GTX 980 Ti.

Now lets move to 2560x1440:
[chart: Rise of the Tomb Raider, 2560x1440]
At this resolution the R9 390X has a 1.7 FPS lead over the GTX 980, and that 18 FPS lead the GTX 980 Ti had over the FuryX? Down to 1.1 FPS.

So what happened? Well, as the resolution rises we become GPU bound, rather than CPU bound as at lower resolutions.

If GCN were GPU bound at lower resolutions, then as the resolution scaled higher, GM200 would continue to maintain a large lead over GCN.

Some people claim that memory bandwidth plays a role except that...
[chart: memory bandwidth test]
it doesn't. In theory, yes, but not in practice, due to the large L2 cache found on GM200 coupled with its superior color compression algorithms.

As for GCN having more, yet slower cores, that's not true (speaking clock for clock and core for core). It has to do with the types of shaders (short or long) fed to both GCN and GM200. GCN likes long-running shaders while GM200 likes short-running shaders. Another determining factor is utilization. GCN is a wider architecture and requires a higher degree of parallelism in order to make use of its full compute capabilities. What do I mean by wider architecture?

On GCN:
Each CU is composed of 64 SIMD cores which can execute 40 wavefronts concurrently and each wavefront is composed of 64 threads. So that's 2,560 threads per 64 SIMD cores. For Fiji, with its 64 CUs, that's a total of 163,840 threads executing concurrently.
Source: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

On GM200:
Each SMM is composed of 128 SIMD cores and can execute 64 warps concurrently and each warp is composed of 32 threads. So that's 2,048 threads per 128 SIMD cores. For a GTX 980 Ti, with its 22 SMMs, that's a total of 45,056 threads executing concurrently.
Source: https://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/
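Putting the two quoted figures side by side, here is the same arithmetic as a quick sketch (the per-CU/per-SMM numbers are taken from the two paragraphs above as quoted, not re-verified):

```cpp
// GCN (Fiji / Fury X), figures as quoted above
constexpr int gcnWavesPerCU  = 40;   // wavefronts in flight per CU
constexpr int gcnWaveWidth   = 64;   // threads per wavefront
constexpr int fijiCUs        = 64;   // CUs on Fiji
constexpr int gcnThreads     = gcnWavesPerCU * gcnWaveWidth * fijiCUs;

// Maxwell (GM200 / GTX 980 Ti), figures as quoted above
constexpr int mxwWarpsPerSMM = 64;   // warps in flight per SMM
constexpr int mxwWarpWidth   = 32;   // threads per warp
constexpr int gm200SMMs      = 22;   // SMMs on a GTX 980 Ti
constexpr int mxwThreads     = mxwWarpsPerSMM * mxwWarpWidth * gm200SMMs;

static_assert(gcnThreads == 163840, "matches the 163,840 figure in the post");
static_assert(mxwThreads == 45056,  "matches the 45,056 figure in the post");
```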

GCN therefore has fewer SIMD cores dedicated to doing more work; GM200 has more SIMD cores dedicated to doing less work.

So if you're feeding your GPU with small amounts of compute work items, GM200 will come out on top. If you're feeding your GPU with large amounts of compute work, GCN will come out on top.

That is why GCN stands to benefit more from asynchronous compute + graphics than GM200 does: GCN has more threads idling than GM200. So long as you feed both architectures with optimized code, they perform as expected:
[benchmark charts]


If anything, GCN has superior compute cores compared to GM200:
[chart: ALU instruction latency]


What is ALU latency? The number of cycles the ALU takes to process a MADD operation; in other words, the performance of a SIMD unit. GCN is around 4 cycles while GM200 is just over 5. This behavior shocked folks over at Beyond3D when the async compute controversy was in full force, see below...

Ext3h:
GCN has only 4 cycles latency for the simple SP instructions. Which I already accounted for, at least for "pure" SP loads, not to mention that this is also the minimum latency for all instructions on GCN and GCN also features a much larger register file.

The 6 cycle SP latency for Maxwell however is ... weird. I actually thought that Maxwell had a LOWER latency for primitive SP instructions than GCN, but the opposite appears to be the case???
Source: https://forum.beyond3d.com/posts/1871515/

NVIDIA has made some large advances with GM200 over Kepler, but it's still behind on per-SIMD performance. It's just that GM200 dedicates more of its weaker SIMDs to doing smaller compute work items than GCN (hence the NVIDIA recommendation of large batches of short-running shaders). Ext3h made the same recommendation here: http://ext3h.makegames.de/DX12_Compute.html
For a safe bet, go with the batched approach recommended for Nvidia hardware:

Choose sufficiently large batches of short running shaders. Long running shaders can complicate scheduling on Nvidia's hardware. Ensure that the GPU can remain fully utilized until the end of each batch. Tune this for Nvidia's hardware; AMD will adapt just fine.
also recommended by NVIDIA here: https://developer.nvidia.com/dx12-dos-and-donts
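As a rough sketch of what that advice means in D3D12 terms (my own illustration; the compute PSO, root signature and a compute-type command list are assumed to exist already): record many small, short-lived dispatches into one command list and submit the whole batch in one go, instead of one submission per dispatch.

```cpp
#include <d3d12.h>

// Sketch: batch many short-running compute dispatches into a single submission.
// 'cl' is assumed to have been created with D3D12_COMMAND_LIST_TYPE_COMPUTE.
void SubmitComputeBatch(ID3D12GraphicsCommandList* cl,
                        ID3D12CommandQueue* computeQueue,
                        ID3D12PipelineState* shortRunningPso,   // cheap compute shader (assumed)
                        ID3D12RootSignature* rootSig,           // assumed
                        unsigned dispatchesPerBatch)
{
    cl->SetPipelineState(shortRunningPso);
    cl->SetComputeRootSignature(rootSig);

    for (unsigned i = 0; i < dispatchesPerBatch; ++i)
        cl->Dispatch(8, 8, 1);                        // many small, short-lived dispatches

    cl->Close();
    ID3D12CommandList* lists[] = { cl };
    computeQueue->ExecuteCommandLists(1, lists);      // one batched submission
}
```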

The problem is that game titles are mostly sponsored by NVIDIA, and the code they run tends to be optimized for NVIDIA. These games don't tend to be compute heavy at all, so when it comes to smaller compute work items, GM200 can come out on top.

In the professional realm, CUDA optimization and documentation far surpass those of OpenCL.

The software makes the difference in this case.
 