Ashes of the Singularity User Benchmarks Thread


tential

Diamond Member
May 13, 2008
7,348
642
121
Found some very pertinent info relating to context switches and why they may incur a performance hit (as mentioned by @Zlatan a while ago; some members here attacked him personally, which is very shameful). This is directly from NV:

https://developer.nvidia.com/sites/...works/vr/GameWorks_VR_2015_Final_handouts.pdf

p31



Basically the pipeline is in-order, serial. Here they are referring to Async Shaders to perform timewarp:



So I think I understand now why game devs describe Kepler/Maxwell as in-order, and GCN as out-of-order (stateless, no context switch necessary).

This can cause issues if developers do not optimize or design their engines around the limitations of NV's uarch when they use Async Shaders on NV GPUs. I believe this is what we are seeing with the performance drop on DX12 in Ashes.

As for how this impacts their VR latency: timewarp uses async shaders to quickly re-render the last frame in response to user movement. That frame needs a priority context, or else it lags. The problem for NV's uarch is that it isn't true priority timewarp, because it still has to wait for the traffic in front to be cleared (finish rendering the last frame first!), so NV suggests developers break their draw calls up into many smaller batches to avoid long delays that can cause stutter/perf drops. If devs/engines do this, it reduces the latency when async shaders are used; if they don't, we get problems.
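To make that concrete, here's a rough sketch of the idea (my own illustration, not something from the PDF; splitting an instanced draw and the 256 batch size are just made-up examples of "more, smaller draw calls"):

Code:
#include <windows.h>
#include <d3d12.h>
#include <algorithm>

// Hypothetical helper: issue one logical draw as several smaller draw calls.
// Every draw-call boundary is a point where the GPU could switch to a
// higher-priority context (e.g. async timewarp) instead of sitting behind
// one long-running draw.
void DrawInSmallBatches(ID3D12GraphicsCommandList* cmdList,
                        UINT indexCountPerInstance,
                        UINT instanceCount)
{
    const UINT kChunk = 256; // made-up batch size; a real engine would tune this

    for (UINT first = 0; first < instanceCount; first += kChunk)
    {
        UINT count = std::min(kChunk, instanceCount - first);
        cmdList->DrawIndexedInstanced(indexCountPerInstance, count,
                                      0,      // StartIndexLocation
                                      0,      // BaseVertexLocation
                                      first); // StartInstanceLocation
    }
}

The trade-off is extra draw-call overhead, which is why it's a balancing act rather than "split everything as finely as possible".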


So essentially, it's easier to code for async shaders in DX12, and coding for Nvidia hardware requires extra work? However, the extra work provides benefits to both Nvidia and AMD?

So in the end, once they fully optimize their game, both vendors will get the benefits?
 
Silverforce11

Feb 19, 2009
10,457
10
76
So essentially, it's easier to code for async shaders in DX12, and coding for Nvidia hardware requires extra work? However, the extra work provides benefits to both Nvidia and AMD?

So in the end, once they fully optimize their game, both vendors will get the benefits?

Extra work only benefits NV. Since GCN is out-of-order & stateless, it doesn't care about the traffic in front slowing it down.

NV will have to work with devs to ensure their implementation of async shaders does not cause major stalls in the pipeline. But it's not a 100% fix; it just reduces the negative impact rather than removing it. Pascal will remove it if they move to an out-of-order, stateless pipeline uarch (which, IIRC, Zlatan did say Pascal would be much better on this front, as well as having fine-grained preemption capability like GCN does).
 
Last edited:

tential

Diamond Member
May 13, 2008
7,348
642
121
You just said it can reduce the latency when async shaders are used? Doesn't that not benefit AMD?

Or did I not understand that?
 
Silverforce11

Feb 19, 2009
10,457
10
76
You just said it can reduce the latency when async shaders are used? Doesn't that not benefit AMD?

Or did I not understand that?

In the context of NV, you should read the PDF. Async compute has a latency penalty on NV's pipeline because it's in-order: it has to wait for the rendering to finish before the compute job is handled. So if a draw call for the frame render takes 5ms, a compute call has to WAIT 5ms before it can commence. That's 5ms of latency. If devs optimize their engines so they submit a ton of smaller draw calls, instead of one 5ms chunk they split it into 5x 1ms chunks, and async compute can squeeze in at each 1ms interval rather than waiting 5ms. So they reduce the latency from 5ms to 1ms, as an example.

GCN doesn't care about context or in-order operations; it's stateless (no context switch required for shaders to do rendering/compute) and it's out-of-order. So async compute on GCN adds 0ms latency.

This is why LiquidVR was said to achieve 10ms motion-to-photon output, because there is no latency overhead.
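Quick toy model of that 5ms vs 5x 1ms example (the numbers are just the ones above, nothing measured, and it assumes the compute job can only start at a draw-call boundary):

Code:
#include <cstdio>

int main()
{
    const double frameWorkMs = 5.0; // example figure from the post above
    const int drawCounts[] = {1, 5, 50};

    for (int draws : drawCounts)
    {
        double drawMs  = frameWorkMs / draws;
        double worstMs = drawMs;       // job arrives just after a draw starts
        double avgMs   = drawMs / 2.0; // job arrives at a random point in a draw
        std::printf("%2d draws x %.2f ms -> worst-case wait %.2f ms, average %.2f ms\n",
                    draws, drawMs, worstMs, avgMs);
    }
    return 0;
}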

This is my understanding of it from the documents; I could be wrong. We'll need input from qualified people to be more certain. @Zlatan?

Edit: What isn't in doubt is that devs need to be very careful in their DX12 engines, or they could degrade performance on NV when they use async compute or shaders.
 
Last edited:

tential

Diamond Member
May 13, 2008
7,348
642
121
Ok, I had thought the comment had applied to both AMD/Nvidia. Makes more sense now thanks.
 

shady28

Platinum Member
Apr 11, 2004
2,520
397
126
Maybe come back and revisit the subject when the game doesn't crash Nvidia's video drivers. Yes, this is the video driver crashing over and over; I had to do a hard reboot:

[screenshot: Nvidia driver crash]



This was the first benchmark run on my 960:

DX12: [benchmark screenshot]

DX11: [benchmark screenshot]
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
In the context of NV, you should read the PDF. Async compute has a latency penalty on NV's pipeline because it's in-order: it has to wait for the rendering to finish before the compute job is handled.

I think you're extrapolating a bit much here. I looked at the PDF, and nowhere does it mention in-order pipelines or anything serial. You added those terms yourself.

Latency is a big deal with VR because of player head movement, and AMD's ACEs help out a lot, I'm sure. But in regular gaming, asynchronous compute won't be anywhere near as latency-sensitive as VR, for obvious reasons.

You guys keep wanting to portray NVidia's pipeline as serial and lacking parallelism, but this runs counter to everything we know about modern GPUs. GPUs make their bread and butter by being as parallel as possible.

Regarding context switches, they are done in hardware and many carry no penalties at all if they are done within a kernel. GPU designers have been dealing with context switches for years, for example with PhysX, which requires CUDA, and for the most part even that was fairly successful.

Asynchronous compute is native to DX12, so even if it somehow does require context switching (which I doubt), there should be little or no penalty.

It's the same reasoning that led NVidia to use DirectCompute for HairWorks instead of CUDA: DirectCompute is native to DX11 and doesn't require context switching.
 
Silverforce11

Feb 19, 2009
10,457
10
76
All our GPUs for the last several years do context switches at draw call boundaries. So when the GPU wants to switch contexts, it has to wait for the current draw call to finish first.

So, even with timewarp being on a high-priority context, it’s possible for it to get stuck behind a long-running draw call on a normal context. For instance, if your game submits a single draw call that happens to take 5 ms, then async timewarp might get stuck behind it, potentially causing it to miss vsync and cause a visible hitch.

NV is clearly telling devs how to optimize for their in-order pipeline, and here you are, Carfax83, saying NOOOO!

I guess if this coming straight from NV still fails to convince you, then nothing will. Nuff said. You are convinced of your preferred GPU's superiority, despite evidence from NVidia themselves stating there are potential issues with it running async compute.

ps. HairWorks is via Tessellation; TressFX is via DirectCompute. Funny you bring up GPU PhysX; it explains why fps performance suffers badly when it's used in games.

Also, this statement, "GPUs make their bread and butter by being as parallel as possible," doesn't mean anything. GPUs have been parallel for rendering for a long time; everyone knows that. The current discussion is whether GPUs, and in particular Maxwell, can be parallel with simultaneous rendering + compute tasks. We know Kepler cannot; NV have said so themselves.
 
Last edited:

jj109

Senior member
Dec 17, 2013
391
59
91
NV is clearly telling devs how to optimize for their in-order pipeline, and here you are, Carfax83, saying NOOOO!

I guess if this coming straight from NV still fails to convince you, then nothing will. Nuff said. You are convinced of your preferred GPU's superiority, despite evidence from NVidia themselves stating there are potential issues with it running async compute.

ps. HairWorks is via Tessellation; TressFX is via DirectCompute. Funny you bring up GPU PhysX; it explains why fps performance suffers badly when it's used in games.

Calm down. Your interpretation is tainted by the speculation that's been spread around recently. GCN has graphics contexts as well, and can only run one at a time.

The next logical thing to discuss is Context Switching, which is business as usual for a GPU, and its purpose is to keep the GPU utilization as high as possible. The reason for this is because GCN is an In-Order processor (instructions are fetched, executed & completed in the order they are issued. If an instruction stalls, it causes other instructions ‘behind it’ to stall also), and thus ensuring the pipeline is running smoothly is critical for best performance.
The typical desktop GCN architecture according to documentation processes a single context at a time. Naturally (as we’ve just seen above), it’s possible to operate on multiple, but to do this you’ll need to run them serially and context switch. GCN also processes compute, and can process contexts based on the number of ACEs that are available. Typically, in CPUs, the length of time an operation takes is known as a ‘Time Slice’. The length of each time slice can be critical to balancing system performance vs process responsiveness – if the time slice is too short then the scheduler will consume too much processing time, but if the time slice is too long, processes will take longer to respond to input.
So in summary: both Maxwell and GCN are able to run many compute contexts, but only one graphics context. Nvidia is referring to the singular graphics context, which has control over all the hardware.

[diagram from the article]


Which raises the question: how is AMD claiming to have fully async time warp using ACEs if the ACEs can't access the full graphics pipeline? Doesn't the command processor still have to take over for the output? The command processor only has 1 queue. What happens when there's a draw call tying it up?
 
Silverforce11

Feb 19, 2009
10,457
10
76
"The reason for this is because GCN is an In-Order processor (instructions are fetched, executed & completed in order they are issued. If an instruction stalls, they it causes other instructions ‘behind it’ to stall also), and thus ensuring the pipeline is running smoothly is critical for best performance."

First time I seen anyone say this of GCN.

Would be impossible for AMD's VR implementation.

But the article did have this to say:

"We can speculate that there’s a good chance Microsoft did indeed customize the Command Processor(s) to an extend, because of a quote from one of the Xbox One’s architects, Andrew Goossen with EuroGamer: “We also took the opportunity to go and highly customise the command processor on the GPU."

“In particular, compute tasks can leapfrog past pending rendering tasks, enabling low-latency handoffs between CPU and GPU.”

By that definition, GCN is no longer an in-order pipeline uarch.

Sony also customized GCN for the purpose of strong Async Compute performance (as noted by the links already posted several times in this thread).

This was back in 2009/2010!

Also from your quote, this is the important part:

"GCN also process compute, and can process as contexts based on the number of ACE’s that are available."

It therefore doesn't need a context switch because it has 8x ACEs that run it in parallel to the main graphics.

"Compute on the Xbox One runs in parallel with graphics workloads"

But that's not possible in DX11, because the ACEs are not exposed by the API. "GCN is an in-order uarch" applies to DX11, but that will change in DX12.

A lot will obviously change in the future with DirectX12 – but how this integrates with the Xbox One isn’t known. With the current DirectX 11 mode, the CPU talks to the GPU a single core at a time (so in other words, not in parallel). In the DX12 future, this will change because each core can issue instructions to the GPU and talk with the GPU simultaneously.

This pretty much lines up with what people have said about the ACEs being idle in DX11, so GCN can't flex its muscles!

Another slight problem with the Xbox One’s GPU (compared to say the Playstation 4’s, or indeed a more modern PC GPU such as for example the R9 290) is that the total number of ACEs is lower. While the two ACEs on the Xbox One can handle eight queues each, the Playstation 4 (and modern desktop GPUs) support 8 ACEs. In the case of the PS4, this means that the GPU can handle a total number of 64 compute queues, which combined with the Level 2 Volatile Bit certainly gives the PS4 a bit of a helping hand in certain situations.
 
Last edited:

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
NV is clearly telling devs how to optimize for their in-order pipeline, and here you are, Carfax83, saying NOOOO!

I guess if this coming straight from NV still fails to convince you, then nothing will. Nuff said. You are convinced of your preferred GPU's superiority, despite evidence from NVidia themselves stating there are potential issues with it running async compute.

You're conflating two different things: asynchronous compute for regular gaming, and asynchronous timewarp for VR. The discussion we're having is about Maxwell's asynchronous compute capability for regular gaming.

As I said earlier, asynchronous compute for regular gaming won't be bound by latency the way VR is. And honestly, the whole VR thing isn't really a big deal in my opinion, as VR won't be a thing for probably a few more years... if it ever even happens. More than likely it will stay a niche product for consumers. I can see VR having more use in other avenues though, especially military, aviation, law enforcement, etc., for training.

ps. HairWorks is via Tessellation; TressFX is via DirectCompute. Funny you bring up GPU PhysX; it explains why fps performance suffers badly when it's used in games.

Actually, HairWorks uses DirectCompute for the hair simulation.

Also, this statement, "GPUs make their bread and butter by being as parallel as possible," doesn't mean anything. GPUs have been parallel for rendering for a long time; everyone knows that. The current discussion is whether GPUs, and in particular Maxwell, can be parallel with simultaneous rendering + compute tasks. We know Kepler cannot; NV have said so themselves.

Yeah, well, I keep hearing you guys drop terms like "serial" and "in-order" when referring to Maxwell, and I'm like o_O

With Kepler and Fermi, yes, they cannot do asynchronous compute. But Maxwell can.
 
Last edited:
Silverforce11

Feb 19, 2009
10,457
10
76
With Kepler and Fermi, yes, they cannot do asynchronous compute. But Maxwell can.

Sure it can; I didn't say it can't. Whether it's good at it or not is up in the air; until you find many game devs who praise it for that, it'll remain questionable.

To be more correct, HairWorks uses both DirectCompute (to simulate) & Tessellation (to generate the hairs). TressFX is pure DirectCompute.
 
Last edited:

Goatsecks

Senior member
May 7, 2012
210
7
76
Silverforce,

If it is not: "FPS per inch baby", marketing slides or "the fury-x is a better overclocker than the 980ti", you are trash talking nvidia. You argue the toss over stupid details relating to irrelivant metrics just to be able to claim AMD is superior to Nvidia in some amazingly obscure way. I have no idea where you get the time or energy for all these posts.

Please stop attacking other posters. -Admin DrPizza
 
Last edited by a moderator:
Silverforce11

Feb 19, 2009
10,457
10
76
Silverforce,

If it is not: "FPS per inch baby", marketing slides or "the fury-x is a better overclocker than the 980ti", you are trash talking nvidia. You argue the toss over stupid details relating to irrelivant metrics just to be able to claim AMD is superior to Nvidia in some amazingly obscure way. I have no idea where you get the time or energy for all these posts.

So, what's your contribution to this thread or the other one which you crapped all over?

If discussing the merits of different GPU uarchs and providing a source is beyond your brain to comprehend, then you are free to steer clear.

Edit: Seems you have no interest in contributing beyond ad hominem.

I'm sure you can comprehend that "beyond your brain to comprehend" is a personal attack as well. The two of you need to stop. -Admin DrPizza
 
Last edited by a moderator:

Goatsecks

Senior member
May 7, 2012
210
7
76
If discussing the merits of different GPU uarchs and providing a source is beyond your brain to comprehend, then you are free to steer clear.

It is very disingenuous for you to suggest your posts are anything other than the biased cheerleading that they are.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
It is very disingenuous for you to suggest your posts are anything other than the biased cheerleading that they are.

And your posts don't add anything to the technical conversation of GCN vs Maxwell 2.0.

If you have any technical information to add, by all means do so; attacking Silverforce11 is not adding anything to the conversation. In fact, it is forbidden and considered off topic, a personal attack, trolling, etc.

Nor has your post contributed to an on-topic technical conversation. Just report the posts; don't add to the derailment or fan the flames, please. Thanks. -Admin DrPizza
 
Last edited by a moderator:

zlatan

Senior member
Mar 15, 2011
580
291
136
In the context of NV, you should read the PDF. Async compute has a latency penalty on NV's pipeline because it's in-order: it has to wait for the rendering to finish before the compute job is handled. So if a draw call for the frame render takes 5ms, a compute call has to WAIT 5ms before it can commence. That's 5ms of latency. If devs optimize their engines so they submit a ton of smaller draw calls, instead of one 5ms chunk they split it into 5x 1ms chunks, and async compute can squeeze in at each 1ms interval rather than waiting 5ms. So they reduce the latency from 5ms to 1ms, as an example.

GCN doesn't care about context or in-order operations; it's stateless (no context switch required for shaders to do rendering/compute) and it's out-of-order. So async compute on GCN adds 0ms latency.

This is why LiquidVR was said to achieve 10ms motion-to-photon output, because there is no latency overhead.

This is my understanding of it from the documents; I could be wrong. We'll need input from qualified people to be more certain. @Zlatan?

Edit: What isn't in doubt is that devs need to be very careful in their DX12 engines, or they could degrade performance on NV when they use async compute or shaders.

First of all, VR is a different kind of workload, and LiquidVR is a software solution built around Mantle, so most of the advantages come from the upgraded Mantle API and not from the hardware. Even if another IHV has the hardware for VR, they won't be allowed to use Mantle and LiquidVR.

The in-order logic won't be a big issue in the first round. It can be an issue later, but there is a huge difference between what an API + hardware combination is capable of and how the devs use it. Most engines are not designed for D3D12, so in the first round the primary focus will be a new engine structure for the new APIs. In this respect most devs will ensure some backward compatibility with D3D11, and in that case most multi-engine solutions won't use more than one compute command queue. This is more or less a safe way to start with D3D12.
NV just use a more limited hardware solution than AMD. Their hardware is less stateless, and this means some synchronization strategies will be non-optimal for them. In worst-case scenarios it may harm performance.

I think the multi-engine feature is one of the most useful things in D3D12, but most of the time we talk about theory and not practice. With consoles and with Mantle an async shader solution is easy, because the program targets a single piece of hardware or some very similar architectures. With D3D12 a multi-engine implementation must target a huge number of very different architectures, and it is unknown whether it will work or not. Things can get even worse with undocumented architectures, like all GeForces. Luckily D3D12 has a robust multi-engine solution where graphics is a superset of compute, and compute is a superset of the copy engine. This means the program can use the best engine for the actual pipeline, but the driver can execute it differently. For example, a compute pipeline can be loaded into the compute queue to execute it asynchronously with a graphics task, with the correct synchronization. It may run faster, but it may not, and there is a minimal chance that the async scheduling will affect performance negatively. In that case the IHV can create a driver profile for the game, and the compute pipeline can be loaded into the graphics queue.
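For anyone wondering what that multi-engine setup looks like from the application side, here is a bare-bones D3D12 sketch (my own illustration; error handling omitted, and the queue/fence names are made up). The point is that the app only declares the queues and the synchronization; whether the compute work actually overlaps the graphics work is up to the driver and the hardware:

Code:
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // "Direct" (graphics) queue: accepts draw, compute and copy work.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // Separate compute queue: compute/copy only. Work submitted here *may*
    // run alongside the graphics queue, or the driver may serialize it.
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // Fence for cross-queue synchronization: graphics waits on compute results.
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // ... record and execute command lists on each queue here ...
    // computeQueue->Signal(fence.Get(), 1); // compute work for the frame is done
    // gfxQueue->Wait(fence.Get(), 1);       // graphics consumes it only after that

    return 0;
}

On hardware or drivers that can't overlap the two queues, the same code still runs correctly; it just gets executed serially, which is exactly the "may run faster, but may not" situation described above.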
 
Last edited:

shady28

Platinum Member
Apr 11, 2004
2,520
397
126
And your posts don't add anything to the technical conversation of GCN vs Maxwell 2.0.

If you have any technical information to add, by all means do so; attacking Silverforce11 is not adding anything to the conversation. In fact, it is forbidden and considered off topic, a personal attack, trolling, etc.

The problem with the technical conversation here is that none of the folks having that conversation are qualified. If they were, they'd be writing code to test their theories and using something like GPUView to see where the bottleneck is.

This "discussion" is like laymen talking about the benefits of one type of brain surgery over another based on the results of a single operation on a very specific type of illness.

So this whole thread has basically devolved into AMD vs Nvidia, with the usual actors involved. Very little factual information has been posted, and what has been posted is drowned in that pointless conflict.
 
Silverforce11

Feb 19, 2009
10,457
10
76
The problem with the technical conversation here is that none of the folks having that conversation are qualified. If they were, they'd be writing code to test their theories and using something like GPUView to see where the bottleneck is.

This "discussion" is like laymen talking about the benefits of one type of brain surgery over another based on the results of a single operation on a very specific type of illness.

So this whole thread has basically devolved into AMD vs Nvidia, with the usual actors involved. Very little factual information has been posted, and what has been posted is drowned in that pointless conflict.

Fully disagree. I've learnt a ton already from this thread because I go to the source of people's links and read it for myself. So rather than taking the advice of laymen, the info comes from the source: people directly involved with game engines & DX12.

Also, @Zlatan thanks for the valued inputs!
 

VR Enthusiast

Member
Jul 5, 2015
133
1
0
Correction, NVIDIA told reviewers not to use MSAA because they felt it was a bug in the benchmark/game. It's been 100% proven to be an Nvidia driver issue. Nvidia has acknowledged it as their issue to fix in the last week or so. This is from the AoTS forums.

Do you have a link to Nvidia acknowledging it's their issue?

Maybe come back and revisit the subject when the game doesn't crash Nvidia's video drivers. Yes, this is the video driver crashing over and over - had to do a hard reboot :

Was this reported anywhere in the press? I know at least one website used a GTX 960 but they didn't report multiple crashes.
 
Last edited:

VR Enthusiast

Member
Jul 5, 2015
133
1
0
The problem with the technical conversation here is that none of the folks having that conversation are qualified. If they were, they'd be writing code to test their theories and using something like GPUView to see where the bottleneck is.

Don't you think that if it were that simple, Nvidia would have done it and fixed their problem? Assuming it is fixable.

So this whole thread has basically devolved into AMD vs Nvidia, with the usual actors involved. Very little factual information has been posted, and what has been posted is drowned in that pointless conflict.

I think it's been an interesting thread with some good discussion. It's only been in the last day or two that one or two people have tried to wreck it.
 

Goatsecks

Senior member
May 7, 2012
210
7
76
And your posts don't add anything to the technical conversation of GCN vs Maxwell 2.0.

If you have any technical information to add, by all means do so; attacking Silverforce11 is not adding anything to the conversation. In fact, it is forbidden and considered off topic, a personal attack, trolling, etc.

I am simply criticising his posts, and my criticism has been fair: he always has a heavy bias in his posts. I am not character-assassinating.

So this whole thread has basically devolved into AMD vs Nvidia, with the usual actors involved.

Exactly.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
"The reason for this is because GCN is an In-Order processor (instructions are fetched, executed & completed in order they are issued. If an instruction stalls, they it causes other instructions ‘behind it’ to stall also), and thus ensuring the pipeline is running smoothly is critical for best performance."

First time I seen anyone say this of GCN.

AT said so right at the beginning.

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

But before we go any further, let’s stop here for a moment. Now that we know what a CU looks like and what the weaknesses are of VLIW, we can finally get to the meat of the issue: why AMD is dropping VLIW for non-VLIW SIMD. As we mentioned previously, the weakness of VLIW is that it’s statically scheduled ahead of time by the compiler. As a result if any dependencies crop up while code is being executed, there is no deviation from the schedule and VLIW slots go unused. So the first change is immediate: in a non-VLIW SIMD design, scheduling is moved from the compiler to the hardware. It is the CU that is now scheduling execution within its domain.

So what can you do with dynamic scheduling and independent SIMDs that you could not do with Cayman’s collection of SPs (SIMDs)? You can work around dependencies and schedule around things. The worst case scenario for VLIW is that something scheduled is completely dependent or otherwise blocking the instruction before and after it – it must be run on its own. Now GCN is not an out-of-order architecture; within a wavefront the instructions must still be executed in order, so you can’t jump through a pixel shader program for example and execute different parts of it at once. However the CU and SIMDs can select a different wavefront to work on; this can be another wavefront spawned by the same task (e.g. a different group of pixels/values) or it can be a wavefront from a different task entirely.

Cayman had a very limited ability to work on multiple tasks at once. While it could consume multiple wavefronts from the same task with relative ease, its ability to execute concurrent tasks was reliant on the API support, which was limited to an extension to OpenCL. With these hardware changes, GCN can now concurrently work on tasks with relative ease. Each GCN SIMD has 10 wavefronts to choose from, meaning each CU in turn has up to a total of 40 wavefronts in flight. This in a nutshell is why AMD is moving from VLIW to non-VLIW SIMD for Graphics Core Next: instead of VLIW slots going unused due to dependencies, independent SIMDs can be given entirely different wavefronts to work on.
Tasks can be moved around but will be executed as compiled - when a task is limited by dependencies, unused resources can be used for another task.

Execution (pipeline) is completely in order. GCN has a dynamic scheduler.
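If a toy model helps (this has nothing to do with real GCN hardware, it's just an illustration of "strictly in-order within a wavefront, but the SIMD can pick another ready wavefront when one stalls"):

Code:
#include <cstdio>
#include <vector>

// Each wavefront is a list of instructions; a value > 0 means that instruction
// stalls the wavefront for that many extra cycles (e.g. a memory access).
// Within a wavefront, issue order never changes. Each cycle the SIMD issues
// from the first wavefront that is ready; if none are ready, the cycle is idle.
static int simulate(int numWaves)
{
    struct Wave { std::vector<int> stalls; size_t pc = 0; int readyAt = 0; };

    std::vector<Wave> waves(numWaves);
    for (auto& w : waves)                    // 8 instructions per wavefront,
        for (int i = 0; i < 8; ++i)          // every 4th one stalls for 10 cycles
            w.stalls.push_back(i % 4 == 3 ? 10 : 0);

    int total = numWaves * 8, issued = 0, cycle = 0;
    while (issued < total)
    {
        for (auto& w : waves)
        {
            if (w.pc < w.stalls.size() && w.readyAt <= cycle)
            {
                w.readyAt = cycle + 1 + w.stalls[w.pc]; // strictly in-order issue
                ++w.pc;
                ++issued;
                break;                                  // one issue per cycle
            }
        }
        ++cycle;
    }
    return cycle;
}

int main()
{
    std::printf("1 wavefront : %2d instructions in %d cycles\n", 8,  simulate(1));
    std::printf("4 wavefronts: %2d instructions in %d cycles\n", 32, simulate(4));
    return 0;
}

With several wavefronts to choose from, the stalls overlap and far fewer cycles are wasted per instruction, even though no wavefront ever executes its own instructions out of order. That's latency hiding through scheduling, not out-of-order execution.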

If that were true, AMD's VR implementation would be impossible.

But the article did have this to say:

"We can speculate that there’s a good chance Microsoft did indeed customize the Command Processor(s) to an extend, because of a quote from one of the Xbox One’s architects, Andrew Goossen with EuroGamer: “We also took the opportunity to go and highly customise the command processor on the GPU."

By that definition, GCN is no longer an in-order pipeline uarch.
You must have missed this part.

The reason for this is because GCN is an In-Order processor (instructions are fetched, executed & completed in the order they are issued. If an instruction stalls, it causes other instructions ‘behind it’ to stall also), and thus ensuring the pipeline is running smoothly is critical for best performance.
Sony also customized GCN for the purpose of strong Async Compute performance (as noted by the links already posted several times in this thread).

This was back in 2009/2010!

Also from your quote, this is the important part:

It therefore doesn't need a context switch, because it has 8 ACEs that run compute in parallel with the main graphics.

But that's not possible in DX11, because the ACEs are not exposed by the API. "GCN is an in-order uarch" applies to DX11, but that will change in DX12.

This pretty much lines up with what people have said about the ACEs being idle in DX11, so GCN can't flex its muscles!
The article says nothing like what you are saying.

The typical desktop GCN architecture according to documentation processes a single context at a time. Naturally (as we’ve just seen above), it’s possible to operate on multiple, but to do this you’ll need to run them serially and context switch. GCN also processes compute, and can process contexts based on the number of ACEs that are available.
Nothing there says that it doesn't need a context switch. To me, the article appears to be about the ACE units prepping and commanding the contexts for execution, not switching between them.
 
Silverforce11

Feb 19, 2009
10,457
10
76
@Enigmoid

Did you miss this in the article? This passage isn't the article's writer summing up his knowledge of GCN, but a direct quote from the people involved in designing GCN:

“In particular, compute tasks can leapfrog past pending rendering tasks, enabling low-latency handoffs between CPU and GPU.”

If the designers of GCN say it can achieve this, then by definition it's not an in-order execution engine. Now, you can claim they lied, but going on their words, GCN is definitely not serial/in-order. Here's the definition, one of the criteria for out-of-order engines: http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf

independent instructions behind a stalled instruction can pass it

Here's one of the criteria for an in-order:

"one stalls, they all stall" >>> This is also what NV themselves have said regarding async compute usage (PDF links in the above posts), it's stuck in traffic if there's a graphics task in front.

GCN also processes compute, and can process contexts based on the number of ACEs that are available.

The reason GCN doesn't have context-switch overhead is that the 8 ACEs can only do compute. They exist only for compute, so there's no need to switch contexts in those 8 pipelines. GCN has a separate command processor for handling graphics queues.

Compare that to NV's uarch, where one engine shares the rendering & compute queues.

Edit: @Goatsecks If you have nothing of value to contribute besides personal attacks against me, then take it to PM; your actions are spoiling yet another informative thread.
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
@Enigmoid

Did you miss this in the article? This passage isn't the article's writer summing up his knowledge of GCN, but a direct quote from the people involved in designing GCN:

If the designers of GCN say it can achieve this, then by definition it's not an in-order execution engine. Now, you can claim they lied, but going on their words, GCN is definitely not serial/in-order. Here's the definition, one of the criteria for out-of-order engines: http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf

Here's one of the criteria for an in-order:

"one stalls, they all stall" >>> This is also what NV themselves have said regarding async compute usage (PDF links in the above posts), it's stuck in traffic if there's a graphics task in front.

GCN does not have true out-of-order execution. It has a dynamic scheduler. Read up on it.

It can rearrange tasks, a property of out-of-order execution. However, it cannot rearrange instructions within a wavefront. Once the wavefront is prepped and in the pipeline, execution is strictly in order.

It simply is not that granular. There is no real need for an out-of-order pipeline in graphics; OoO is simply too power-hungry for this type of work.

Now, in a GPU, tasks are structured to expose parallelism. Instructions and tasks are broken up into smaller pieces that individual units (the CU and the SIMD unit) can handle. What GCN does is let the SIMD units be sent off to operate on different wavefronts if there are dependencies.

They can be given other, in-order parallel instructions.