Quantum Break: More like Quantum Broken.

Silverforce11 · Apr 14, 2016

Mahigan said:
Afaik, GM20x does support Parallel Copy and Graphics commands though both are handled by the driver (Static Scheduler). For GM20x, everything is placed into a single queue (3D/Graphics queue) and then the static scheduler handles the hardware assignments of tasks.

Sure, in CUDA maybe. But it isn't compatible with DX12.

That's where their problem lies. DX12 treats Copy Queues as a subset of Compute.

Because the hardware is incapable of having parallel graphics + compute queues at all, it's copy queue is stalled, running in serial mode like the rest of the queues.

As such, for a game that relies so much on Copy queues for it's 4x prior frame "temporal" reconstruction, performance tanks.

TT should have made a GPUView result for Titan X, it would be very clear, every queue is being shoved into the graphics pipeline just like in Fable.

Vaporizer · Apr 14, 2016

There will be a driver that will fix this issue with async compute for Maxwell in the future. And this is not a lie. And 970 has fully connected 4GB VRAM. And Kepler Chips have full DX11.2 support in Hardware...

Mahigan · Apr 14, 2016

Silverforce11 said:
Sure, in CUDA maybe. But it isn't compatible with DX12.

That's where their problem lies. DX12 treats Copy Queues as a subset of Compute.

Because the hardware is incapable of having parallel graphics + compute queues at all, it's copy queue is stalled, running in serial mode like the rest of the queues.

As such, for a game that relies so much on Copy queues for it's 4x prior frame "temporal" reconstruction, performance tanks.

TT should have made a GPUView result for Titan X, it would be very clear, every queue is being shoved into the graphics pipeline just like in Fable.

DX12 has 3 contexts, Graphics, Compute and Copy. These map perfectly to GCNs dedicated logic units.

For GM20x, there is a middle man involved. This middle man is the Static Scheduler (software scheduling). Therefore for GM20x, you can pretty much picture DX12 handing over all of the commands to some middle man named Mr Static Scheduler. This middle man will optimize and reorganize the commands in order to ensure a more efficient execution by Mr GM20x. GCN lacks this middle man. Commands are simply translated and sent to the command buffer, with context attached. From the command buffer, they're fed to the GPU which then, in hardware, assigns the tasks to the correct logic units based on their context flags.

GM20x can thus handle Copy Commands in parallel with Compute and/or Graphics as the available logic units are in place. It's when it comes to mixing compute and graphics that things get interesting, in CUDA, compute and graphics is handled by an on chip ARM core which is not compatible with DX12. Absent this logic unit, compute and graphics cannot be executed in parallel.

So to summarize, Copy commands can be executed in parallel, absent CUDA, the Static scheduler can handle the assignment of those tasks because there are available logic units (DMA engines).

ShintaiDK · Apr 14, 2016

Vaporizer said:
And Kepler Chips have full DX11.2 support in Hardware...

11.2 was the AMD GCN 1.0 lie since 7970. For NVidia it was another. In short, both lies.

Silverforce11 · Apr 14, 2016

Mahigan said:
DX12 has 3 contexts, Graphics, Compute and Copy. These map perfectly to GCNs dedicated logic units.

For GM20x, there is a middle man involved. This middle man is the Static Scheduler (software scheduling). Therefore for GM20x, you can pretty much picture DX12 handing over all of the commands to some middle man named Mr Static Scheduler. This middle man will optimize and reorganize the commands in order to ensure a more efficient execution by Mr GM20x. GCN lacks this middle man. Commands are simply translated and sent to the command buffer, with context attached. From the command buffer, they're fed to the GPU which then, in hardware, assigns the tasks to the correct logic units based on their context flags.

GM20x can thus handle Copy Commands in parallel with Compute and/or Graphics as the available logic units are in place. It's when it comes to mixing compute and graphics that things get interesting, in CUDA, compute and graphics is handled by an on chip ARM core which is not compatible with DX12. Absent this logic unit, compute and graphics cannot be executed in parallel.

So to summarize, Copy commands can be executed in parallel, absent CUDA, the Static scheduler can handle the assignment of those tasks because there are available logic units (DMA engines).

We have not witness proof of that yet, so do not be so sure. Can't trust paper specs.

If you watch the video as well as SIGGRAPH presentation, they stress that DX12 treats Copy Queues as a sub-set of Compute. It is still a compute task.

Btw, IIRC, there's an issue with only 1/2 DMA engines being enabled on GTX SKUs. Only Teslas have both enabled. I read this awhile ago from a tech site, does this sound familiar to you?

Mahigan · Apr 14, 2016

Silverforce11 said:
We have not witness proof of that yet, so do not be so sure. Can't trust paper specs.

If you watch the video as well as SIGGRAPH presentation, they stress that DX12 treats Copy Queues as a sub-set of Compute. It is still a compute task.

Btw, IIRC, there's an issue with only 1/2 DMA engines being enabled on GTX SKUs. Only Teslas have both enabled. I read this awhile ago from a tech site, does this sound familiar to you?

I read about the GTX SKUs having only 1/2 DMA engines enabled by folks posting in forums but nothing official. The last time I read up on the topic, relative to NVIDIA Architectures, was with GK110.

moonbogg · Apr 14, 2016

Guru3D reports this morning that this POS game won't support MGPU at all. What a pile of...oh wait, most new games don't support MGPU either. Hmm...

therealnickdanger · Apr 14, 2016

My last enjoyable mGPU setup was SLI Voodoo2. I more recently tried out my 970s in SLI, but it was not enjoyable. Hopefully we see a resurgence of massive single GPUs.

Well... hopefully we see a resurgence of quality-made PC games... but I think the former is more likely.

poofyhairguy · Apr 14, 2016

therealnickdanger said:
My last enjoyable mGPU setup was SLI Voodoo2. I more recently tried out my 970s in SLI, but it was not enjoyable. Hopefully we see a resurgence of massive single GPUs.

Well... hopefully we see a resurgence of quality-made PC games... but I think the former is more likely.

Two GPUs are very natural for VR. One per eye.

Raduque · May 3, 2016

If anybody still cares, looks like there was a patch on the 30th.

NEW FEATURES

Added Quit Button to Main Menu
Alt+Enter now toggles full screen mode
Option to toggle film grain on/off
Option to toggle upscaling on/off

Wonder if they fixed the frametime issues.

airfathaaaaa · May 3, 2016

27gb for 4 things?

PhonakV30 · May 3, 2016

Yes that's crazy! 27GB ?!

airfathaaaaa · May 3, 2016

behrouz said:
Yes that's crazy! 27GB ?!

i saw it earlier on pc gamer they said a "massive" update is coming

tential · May 3, 2016

airfathaaaaa said:
27gb for 4 things?

Texture on the quit menu are legit.

Bacon1 · May 3, 2016

Raduque said:
If anybody still cares, looks like there was a patch on the 30th.

Wonder if they fixed the frametime issues.

Where did you get that list? Its missing a lot.

General Windows 10 Fixes / Updates:

Fixed a few Unicode issues that prevented some users from launching the game
Fixed various keyboard input issues
Fixed aspect ratio and full screen scaling for non 16:9 resolutions
Enable Alt+Enter to switch between full screen and windowed mode
Fixed resolution selection when transitioning from windowed to fullscreen mode
Fixed frame timing not matching the refresh rate
Fixed options menu items being clickable even if theyre cropped
Fixed 1 px gaps sometimes visible in the PC keyboard key callout backgrounds
DRM fixes
Unlock descriptions for Will Diary 1 and Will Diary 2 are no longer reversed
Credits fixes
Fixed Jacks subtitles not showing in some cinematics
Remedy logo fix
Fix for a rare bug that accidentally wiped progress after completing the game
Fixed rare instances of cloud saves failing and causing loss of progress
Fixes for making Xbox Live integration more fault-tolerant.
Fixed in-game TV screen images which were sometimes grainy
Fixed circular progress bar alpha in the junction stats screen
Fixed rendering issues in the menus
Fixed issue in renderer when initializing participating media
Fixed video playback not always ending if the video was synced to audio

And yes, frametime issue was fixed.

S.H.O.D.A.N. · May 3, 2016

Would be nice to see the game re-benched.

As for the 27 GB update, I'm quite certain it just another WinStore/UWP quirk.

Raduque · May 3, 2016

Bacon1 said:
Where did you get that list? Its missing a lot.

And yes, frametime issue was fixed.

Nice. Found the list on a random reddit.

S.H.O.D.A.N. said:
Would be nice to see the game re-benched.

As for the 27 GB update, I'm quite certain it just another WinStore/UWP quirk.

Maybe 1080p native assets? Since the game was basically designed for 720p, it would make sense (to me) that the assets were designed for 720p also.

Silverforce11 · May 3, 2016

Well, the frame time fixed, but performance still trash on NV GPUs.

980Ti comparison:

https://www.youtube.com/watch?v=o6MRz4KbdI8

At 1080p, in combat scenes it drops to 35 fps, well below a 390.

https://youtu.be/LT8PFbBvbzc?t=3m54s

Their major usage of compute lighting & copy (subset of compute) queues really neuters GPUs that can't handle Async Compute.

TheELF · May 4, 2016

Silverforce11 said:
Their major usage of compute lighting & copy (subset of compute) queues really neuters GPUs that can't handle Async Compute.

Every vendor can do copy while doing anything else,only doing compute while doing graphics (since both are being done by the same shaders) is under debate.

Silverforce11 · May 4, 2016

TheELF said:
Every vendor can do copy while doing anything else,only doing compute while doing graphics (since both are being done by the same shaders) is under debate.

This isn't according to me or you.

This is according to folks in the industry.

https://youtu.be/H1L4iLIU9xU?t=15m12s

I covered it here already.

http://forums.anandtech.com/showpost.php?p=38164220&postcount=349

But a short primer:

Due to the nature of DX11, command queues get processed serially.

In DX12, Copy Queues that use the DMA engines are a sub-set of Compute and it is treated as such.

-------------

^^ Anytime folks say Async Compute is for shaders or relate to it like "both are being done by the same shaders" show they don't actually understand the principle of DX12/Vulkan Async Compute. It's NOT JUST ABOUT SHADERS. There's separate engines in ROPS and DMA that can also perform work independent of Cuda Cores or Stream Processors. Async Compute allows these to run in parallel, if the uarch supports it.

Cookie Monster · May 4, 2016

Silverforce11 said:
Well, the frame time fixed, but performance still trash on NV GPUs.

980Ti comparison:

https://www.youtube.com/watch?v=o6MRz4KbdI8

At 1080p, in combat scenes it drops to 35 fps, well below a 390.

https://youtu.be/LT8PFbBvbzc?t=3m54s

Their major usage of compute lighting & copy (subset of compute) queues really neuters GPUs that can't handle Async Compute.

I think you have to be careful in saying that some GPUs can't handle Async compute. Ive done a fair amount of research on this subject, and from what I gather Async compute is simply running shaders in the compute queue that is independent of the main queue. Think this is why the term "asynchronous" was coined for this. (Although it also seems like you could synchronize it with the main queue but will result in heavy performance loss).

Now all modern day GPUs can handle this. However performance on the other hand is a different story as TheELF has pointed out. This is the part where its heavily debated on. Its quite hard to exactly understand how nVIDIA deals with graphics+compute tasks (parallel) because there is hardly any information on what happens inside their GPUs.

However at the end of the day, this is all to increase GPU utilization which naturally results in better performance (side effects could be perhaps higher power consumption also since more units are active). So Im beginning to think that there is no one approach to the debate. Plus shared resources also plays a big factor and this is but one tiny parameter in the design that affects overall performance (What normally gets discussed here is very high level).

TheELF · May 4, 2016

Silverforce11 said:
^^ Anytime folks say Async Compute is for shaders or relate to it like "both are being done by the same shaders" show they don't actually understand the principle of DX12/Vulkan Async Compute. It's NOT JUST ABOUT SHADERS. There's separate engines in ROPS and DMA that can also perform work independent of Cuda Cores or Stream Processors. Async Compute allows these to run in parallel, if the uarch supports it.

Exactly,copy+light is not Async Compute but multi engine,different parts of the GPU working at the same time.
https://msdn.microsoft.com/en-us/library/windows/desktop/dn899217(v=vs.85).aspx

Samwell · May 4, 2016

Silverforce11 said:
^^ Anytime folks say Async Compute is for shaders or relate to it like "both are being done by the same shaders" show they don't actually understand the principle of DX12/Vulkan Async Compute. It's NOT JUST ABOUT SHADERS. There's separate engines in ROPS and DMA that can also perform work independent of Cuda Cores or Stream Processors. Async Compute allows these to run in parallel, if the uarch supports it.

I'll take this description of Async Compute from Sebbi at b3d, as he knows more what he's talking about than anyone here and i understand his post, that it's all about shaders.

It seems that people are still confusing terms "async compute", "async shaders" and "compute queue". Marketing and press doesn't seem to understand the terms properly and spread the confusion

Hardware:
AMD: Each compute unit (CUs) on GCN can run multiple shaders concurrently. Each CU can run both compute (CS) and graphics (PS/VS/GS/HS/DS) tasks concurrently. The 64 KB LDS (local data store) inside a CU is dynamically split between currently running shaders. Graphics shaders also use it for intermediate storage. AMD calls this feature "Async shaders".

Intel / Nvidia: These GPUs do not support running graphics + compute concurrently on a single compute unit. One possible reason is the LDS / cache configuration (GPU on chip memory is configured differently when running graphics - CUDA even allows direct control for it). There most likely are other reasons as well. According to Intel documentation it seems that they are running the whole GPU either in compute mode or graphics mode. Nvidia is not as clear about this. Maxwell likely can run compute and graphics simultaneously, but not both in the same "shader multiprocessor" (SM).

Async compute = running shaders in the compute queue. Compute queue is like another "CPU thread". It doesn't have any ties to the main queue. You can use fences to synchronize between queues, but this is a very heavy operation and likely causes stalls. You don't want to do more than a few fences (preferably one) per frame. Just like "CPU threads", compute queue doesn't guarantee any concurrent execution. Driver can time slice queues (just like OS does for CPU threads when you have more threads than the CPU core count). This can still be beneficial if you have big stalls (GPU waiting for CPU for instance). AMDs hardware works a bit like hyperthreading. It can feed multiple queues concurrently to all the compute units. If a compute units has stalls (even small stalls can be exploited), the CU will immediately switches to another shader (also graphics<->compute). This results in higher GPU utilization.

You don't need to use the compute queue in order to execute multiple shaders concurrently. DirectX 12 and Vulkan are by default running all commands concurrently, even from a single queue (at the level of concurrency supported by the hardware). The developer needs to manually insert barriers in the queue to represent synchronization points for each resource (to prevent read<->write hazards). All modern GPUs are able to execute multiple shaders concurrently. However on Intel and Nvidia, the GPU is running either graphics or compute at a time (but can run multiple compute shaders or multiple graphics shaders concurrently). So in order to maximize the performance, you'd want submit large batches of either graphics or compute to the queue at once (not alternating between both rapidly). You get a GPU stall ("wait until idle") on each graphics<->compute switch (unless you are AMD of course).

https://forum.beyond3d.com/posts/1911098/

Silverforce11 · May 4, 2016

@Samwell

That's not exclusive. That's talking about compute shaders.

I'm talking about multi-engine in DX12/Vulkan. If the uarch supports separate graphics + compute queues concurrently, the ROPs and DMA engines can perform tasks while the shaders (CU/SIMDs, SM/CC etc) run graphics or compute.

People always assume DX12 Async Compute is just about getting shaders to run graphics/compute workloads concurrently, but it's just a part of the bigger function.

http://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php?print=1

If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn't even use a pixel shader, it’s entirely done by vertex shaders and the rasterization hardware -- so graphics aren't using most of the 1.8 teraflops of ALU available in the CUs. Times like that during the game frame are an opportunity to say, 'Okay, all that compute you wanted to do, turn it up to 11 now.'"

You can have concurrent compute/copy queues being processed without actually using ALUs ("shaders/SP" within a SIMD and CU) at all.

Calling it Async Compute is actually not accurate, it should be called Multi-Engine (Graphics, Compute, Copy).

Silverforce11 · May 4, 2016

Also, for relevance.

@Samwell

All modern GPUs are able to execute multiple shaders concurrently. However on Intel and Nvidia, the GPU is running either graphics or compute at a time (but can run multiple compute shaders or multiple graphics shaders concurrently). So in order to maximize the performance, you'd want submit large batches of either graphics or compute to the queue at once (not alternating between both rapidly). You get a GPU stall ("wait until idle") on each graphics<->compute switch (unless you are AMD of course).

^ This limitation is aka "slow context switch". It's what I've talked about when I say games with lots of compute will tank on Kepler (a lot) and Maxwell. This is also why NV's VR cannot do true preemption for async timewarp. Why is it a 780Ti which can be faster than a 970 does not even qualify for minimum VR? It will frequently miss the Async Timewarp, causing stutter/latency and nausea, much worse than Maxwell even.

Pascal fixes these issues, it no longer needs to do a slow context switch and it supports fine-grained preemption. It's GCN-like.

In relation to Quantum Break, using GPUView, we see that it submits many DX12 Copy queues while Graphics/Compute queues are also in-flight. Refer to my & @Mahigan's post a few pages back. It shouldn't be doing it like that for NV GPU because it gimps them. The context switch kills their performance. My prediction is Pascal will show really strong performance in Quantum Break.

Quantum Break: More like Quantum Broken.

Lifer

Member

Senior member

Lifer

Lifer

Senior member

Lifer

Senior member

Lifer

Lifer

Senior member

Senior member

Senior member

Diamond Member

Diamond Member

Senior member

Lifer

Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Senior member

Lifer

Lifer