Great post Mahigan. It raised one question for me, though.
Isn't it technically 3 queues, or "threads" as you put it?
It's my understanding that multi-engine also introduces a copy queue, which brings several benefits to the table regarding resource binding and other aspects of memory management. Or am I mistaken, and copy tasks were previously able to be done separately from work in the graphics pipeline?
(This is an honest question; I am not goading you.)
I'd also like to ask for your opinion on a couple things.
Do you think the phenomenon commonly referred to as the "console effect" is in any way related to the fact that we are starting to see games primarily created to work well on the current consoles, now that developers no longer have to create optimized codepaths for the PS3/360?
Could it be that the engines for previous games made compromises to suit porting/optimizing for the old consoles within the time constraints of release deadlines, and now that that burden is lifted, we are seeing the benefits of console optimization transfer over more readily, since the consoles are x86, GCN machines, with one of them already running a version of DX11?
Everybody can do graphics plus copy; only graphics plus compute is in question.
And yes, MS is talking about its Xbone; that's where the money's at.
The cores in the Xbone are too slow to run the DX11 driver and keep the GPU utilized enough for good FPS. Even with DX12 they are pretty slow, so of course MS is trying to get everybody to use DX12 as well as possible.
People on this forum (and not only here) transfer these ideas to benchmarks done on the fastest Intel cores available; it just doesn't work like that.
Nvidia has huge performance gains with DX12 on slower cores, just as AMD has on any core, since AMD needs so much more CPU speed to keep GCN utilized under DX11. That's what driver overhead really means: you get better speed with slower CPUs, not more speed no matter what.
It is 3 queues, but both NV Maxwell and GCN have two DMA engines, allowing them to process Copy commands in parallel with Graphics and Compute tasks.
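To illustrate, here's a minimal D3D12 sketch of those three queue types. It assumes an already-created ID3D12Device, omits error handling, and isn't any engine's actual code:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue,
                  ComPtr<ID3D12CommandQueue>& copyQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    // DIRECT queues accept graphics, compute, and copy commands.
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

    // COMPUTE queues accept compute and copy commands; on GCN these
    // feed the ACEs, letting compute run alongside graphics.
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    // COPY queues accept copy commands only; these map onto the DMA
    // engines, so transfers can overlap graphics and compute work.
    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));
}
```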
As for the console effect: if you look at console titles released in the beginning of 2015 that made their way to the PC (Dying Light, for example), they were primarily graphics-rendering heavy. They started to hit a wall on the consoles due to the relatively weak APUs on board. In an attempt to extract as much performance as possible, developers migrated towards using the teraflops of compute power on tap as a means of tackling work which was traditionally handled by the graphics rendering pipeline. This led to the console releases of late 2015 (Star Wars Battlefront, for example), which made their way onto the PC. Having extracted that much performance, developers began looking at ways of reducing frame-time latency in console titles. For the Xbox One, that meant DX12 and the use of Asynchronous Compute + Graphics (the PS4 was already using this technique as far back as Battlefield and Thief).
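For anyone wondering what Asynchronous Compute + Graphics looks like at the API level, here is a rough D3D12 sketch. The names (gbufferList, shadingList, etc.) are purely illustrative, not from any actual engine: graphics work that doesn't depend on the compute output overlaps the compute queue, and the graphics queue only waits at the point where it consumes the results.

```cpp
#include <d3d12.h>

// Submit one frame across two queues. The fence lets the graphics
// queue synchronize with the compute queue on the GPU timeline.
void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList*  gbufferList,  // independent graphics work
                 ID3D12CommandList*  computeList,  // e.g. lighting in compute
                 ID3D12CommandList*  shadingList,  // consumes compute results
                 ID3D12Fence*        fence,
                 UINT64&             fenceValue)
{
    // Graphics work that doesn't depend on the compute results goes
    // first; it runs concurrently with the compute queue below.
    ID3D12CommandList* gfx[] = { gbufferList };
    graphicsQueue->ExecuteCommandLists(1, gfx);

    // Compute work is kicked off on its own queue (the ACEs on GCN)
    // and signals the fence when it finishes.
    ID3D12CommandList* comp[] = { computeList };
    computeQueue->ExecuteCommandLists(1, comp);
    computeQueue->Signal(fence, ++fenceValue);

    // The graphics queue stalls only here, right before the work that
    // actually reads the compute output.
    graphicsQueue->Wait(fence, fenceValue);
    ID3D12CommandList* shade[] = { shadingList };
    graphicsQueue->ExecuteCommandLists(1, shade);
}
```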
The extra compute work led to an increase in the compute-to-graphics ratio in these titles, which made its way onto the PC (devs would have had to rewrite the code into graphics rendering workloads to retain the old graphics-to-compute ratio, which would have increased development time and costs).
Incidentally, we have noticed that console ports from the end of 2015 onward have performed favorably on GCN even when they are mere DX11 titles on the PC. The DX12 titles should perform even more favorably on GCN, as DX12 alleviates the API overhead which afflicted GCN GPUs.
We even have titles on the horizon which have moved almost entirely to the compute pipeline (the PS4's Dreams).
That's the shift we are seeing, and what some refer to as the "console effect". Given that AMD has won the contracts for the PS4.5 and new Nintendo consoles, and will most likely win the contracts for the next PlayStation and Xbox consoles, this trend is not likely to change.
When I state that NV's architectures will need to become more GCN-like, I'm primarily focusing on increased compute throughput and parallelism. Incidentally, we see Pascal taking that route (GP100 is likely to be a 12-teraflop monster). I'm expecting more cache redundancy in Pascal and Volta, and for Volta, Asynchronous Compute + Graphics support. I don't think Asynchronous Compute + Graphics is a fluke or a fad; imo, it's here to stay, the same way SMT is here to stay on the CPU side.