4. A scientific paper confirmed my suspicions about Maxwell's lack of parallelism. Theoretically, each Maxwell SM can host 64 concurrent Warps, with each Warp made up of 32 threads, for a total of 2,048 threads per SM. Sadly, this is not the case in practice: Maxwell loses performance once we move beyond 16 concurrent Warps per SM. This puts the maximum threads per SM, before performance drops off, at 512.
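The occupancy arithmetic above can be sketched in a few lines; all figures come straight from the text:

```python
# Maxwell per-SM occupancy math (figures from the text above).
WARP_SIZE = 32               # threads per warp on Maxwell
MAX_WARPS_PER_SM = 64        # theoretical concurrent warps per SM
EFFECTIVE_WARPS_PER_SM = 16  # point where performance drops off in practice

theoretical_threads = MAX_WARPS_PER_SM * WARP_SIZE      # 2,048 threads per SM
effective_threads = EFFECTIVE_WARPS_PER_SM * WARP_SIZE  # 512 threads per SM

print(theoretical_threads, effective_threads)  # 2048 512
```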
Maxwell is thus not a good candidate for Asynchronous compute + graphics, even if its static scheduler can emulate the process. On top of that, Maxwell's static scheduler hits the CPU hard when attempting to emulate Asynchronous compute + graphics, as revealed by Dan Baker of Oxide:
5&6. In order to understand what I mean by the ROp-to-cache or ROp-to-memory-controller ratio, we need to look at a schematic of GM107.
GM20x differs from GM107 in that NVIDIA increased the ratio of ROps per 64-bit memory controller from 8:1 to 16:1. So let's look at both GM204 and GM200.
GM204
- 64 ROps divided by 16 = 4 groupings.
- 2MB of L2 cache divided by 4 = 512KB.
- 256-bit bus divided by 4 = 64-bit.
- Each grouping of 16 ROps has 512KB of L2 cache and a 64-bit memory controller at its disposal (aside from the color cache).
GM200
- 96 ROps divided by 16 = 6 groupings.
- 3MB of L2 cache divided by 6 = 512KB.
- 384-bit bus divided by 6 = 64-bit.
- Each grouping of 16 ROps has 512KB of L2 cache and a 64-bit memory controller at its disposal (aside from the color cache).
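The GM204 and GM200 breakdowns above are the same division applied to different chip-wide totals, which a small sketch makes explicit (all figures from the lists above):

```python
# Per-partition resources implied by the GM204/GM200 figures above.
def rop_partition(rops, l2_kb, bus_bits, rops_per_partition=16):
    """Divide chip-wide totals across groupings of 16 ROps."""
    partitions = rops // rops_per_partition
    return {
        "partitions": partitions,
        "l2_kb_per_partition": l2_kb // partitions,
        "bus_bits_per_partition": bus_bits // partitions,
    }

print(rop_partition(64, 2048, 256))  # GM204 -> 4 partitions, 512KB L2, 64-bit each
print(rop_partition(96, 3072, 384))  # GM200 -> 6 partitions, 512KB L2, 64-bit each
```

Note that both chips land on the same 512KB-of-L2 and 64-bit-controller allotment per 16 ROps, which is the point of the comparison.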
The result is that there isn't enough bandwidth to feed these ROps: they consistently fall about 10GPixel/s short of their theoretical throughput, and that is without any other work straining the memory controllers or L2 cache, as seen here:
NVIDIA, knowing this was a limitation, thus invested heavily in color compression algorithms in order to reach parity, or near parity, with Fiji and its 64 ROps, as seen here:
This issue is further compounded by the inefficient memory controllers used by GM20x. NVIDIA had to sacrifice efficiency in order to keep die size down and power usage low as seen here:
Fiji and Hawaii use the following ROp setup:
Fiji differs from Hawaii in two ways: its 4096-bit bus, divided by 8, gives each group of 8 ROps a 512-bit memory controller (GP100 will have 16 ROps per 512-bit memory controller), and each group of 8 ROps gets 256KB of L2 cache versus 128KB per 8 ROps on Hawaii.
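The same per-group arithmetic used for GM20x can be applied to the GCN figures. Note the chip-wide totals here (64 ROps on both chips, 2MB of L2 on Fiji, 1MB on Hawaii) are my assumed figures, back-derived from the per-group ratios stated above rather than taken directly from the text:

```python
# Per-Render-Back-End resources for GCN, 8 ROps per group.
def rbe_resources(rops, bus_bits, l2_kb, rops_per_rbe=8):
    groups = rops // rops_per_rbe
    return bus_bits // groups, l2_kb // groups

# Fiji: 64 ROps, 4096-bit HBM bus, 2MB L2 (totals assumed, see lead-in)
print(rbe_resources(64, 4096, 2048))  # (512, 256): 512-bit and 256KB per 8 ROps
# Hawaii: 64 ROps, 512-bit GDDR5 bus, 1MB L2 (totals assumed, see lead-in)
print(rbe_resources(64, 512, 1024))   # (64, 128): 64-bit and 128KB per 8 ROps
```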
That blue bar above all the Render Back Ends links them together (pipelined ROps) for resource sharing. Each Render Back End has its own depth (Z)/stencil cache and color cache and, on top of that, access to a Global Data Share cache.
GCN Hawaii/Fiji ROps are thus not bandwidth-starved the way Maxwell's are. This is why Fiji can compete with GM200 at 4K resolution.
7. And finally, P100. P100 has moved to 64 SIMD FP32 cores per SM, versus 128 in GM20x and 192 in Kepler, just like GCN's 64 SIMD cores per CU. P100 has also moved to 4 Texture Mapping Units per SM, again like a GCN CU. With less logic per SM but the same amount of local cache per SM as GM20x, P100 will be less likely to suffer cache spills into L2 when running concurrent Warps. This means that, like GCN, P100 is a highly parallel architecture. To top it off, P100 will use Warps with a maximum thread size of 64, just like a GCN Wavefront. So NVIDIA has moved, as I suspected it would, towards a more GCN-like architecture.
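A rough sketch of what halving the cores per SM does to per-core resources. The core counts come from the text; the 256KB register file per SM and the shared-memory sizes (96KB on a GM20x SM, 64KB on a GP100 SM) are my assumed figures from NVIDIA's published specs, not from the text itself:

```python
# Per-core register-file and shared-memory allotment (KB per FP32 core).
# Register/shared totals are assumed figures, not from the text above.
def per_core(cores, reg_kb, shared_kb):
    return reg_kb / cores, shared_kb / cores

print(per_core(128, 256, 96))  # GM20x SM: 2.0 KB registers, 0.75 KB shared per core
print(per_core(64, 256, 64))   # GP100 SM: 4.0 KB registers, 1.0 KB shared per core
```

With fewer cores contending for a similar pool of local storage, each core on GP100 gets roughly twice the registers, which is consistent with the reduced pressure on L2 described above.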
This quote, from NVIDIA, confirms what I had been saying about GM20x:
Overall shared memory across the GP100 GPU is also increased due to the increased SM count, and aggregate shared memory bandwidth is effectively more than doubled. A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).
Source:
https://devblogs.nvidia.com/parallelforall/inside-pascal/
Conclusion:
We're looking at GM20x remaining the better DX11 GPU of this generation, not due to hardware superiority but rather to RTG's software inferiority.
As DX12 titles release, GM20x will begin to lose steam the same way Kepler did and will be surpassed by its GCN competitors.
NVIDIA's new architecture, Pascal, will likely turn out to be quite successful and, by virtue of being more GCN-like, is not likely to suffer the shorter life spans of its Kepler and Maxwell ancestors.
Performance/watt? Knowing all of this, it would seem to me that Polaris/Vega and Pascal will be quite similar in performance/watt due to their overall similarities.
Polaris/Vega will bring Asynchronous compute + graphics into the mix, which will offset the higher power usage of their extra logic (ACEs, caches, and hardware scheduling). Every aspect of Polaris/Vega will be tuned for higher efficiency; thus, while they won't likely arrive with more SIMD cores than their predecessors, they will instead raise the IPC of each one.
The end result? Similar TDP and performance from RTG and NVIDIA architectures, with RTG's Polaris/Vega maybe even drawing slightly less wattage for the same performance output.