First off, please keep this to technical chat. I don't want fanboy nonsense cluttering up the thread.
So there's plenty of evidence that Kepler is not doing as well in recent games as its contemporary AMD GPU, Hawaii; see for example this thread. What I haven't seen is much well-informed speculation as to why.
Here's my theory:
Instruction Level Parallelism
On a fundamental level, the Kepler architecture needs shader code to provide ILP in order to achieve full utilization of its CUDA cores: there is an imbalance between the number of warp schedulers in a Kepler SMX and the number of CUDA cores. Here's the relevant part from the Kepler Tuning Guide:
https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html
"Also note that Kepler GPUs can utilize ILP in place of thread/warp-level parallelism (TLP) more readily than Fermi GPUs can. Furthermore, some degree of ILP in conjunction with TLP is required by Kepler GPUs in order to approach peak single-precision performance, since SMX's warp scheduler issues one or two independent instructions from each of four warps per clock. ILP can be increased by means of, for example, processing several data items concurrently per thread or unrolling loops in the device code, though note that either of these approaches may also increase register pressure."
An SMX has 4 warp schedulers (each of which can issue up to two instructions per cycle) and 192 CUDA cores. Each issued instruction provides work to 32 CUDA cores (as each warp is 32 threads wide), so saturating all 192 cores requires at least 6 instructions issued per cycle, which is more than single-issue from 4 schedulers can supply. The only way a warp scheduler can issue more than 1 instruction per cycle is if an independent second instruction is available, i.e. if there is ILP present, hence ILP is needed to fully utilize the SMX.
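To put rough numbers on that (my arithmetic, not a quote from the guide):

single-issue: 4 schedulers x 1 instruction x 32 threads = 128 lanes fed per cycle, roughly 2/3 of the 192 cores
dual-issue: 4 schedulers x 2 instructions x 32 threads = 256 lanes fed per cycle, enough to cover all 192 cores

So without any ILP, roughly a third of the SMX sits idle every cycle, at least for pure ALU work.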
Maxwell and Pascal do not have this limitation. They still have 4 schedulers per SM, but each SM has only 128 CUDA cores (4 schedulers x 32 threads = 128), meaning that full utilization can be attained with single-issue alone, i.e. without ILP. From the Maxwell Tuning Guide:
https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html
"Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores."
This is a fundamental feature of the Kepler architecture, so why has it only become a "problem" recently? My suspicion is that developers (both game devs, and the NVidia engineers who provide optimization assistance to game devs through e.g. GameWorks) are no longer focusing on optimizing shader code for ILP. Maximising ILP requires tricks like unrolling loops or processing several data items per thread, and since newer NVidia GPUs no longer need these tricks, they aren't being applied as often.
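To illustrate what I mean by those tricks, here's a toy CUDA-style sketch (made-up kernel, not from any game; the same idea applies to HLSL/GLSL shader code). The first version has a serial dependency through the accumulator, so the scheduler never has a second independent math instruction to dual-issue; the unrolled version keeps four independent accumulators in flight.

// Naive per-thread reduction: every multiply-add depends on the previous sum,
// so there is no ILP for Kepler's schedulers to dual-issue.
__global__ void reduce_naive(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // bounds check omitted for brevity
    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += a[i * n + k] * b[i * n + k];
    out[i] = sum;
}

// Unrolled with independent accumulators: the four multiply-adds in each
// iteration don't depend on each other, giving the warp scheduler ILP to exploit.
__global__ void reduce_unrolled(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // bounds check omitted for brevity
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int k = 0;
    for (; k + 3 < n; k += 4) {
        s0 += a[i * n + k + 0] * b[i * n + k + 0];
        s1 += a[i * n + k + 1] * b[i * n + k + 1];
        s2 += a[i * n + k + 2] * b[i * n + k + 2];
        s3 += a[i * n + k + 3] * b[i * n + k + 3];
    }
    for (; k < n; ++k)  // remainder loop
        s0 += a[i * n + k] * b[i * n + k];
    out[i] = (s0 + s1) + (s2 + s3);
}

As the tuning guide notes, the cost of this is register pressure: the unrolled version needs four live accumulators instead of one.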
Could someone who knows more about GCN than me comment on instruction issue and ILP for that architecture, and whether it can reach peak throughput without ILP? I remember AMD saying that one reason they moved away from VLIW was to reduce the need for ILP in shader code, but I don't know much more than that. If GCN doesn't need ILP either, that would help explain why shaders originally written for the GCN-based consoles don't run well on Kepler.
TLDR: Kepler requires optimization tricks that newer architectures don't, and I suspect newer games are not using those tricks.