Saylick
@Saylick
SPs and CUDA cores normally have individual math units like scalar and vector units; AFAIK, they can only run one workload/thread at a time. So take the 16-wide vector ALU, for example: if it's running work that only needs a 2-8 wide ALU, it's under-performing relative to its design.
There's no multi-threading at the SP/CC level currently.
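Just to put rough numbers on that (purely illustrative, assuming a fixed 16-wide vector ALU with one thread in flight at a time):

```python
# Illustrative only: lane utilization of a fixed 16-wide vector ALU when the
# work only needs a narrower vector width. One issue slot is consumed
# regardless of how many lanes actually do useful work.
SIMD_WIDTH = 16

for needed_width in (2, 4, 8, 16):
    utilization = needed_width / SIMD_WIDTH
    idle = SIMD_WIDTH - needed_width
    print(f"{needed_width:2d}-wide work: {utilization:.0%} of lanes busy, {idle} lanes idle")
```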
NV's utilization advantage is DX11 multi-threaded rendering. Take GM200 and its 3,072 CC: most of them will be utilized because they're fed by the GigaThread engine, using multiple CPU cores to schedule the tasks.
Fury X, for example, has 4,096 SPs, but at low resolutions where threads get processed faster, the SPs basically run out of work to do because the single-threaded driver can't feed enough work to the Command Processor to keep those shaders utilized. At higher resolutions, where the workload increases and takes longer to process, more of the SPs are running at any one time, hence it scales better at higher resolutions.
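A toy way to picture that feeding problem (every number below is made up): if a single driver thread can only submit so much work per frame, short low-resolution jobs drain faster than they arrive, while long high-resolution jobs keep the shader array saturated.

```python
# Toy model of a driver-bound GPU, not real numbers: a single-threaded driver
# submits a fixed amount of work per frame slice, and the shader array's busy
# fraction is how much of its drain capacity that submission actually covers.
SUBMIT_PER_SLICE = 10_000            # hypothetical work items the driver can issue

for label, drain_capacity in (("low res  (fast jobs)", 40_000),
                              ("high res (slow jobs)", 8_000)):
    busy = min(1.0, SUBMIT_PER_SLICE / drain_capacity)
    print(f"{label}: shader array ~{busy:.0%} busy")
```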
DX12 completely solves that aspect already.
This GCN with SMT and power gating/boost is another beast altogether. Imagine Vega: take the 4,096 SPs, and now each SP is capable of keeping 2 scalar threads and up to 4 vector threads processing concurrently. That's a massive uplift in work per SP. Of course, this is the ideal maximum scenario, where there are workloads that can be distributed across 2/4/8/16-wide vector ALUs and 2 scalar ALUs. In games it won't be 6x the threads, as gaming loads tend to be repetitive and consistent. The kicker here is that compute workloads can vary more, with post effects that need various different maths.
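To put that "more work in flight per SP" idea into rough numbers (this is just the ideal ceiling described above, not measured behaviour):

```python
# Hypothetical ceiling from the post: each SP keeps 2 scalar threads and up to
# 4 narrow vector threads in flight instead of a single thread. Pure thought
# experiment using the post's own figures.
SPS = 4096

threads_today = SPS * 1                      # one thread per SP at a time
threads_best_case = SPS * (2 + 4)            # 2 scalar + 4 vector threads per SP

print(f"in-flight threads today: {threads_today:,}")
print(f"ideal best case:         {threads_best_case:,} "
      f"(x{threads_best_case // threads_today})")
```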
The cool thing is, if there's a workload that only needs an 8-wide vector ALU and 1 scalar unit, it can power down the other ALUs and boost the 8-wide and scalar units to higher clocks to finish the task faster.
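A very rough sketch of that power-shifting idea (all figures invented; it also assumes dynamic power scales roughly with the cube of frequency once voltage tracks clocks, which is only a ballpark rule of thumb):

```python
# Invented example: gate the idle half of a 16-wide ALU and spend the freed
# power budget clocking the remaining 8 lanes (plus the scalar unit) higher.
# Assumes per-lane power scales ~f^3 under voltage/frequency scaling.
TOTAL_LANES = 16
ACTIVE_LANES = 8
BASE_CLOCK_MHZ = 1500

power_per_lane = 1.0 / TOTAL_LANES                  # normalized SP power budget
headroom = 1.0 / (ACTIVE_LANES * power_per_lane)    # budget vs. power of active lanes
boost = headroom ** (1 / 3)                         # f ~ cube root of power headroom

print(f"gate {TOTAL_LANES - ACTIVE_LANES} lanes, boost the rest to "
      f"~{BASE_CLOCK_MHZ * boost:.0f} MHz (x{boost:.2f})")
```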
It's a win-win scenario and a really clever design, ON PAPER.
On paper it sounds like the next coming of Jesus for GPU tech; it really is that good. But let's see how it actually translates to reality. I'm pumped; this is the most exciting GPU uarch launch in a very long time.
I'm not entirely sure that this patent is enabling "multi-threading" at the SP level, but we may be agreeing on the same thing. I think you're right to frame the concept as an SMT-esque approach, in that SPs that would normally be underutilized are now being fed useful work, very much in line with how unused execution units in Intel's CPUs can be used by another thread when the two threads need mutually exclusive resources. If anything, I would argue that this "GPU-SMT" happens at the CU level.
With regards to DX11, we're talking about two constraints that hold back the GPU from 100% SP utilization:
1) Driver/CPU side: If the driver can't issue work as fast as the GPU can process it, you will be under-utilizing the computational resources of the GPU. This is the issue with AMD's DX11 implementation that nVidia largely avoids. DX12 will alleviate this issue.
2) Workload-dependent/GPU side: If a hypothetical workload were highly mixed and scrambled such that the GPU only receives sub-16-thread vector math, then the GPU will be under-utilizing its resources, since it was built for 16-thread-wide vector operations. The patent aims to alleviate this issue by enabling A) a more granular breakdown of vector widths, such that a mixed workload can be more easily divided amongst all ALUs without leaving resources unused (see the sketch at the end of this post), and/or B) power gating any unused ALUs and using the available power budget to clock up any ALUs doing useful work.
But again, these two issues, if resolved, only help with maximizing utilization, bringing the GPU's expected performance closer to its peak performance.
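Here's the sketch I mentioned for point A: a purely illustrative greedy packing of mixed-width vector jobs into 16-lane issue slots, versus giving every job its own slot the way a fixed 16-wide ALU would (the job mix is invented):

```python
# Illustrative only: first-fit packing of narrow vector jobs into 16-lane issue
# slots, compared with one slot per job on a fixed 16-wide ALU.
from typing import List

def packed_slots(job_widths: List[int], lanes: int = 16) -> int:
    """Greedily pack jobs (by lane count) into `lanes`-wide issue slots."""
    free: List[int] = []                     # free lanes remaining in each open slot
    for width in sorted(job_widths, reverse=True):
        for i, remaining in enumerate(free):
            if remaining >= width:
                free[i] -= width
                break
        else:
            free.append(lanes - width)       # open a new slot for this job
    return len(free)

jobs = [2, 4, 8, 2, 4, 8, 16, 2]             # hypothetical mix of vector widths
print("fixed 16-wide, one slot per job:", len(jobs), "slots")
print("packed into shared slots:       ", packed_slots(jobs), "slots")
```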
