That is only the case for very short loops. Longer code, such as OpenCL kernels, typically has critical paths the out-of-order execution engine can't schedule around. Sandy Bridge has a 54-entry scheduler, and while that may seem sufficient, keep in mind that the average arithmetic instruction latency is about 4 cycles and you're really aiming for 3 instructions per cycle. Last but not least, cache misses severely limit the number of instructions that can be scheduled. Prefetching attempts to avoid them, but it comes at a cost in power consumption.
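To put rough numbers on that: by Little's law, instructions in flight = throughput x latency, so sustaining 3 instructions per cycle at 4 cycles of latency already takes 12 independent instructions in flight. A last-level cache miss costing on the order of 200 cycles would need hundreds of independent instructions to cover, which a 54-entry scheduler can't come close to providing.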
GPUs hardly suffer from this because they hide arithmetic latency by executing very wide vectors over multiple cycles. AVX-1024 would achieve the same thing for CPUs. The sustained throughput would be higher, while the power consumption would be lower.
The Intel compiler can do software pipelining (SWP) and unrolling to hide latencies. And I'm pretty sure LLVM-based compilers have similar capabilities.
That said, it's not very effective in practice because you easily run out of registers, and the additional spill/restore instructions negate what you gain from latency hiding. Adding more architectural registers to x86 is not feasible, but with AVX-1024 you wouldn't have to: each instruction would implicitly access four 256-bit registers.
Take for example this pseudo-code:
Code:
for(i = 0; i < 1000; i++)
{
    t = a[i] * b[i];    // Takes 4 cycles to complete
    c[i] = t + d[i];    // Has to wait on the multiplication
}
For simplicity's sake there are only two operations, but assume that in a real-world example there's lots more code inside the loop and the out-of-order execution engine can't overlap full iterations due to long dependency chains. To hide the latency of the dependency in our simplified example, let's unroll the loop four times:
Code:
for(i = 0; i < 1000; i += 4)
{
    t0 = a[i+0] * b[i+0];    // Four independent multiplications issue back to back
    t1 = a[i+1] * b[i+1];
    t2 = a[i+2] * b[i+2];
    t3 = a[i+3] * b[i+3];
    c[i+0] = t0 + d[i+0];    // By now t0 is ready, so no stall
    c[i+1] = t1 + d[i+1];
    c[i+2] = t2 + d[i+2];
    c[i+3] = t3 + d[i+3];
}
There's no more waiting required. Note however that more temporary variables are used. In a more complex real-world example, you'd easily exceed the available architectural register count. Thus there would be extra store and load instructions required for spilling registers, which make things slower again.
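To make the spilling concrete, here's a rough sketch of what the compiler effectively ends up generating once the live temporaries exceed the register count (purely illustrative; the local array stands in for stack slots, and in reality this happens at higher unroll factors or with more live values per iteration):
Code:
for(i = 0; i < 1000; i += 4)
{
    float spill[4];                  // stack slots standing in for registers
    spill[0] = a[i+0] * b[i+0];      // each result now costs an extra store...
    spill[1] = a[i+1] * b[i+1];
    spill[2] = a[i+2] * b[i+2];
    spill[3] = a[i+3] * b[i+3];
    c[i+0] = spill[0] + d[i+0];      // ...and an extra load to read it back
    c[i+1] = spill[1] + d[i+1];
    c[i+2] = spill[2] + d[i+2];
    c[i+3] = spill[3] + d[i+3];
}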
To not require more registers, we could instead use vector registers to do the same thing:
Code:
for(i = 0; i < 1000; i += 4)
{
    v = a[i+0..3] * b[i+0..3];    // Vector multiplication
    c[i+0..3] = v + d[i+0..3];    // Vector addition, has to wait
}
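For the curious, here's roughly what that vector loop looks like with real AVX intrinsics. This is just a sketch: the function name is made up, the arrays are assumed to hold single-precision floats, and n is assumed to be a multiple of 8 (note that a 256-bit AVX register actually holds 8 floats, not 4; the pseudo-code uses 4 lanes only to keep the illustration small):
Code:
#include <immintrin.h>

void mul_add_arrays(const float *a, const float *b,
                    const float *d, float *c, int n)
{
    for(int i = 0; i < n; i += 8)    // 8 floats per 256-bit register
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vd = _mm256_loadu_ps(d + i);
        __m256 v  = _mm256_mul_ps(va, vb);    // vector multiplication
        __m256 vc = _mm256_add_ps(v, vd);     // dependent vector addition
        _mm256_storeu_ps(c + i, vc);
    }
}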
Great, the number of registers has been reduced and we're doing four times more work in parallel. But now we're back to the issue of having to hide the latency! We could unroll again, but that would require more registers...
...unless the unrolling is done at the hardware level. Take this code:
Code:
for(i = 0; i < 1000; i += 16)
{
    v = a[i+0..15] * b[i+0..15];    // 16-wide multiply, executed 4 lanes per cycle
    c[i+0..15] = v + d[i+0..15];    // Addition starts as soon as the first 4 products are ready
}
The vectors are made four times wider again, but instead of performing 16 multiplications or additions in parallel, only 4 are performed each cycle. This means that by the time the first 4 multiplications have finished, the vector addition can start on the first 4 values, and every subsequent cycle the results are ready for the next 4 values. It's very similar to the unrolled case above, except that it doesn't explicitly use four separate instructions and four separate registers. It just uses one very wide register, which at the hardware level can of course be implemented as 4 individual 256-bit registers.
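To visualize the staggering, here's a simplified cycle-by-cycle view (assuming 256-bit execution units, i.e. 4 lanes of the pseudo-code per cycle, and the 4-cycle multiply latency from before):
Code:
Cycle 0: start mul of lanes  0..3
Cycle 1: start mul of lanes  4..7
Cycle 2: start mul of lanes  8..11
Cycle 3: start mul of lanes 12..15
Cycle 4: lanes  0..3  products ready -> start add of lanes  0..3
Cycle 5: lanes  4..7  products ready -> start add of lanes  4..7
Cycle 6: lanes  8..11 products ready -> start add of lanes  8..11
Cycle 7: lanes 12..15 products ready -> start add of lanes 12..15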
x86-64 only has 16 architectural vector registers visible to the software, but Sandy Bridge, for instance, has 144 renamed 256-bit physical registers. AVX-1024 would implicitly let the software access 64 256-bit registers (16 architectural x 4) for latency hiding, without requiring any additional spill code.
No, the front-end contains several queues, and as long as those queues are sufficiently full there's no risk of creating bubbles further down the pipeline. Of course the queues themselves should not be clock gated. An AVX-1024 instruction would take the same time to execute as four AVX-256 instructions, so the execution units are kept busy even if the front-end is regularly clock gated.
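To put a rough number on it: if an AVX-1024 instruction occupies a 256-bit execution unit for 4 cycles, the front-end only needs to deliver a quarter as many instructions for the same amount of work, so in the limit it could be clock gated roughly three out of every four cycles without starving the back-end.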