Alright, I've collected some hard data to get a better picture of how homogeneous and heterogeneous computing will compare...
Sandy/Ivy Bridge's emulation of a gather operation currently takes:
- 8 uops on port 0 (ALU)
- 6 uops on port 1 (ALU)
- 4 uops on port 2 (load)
- 4 uops on port 3 (load)
- 0 uops on port 4 (store)
- 8 uops on port 5 (ALU)
Note that all the ALU ports are pretty swamped. The best-case reciprocal throughput is 8 cycles, set by ports 0 and 5, and there are only two unused cycles on port 1. In theory this means you should be able to squeeze in something like two additions, but in practice everything I tested caused the throughput to drop significantly. It would also be very rare in practice to be able to freely add more arithmetic work, even more so work that only uses port 1.
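To make the port arithmetic explicit, here's a minimal sketch in C. It assumes each port issues at most one uop per cycle, so the busiest port sets the floor; the function names are made up for illustration and the table just copies the counts listed above.

```c
#include <stddef.h>

/* Uops per execution port for one emulated gather, copied from the
   measurements above. Index = port number (0..5). */
static const unsigned gather_uops[6] = { 8, 6, 4, 4, 0, 8 };

/* Best-case reciprocal throughput in cycles. Assumption: every port issues
   at most one uop per cycle, so the busiest port is the bottleneck. */
unsigned bottleneck_cycles(const unsigned *uops, size_t ports)
{
    unsigned worst = 0;
    for (size_t p = 0; p < ports; ++p)
        if (uops[p] > worst)
            worst = uops[p];
    return worst;
}

/* Idle issue slots on one port during the bottleneck window. */
unsigned free_slots(const unsigned *uops, size_t ports, size_t port)
{
    return bottleneck_cycles(uops, ports) - uops[port];
}
```

bottleneck_cycles(gather_uops, 6) gives the 8 cycles, and free_slots(gather_uops, 6, 1) gives the two spare slots on port 1. This is only a lower bound from port counts; it says nothing about scheduling conflicts, which is probably why the spare slots were unusable in my tests.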
So this shows that emulating gather has poor throughput, and no matter how Haswell implements dedicated gather support, it will be a profound improvement. The worst possible implementation I can imagine just uses the two scalar load ports to achieve a reciprocal throughput of 4 cycles, which is twice as fast as the emulation and lets Haswell's gather performance keep up with the doubling of the arithmetic throughput. So the worst option is still good enough.
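For reference, this is what a gather does per element, plus the worst-case cycle count when every element needs its own load. It's a rough sketch assuming eight 32-bit elements per 256-bit register (which matches the eight load uops measured above) and one load per port per cycle; gather8_ref and worst_case_cycles are hypothetical names, not anything from Intel.

```c
#include <stdint.h>

#define LANES 8  /* assumed: eight 32-bit elements in one 256-bit register */

/* Scalar reference for an unmasked gather: one independent load per lane. */
void gather8_ref(float dst[LANES], const float *base, const int32_t idx[LANES])
{
    for (int i = 0; i < LANES; ++i)
        dst[i] = base[idx[i]];
}

/* Worst case: every lane costs one scalar load, spread over the load ports.
   Assumption: each load port completes one load per cycle. */
unsigned worst_case_cycles(unsigned lanes, unsigned load_ports)
{
    return (lanes + load_ports - 1) / load_ports;  /* round up */
}
```

worst_case_cycles(LANES, 2) is the 4 cycles mentioned above, versus 8 for today's emulation.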
But given that Intel focuses on performance/Watt, it doesn't make sense to always fetch 4096 bits' worth of cache lines (eight 512-bit lines) for a 256-bit gather. It also wouldn't make sense to send the 256-bit index register data to each load port, four times. And Larrabee/MIC was presented as getting support for collecting any number of elements from one cache line each cycle. Hence I'm sticking with the widely supported speculation that Haswell will have a single 256-bit gather port (and a 256-bit regular load port) capable of sustaining one gather instruction each cycle when the elements are in the same cache line. It could use the second load port to pass back the updated mask register!
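Under that speculation the cost of a gather would scale with the number of distinct cache lines it touches. Here's a minimal sketch of that count in C, assuming 64-byte lines, eight lanes, non-negative indices, and elements that don't straddle a line boundary (only the line holding each element's first byte is counted). distinct_lines is a made-up helper, and one line per cycle is my guess, not documented hardware behavior.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES      8   /* assumed: eight 32-bit elements per 256-bit gather */
#define LINE_BYTES 64  /* 512-bit cache line */

/* Number of distinct cache lines touched by one gather. Under the
   one-line-per-cycle speculation, this is also its cycle count.
   Simplification: an element is attributed to the line of its first byte. */
unsigned distinct_lines(uint64_t base, const uint32_t idx[LANES], size_t elem_size)
{
    uint64_t seen[LANES];
    unsigned count = 0;

    for (int i = 0; i < LANES; ++i) {
        uint64_t line = (base + (uint64_t)idx[i] * elem_size) / LINE_BYTES;
        int known = 0;
        for (unsigned j = 0; j < count; ++j) {
            if (seen[j] == line) {
                known = 1;
                break;
            }
        }
        if (!known)
            seen[count++] = line;
    }
    return count;
}
```

Eight consecutive 4-byte elements starting on a line boundary give 1, so a single cycle, while indices spaced 16 elements apart give 8, which is the 8 x 512 = 4096-bit case from above.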
I've seriously considered the possibility that they'd use the permute unit (currently bound to port 5), but it's not clear how it could support unaligned elements, write back the mask register, or issue a varying number of load operations.
So feel free to prove me wrong with other hard data or by presenting an alternative gather implementation which supports all its features, but there are strong arguments that Haswell's gather implementation will be much more efficient than emulating it as done today.