BenchPress
Senior member
Nov 8, 2011
> It seems to me the whole point of AVX2 is to enable the CPU to perform more parallel functions, thereby widening the pipeline. Correct or not?

Pipeline width typically refers to the issue width: the number of instructions the CPU can issue in parallel. This isn't expected to change with Haswell, since increasing it is prohibitively expensive. Instead, the instructions themselves become wider. AVX2 features 256-bit vectors, which can hold eight 32-bit values, or any other combination of element count and element type that fits in 256 bits. Haswell should have three such 256-bit execution units, so 3 would be the arithmetic issue width.
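To make that concrete, here's a minimal sketch of what a "wider instruction" looks like with AVX2 intrinsics: the single `_mm256_add_epi32` below performs eight 32-bit additions at once (compile with -mavx2; the arrays and values are just illustrative):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int32_t b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int32_t c[8];

    /* Load 8 x 32-bit values into 256-bit registers (unaligned loads). */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* One AVX2 instruction performs all eight 32-bit additions. */
    __m256i vc = _mm256_add_epi32(va, vb);

    _mm256_storeu_si256((__m256i *)c, vc);
    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);   /* prints "9" eight times */
    printf("\n");
    return 0;
}
```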
GPUs typically have at least 512-bit vector units, and the issue width varies from 1/4 to 2 (a fractional width means it takes multiple cycles to process an entire vector). And while they settle for a lower clock speed, they often have more cores.
The computing density of a CPU with AVX2 won't differ very much from a GPU's, though. Which raises the question of whether we should rely on the GPU at all for general-purpose computing, or whether we just need more CPU cores. Homogeneous computing on the CPU is inherently more efficient, since the data doesn't have to be moved back and forth to/from the GPU.
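For a rough sense of that computing density, here's a back-of-the-envelope comparison using the speculative figures from this thread; the GPU numbers are placeholder assumptions, not any particular part:

```c
#include <stdio.h>

/* Peak 32-bit arithmetic throughput. All figures are illustrative. */
int main(void)
{
    /* Hypothetical AVX2 CPU: 4 cores, 3 x 256-bit units, ~3.5 GHz. */
    double cpu = 4 * 3 * (256.0 / 32) * 3.5e9;
    /* Hypothetical GPU: 16 cores, 512-bit units, issue width 2, ~1 GHz. */
    double gpu = 16 * 2 * (512.0 / 32) * 1.0e9;

    printf("CPU: %.0f Gops/s\n", cpu / 1e9);  /* ~336 */
    printf("GPU: %.0f Gops/s\n", gpu / 1e9);  /* ~512 */
    return 0;
}
```

Same order of magnitude, which is the point: the raw arithmetic gap is no longer dramatic.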
> From what I've read in this and other threads it appears that Haswell will have 64 GPU-like cores to widen the pipeline.

That would be the 'effective' number of compute cores for 32-bit data, yes. Although you have to keep in mind that it's running at about three times the clock speed of most GPUs.
> Now, unless I'm mistaken, I think these can still coexist just fine. Haswell will handle parallelism up to 64 'whatevers' wide and the GPU will take over and run anything requiring more cores than that. Make sense?

Not really. There is nothing that "requires" more cores. A single core is better than two cores at half the frequency: there's always a portion of a task that can't be parallelized, and Amdahl's law shows that additional cores yield diminishing returns (see the worked example below). GPUs have been very successful for graphics because graphics has a great deal of parallelism, but general-purpose tasks often have far less, so they're better off with fewer but faster cores.
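A quick worked example of Amdahl's law, speedup(n) = 1 / ((1 - p) + p/n) for parallel fraction p on n cores, with an assumed (and fairly generous) 90% parallel fraction:

```c
#include <stdio.h>

/* Amdahl's law: even at p = 0.9, speedup flattens out quickly. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.9;  /* illustrative: 90% of the work parallelizes */
    int cores[] = {1, 2, 4, 8, 16, 64};
    for (int i = 0; i < 6; i++)
        printf("%2d cores -> %.2fx speedup\n", cores[i], amdahl(p, cores[i]));
    /* 2 -> 1.82x, 4 -> 3.08x, 16 -> 6.40x, 64 -> 8.77x:
       each doubling of the core count buys less than the last. */
    return 0;
}
```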
Also, this isn't actually the entire picture yet. A GPU doesn't have out-of-order execution, so a thread will always stall when trying to execute dependent instructions or when loading data from memory. It's only because it runs hundreds of threads per core, which it constantly switches between, that it can achieve high throughput. But this means a GPU needs a truly massive amount of parallelism in the workload to achieve good efficiency. And because there are so many stalled threads which each need their own set of registers, the GPU can actually run out of registers when processing complex tasks.
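To put some rough numbers on that register pressure (the register file size is Fermi-class; the per-thread counts are assumptions for illustration):

```c
#include <stdio.h>

/* Why complex GPU kernels "run out of registers": a sketch with
   illustrative numbers. A Fermi-class GPU core (SM) has about
   32768 32-bit registers shared by all resident threads. */
int main(void)
{
    int regfile = 32768;      /* registers per GPU core (SM) */
    int max_threads = 1536;   /* threads the scheduler could keep in flight */

    for (int regs_per_thread = 16; regs_per_thread <= 64; regs_per_thread *= 2) {
        int resident = regfile / regs_per_thread;
        if (resident > max_threads) resident = max_threads;
        printf("%2d regs/thread -> %4d resident threads\n",
               regs_per_thread, resident);
    }
    /* 16 -> 1536 (capped), 32 -> 1024, 64 -> 512: a complex kernel
       leaves fewer threads to switch between, so stalls stop being hidden. */
    return 0;
}
```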
What's more, these threads have to share the caches too, and so the data locality is pretty horrible, often leading to slow RAM accesses instead of reading things from cache. Fortunately discrete GPUs have lots of bandwidth, but that's not the case for APUs. There are clever compression techniques for graphics, but not for general purpose workloads.
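The cache math is even worse. An illustrative comparison, assuming a 16 KB L1 shared by all threads on a GPU core versus a 32 KB L1 shared by two SMT threads on a CPU core:

```c
#include <stdio.h>

/* How little cache each GPU thread effectively gets (sizes assumed). */
int main(void)
{
    printf("GPU: %5d bytes of L1 per thread\n", 16 * 1024 / 1536); /* ~10 */
    printf("CPU: %5d bytes of L1 per thread\n", 32 * 1024 / 2);    /* 16384 */
    return 0;
}
```

With roughly ten bytes of L1 per thread, it's no surprise that GPU threads keep falling through to RAM.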
> Also, one question - how much advantage do AMD's APUs get by being on-die with the GPU? In other words, if you compared an APU to an equivalent discrete GPU/CPU pair (same cores/speeds) how much better throughput would the APU offer?

This greatly depends on the task. Graphics essentially doesn't have to communicate anything back to the CPU; the result is sent straight to your monitor. But something like physics calculations has to be retrieved back by the CPU, and there can be an order of magnitude difference in performance between an APU and an equivalent discrete GPU.
Either way, homogeneous computing with AVX2 doesn't suffer at all from the latency or bandwidth bottlenecks that GPGPU faces.
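To put a rough number on that bottleneck, assuming PCIe 2.0 x16 at ~6 GB/s sustained and purely illustrative data sizes:

```c
#include <stdio.h>

/* The PCIe round trip that homogeneous AVX2 code never pays.
   All bandwidth figures and data sizes are assumptions. */
int main(void)
{
    double bytes = 100e6;    /* say, 100 MB of physics results */
    double pcie_bw = 6e9;    /* bytes/s over PCIe 2.0 x16 (assumed) */
    double ram_bw = 20e9;    /* bytes/s of CPU memory bandwidth (assumed) */

    printf("GPU -> CPU copy:    %.1f ms\n", bytes / pcie_bw * 1e3); /* ~16.7 */
    printf("CPU read from RAM:  %.1f ms\n", bytes / ram_bw * 1e3);  /* ~5.0 */
    /* The copy is pure overhead on top of the GPU compute time,
       paid in both directions for any data the CPU also touches. */
    return 0;
}
```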