Yes, I remember this. Many thought it would have been better if Intel hadn't published it. "GPUs aren't 10 to 100X faster than our i7 960 CPUs, they are only 2.5x faster." Mind you, that was addressing a GTX 280 (GT200), not even Fermi. In hindsight, Intel probably wishes they hadn't released it.
Yeah... I am not technically inclined enough to know specifically what they were getting at, and since I cannot find a free copy I can't even read it. My guess would be that over time that 2.5x would start to spiral upwards, at least until Intel gets on its own GPU bandwagon, at which point it can say: well, we saw that coming, so now we've added it! No pie on its face that way (or less of it, anyway).
On a general note, it is not surprising that GPUs are faster for certain kinds of workloads, but I would disagree with Scali when he says, "But I'd say on average the GPU has already won the performance/watt over CPUs."
GPUs only win the performance-per-watt award when the workload is the type they excel at; when it is not, they fail miserably. Computing tasks will always be some combination of truly serial work, mostly serial work where some portions can be parallelized with a lot of coding effort, and work that is embarrassingly parallel. It is the rare real-world data set that is exclusively one of these (although rasterization is one example that is embarrassingly parallel).
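To put a rough number on why that mix matters, here is Amdahl's law with figures I made up (nothing from the Intel paper): even a huge speedup on the parallel portion is capped by the serial portion,

$$ S = \frac{1}{(1 - p) + p/s} $$

where p is the parallelizable fraction of the runtime and s is the GPU's speedup on that fraction. With p = 0.9 and s = 100, you get S = 1 / (0.1 + 0.009) ≈ 9.2x overall, nowhere near 100x.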
For the current crop of GPUs to be monsters, the data has to have certain features:
1. Parallelism throughout most or all of the data set. Workloads where only a portion is parallel do not map well and incur significant performance penalties when run on a GPU.
2. Little or no branching; or
3. If they do branch, the threads need to branch the same way. Because the shaders (i.e. "CUDA cores") execute in lockstep groups (warps), a branch that diverges within a group forces both paths to run and causes a major stall (see the first sketch after this list). Not to mention that most GPUs lack the branch prediction, misprediction recovery, and buffering logic that CPUs have.
4. The data needs to fit into the RAM on the GPU card. Latency and access penalties for reaching back to host memory are very high for a GPU compared to a CPU right now (see the second sketch after this list). That is why the Tesla cards have several GB of very fast RAM onboard. But there is a limit, and it makes the cost skyrocket: Intel server systems with 32 GB of RAM are easy, 32 GB Tesla cards... not so much.
5. The problem should be single precision, at least until double-precision hardware catches up over the next few development cycles.
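To make point 3 concrete, here is a minimal CUDA sketch (the kernel names and the even/odd split are my own toy example, not anything from the article). Both kernels do the same math; the first branches per thread, so every 32-thread warp splits and executes both paths serially, while the second branches per block, so no warp ever diverges. Timing them with a profiler should show the divergent version losing throughput:

```cuda
// Toy illustration of warp divergence (point 3). Kernel names and the
// even/odd split are mine, not from the article.
#include <vector>
#include <cuda_runtime.h>

__global__ void divergent(const int *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // The branch depends on the thread, so each 32-thread warp splits:
    // it runs path A with half its threads masked off, then path B
    // with the other half masked off -- both paths execute serially.
    if (in[i] % 2 == 0)
        out[i] = sinf((float)in[i]);   // path A
    else
        out[i] = cosf((float)in[i]);   // path B
}

__global__ void uniform(const int *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Same work, but the branch depends on the block, so every thread
    // in a warp takes the same path and nothing diverges.
    if (blockIdx.x % 2 == 0)
        out[i] = sinf((float)in[i]);
    else
        out[i] = cosf((float)in[i]);
}

int main()
{
    const int n = 1 << 20;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;      // even/odd alternates inside every warp

    int *d_in; float *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    divergent<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // every warp splits 16/16
    uniform<<<(n + 255) / 256, 256>>>(d_in, d_out, n);    // no warp ever splits
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```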
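And a second quick sketch for point 4 (the 1 GiB size is an arbitrary number I picked): any data that does not already live in the card's RAM has to cross PCIe first, and that copy alone can dwarf the compute time; if cudaMalloc fails, the working set simply does not fit.

```cuda
// Toy timing for the host->device copy (point 4).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t n = size_t(256) << 20;            // 256M floats = 1 GiB
    std::vector<float> h(n, 1.0f);

    float *d;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        printf("Working set doesn't fit in GPU RAM -- which is exactly the problem.\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("1 GiB host->device copy: %.1f ms (%.2f GiB/s)\n", ms, 1.0 / (ms / 1000.0));

    cudaFree(d);
    return 0;
}
```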
So data that fits these rough parameters will run exceptionally well on a GPU; most other workloads will not. For example, try using a GPU as your primary processor for a relational database and it will work so poorly that the CPU is the one with a multiple-times performance advantage.
Or even look at certain specialized chips from TI or others: they can build custom, specialized silicon that runs a gigabit router and switches that traffic with ease. Try replacing that switch with an i7 or Fermi and watch your switch fail.