If you understand the rationale behind it, it's a beautiful concept: increase integer horsepower and decrease floating-point horsepower, since those loads can be offloaded to the GPU...
That is a completely wrong assumption on multiple levels. First of all, you can't take an application and just magically "offload" the work somewhere else; it takes far more than simply recompiling it. Developers have to put real effort into splitting the work between the CPU and GPU, keeping things properly synchronized, minimizing data transfers, and maximizing concurrency. None of that is a problem with homogeneous computing. Now, these hurdles with heterogeneous computing could be acceptable if there were a lot to gain from it, but in reality the computing power of the average GPU really isn't much higher than that of the CPU, if at all. For instance, the mainstream Haswell GT2 will have close to 500 GFLOPS of computing power on the CPU cores alone, which is more than its iGPU!
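Rough arithmetic behind that CPU-side number, assuming a quad-core part with two 256-bit FMA units per core at ~3.5 GHz (my assumptions for the mainstream SKU, not official specs):

```c
/* Back-of-the-envelope peak single-precision FLOPS for a quad-core Haswell:
 *   2 FMA units/core x 8 floats (256-bit AVX) x 2 ops per FMA = 32 FLOP/cycle/core
 *   32 FLOP/cycle x 4 cores x ~3.5 GHz = ~448 GFLOPS
 */
#include <stdio.h>

int main(void) {
    const double fma_units = 2;   /* 256-bit FMA pipes per Haswell core */
    const double lanes     = 8;   /* 32-bit floats per 256-bit vector   */
    const double flops_fma = 2;   /* an FMA counts as a mul plus an add */
    const double cores     = 4;   /* assumed quad-core part             */
    const double ghz       = 3.5; /* assumed clock                      */
    printf("peak: %.0f GFLOPS\n", fma_units * lanes * flops_fma * cores * ghz);
    return 0;
}
```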
So the huge mistake AMD is making is thinking that GPU silicon is somehow completely different from CPU silicon. It's not. Intel proves that it's perfectly possible to increase the throughput of CPU cores by using wider SIMD vectors and features like gather support. These are GPU-like features, but without the disadvantages of heterogeneous computing. On top of that, Haswell also strengthens multi-core scalability with TSX support.
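To make the gather point concrete, here's a minimal sketch using the AVX2 intrinsics (the toy table and indices are just hypothetical illustration): one gather instruction does the work of eight scalar loads, which is exactly the kind of access pattern GPUs are good at.

```c
/* Minimal AVX2 gather sketch: load 8 floats from arbitrary indices in one
 * instruction instead of 8 scalar loads. Requires a Haswell-class CPU;
 * compile with e.g. gcc -mavx2. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float table[16];
    for (int i = 0; i < 16; i++) table[i] = (float)i * 10.0f;

    /* eight arbitrary indices into the table */
    __m256i idx = _mm256_setr_epi32(3, 1, 4, 1, 5, 9, 2, 6);

    /* gather table[idx[0..7]] in a single instruction (scale = 4 bytes) */
    __m256 v = _mm256_i32gather_ps(table, idx, 4);

    float out[8];
    _mm256_storeu_ps(out, v);
    for (int i = 0; i < 8; i++) printf("%.0f ", out[i]);  /* 30 10 40 ... */
    printf("\n");
    return 0;
}
```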
Unfortunately, while the engineering department delivered, the developer relations team didn't. By the time BD arrived, Fusion wasn't cohesive enough: OpenCL was just gaining traction, and many high-profile applications were still poorly threaded. The software to make BD shine simply wasn't there. Couple that with a very strong CPU from the competition, and the weak spots of the new concept were exacerbated. But the software is getting there. More and more applications are becoming better threaded, or better yet, using OpenCL.
Don't blame developers. They just go for the highest ROI, so they won't adopt an awkward architecture that demands they put effort into offloading workloads and that has a minor market share, while the competition offers a solution with a higher chance of success that requires a mere recompile.
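To illustrate the "mere recompile" contrast: a plain scalar loop like the hypothetical one below needs zero source changes to benefit, since the compiler vectorizes it as soon as you switch the target flag. No kernel rewrite, no data-transfer management.

```c
/* Plain scalar C; no intrinsics, no OpenCL. Recompiling with
 * gcc -O3 -march=haswell (or -mavx2 -mfma on older compilers) is enough
 * for the compiler to emit AVX2/FMA code for this loop -- the
 * "mere recompile" in question. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```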
And no, this situation isn't changing in their favor. Software may be getting more threaded, but Amdahl's Law shows that the sequential portion of the code always holds things back. So single-threaded performance is still hugely important for scaling things in the future. Intel's solution is a very wide core which can run either one thread or two threads very efficiently. TSX is also going to be indispensable for efficient fine-grained concurrency. Lastly, the adoption of OpenCL doesn't favor AMD over Intel either. Haswell's gather support makes the CPU cores very efficient at executing OpenCL code (not to mention AVX2 and FMA, which double the floating-point throughput), and because it's homogeneous, it doesn't suffer from moving data back and forth between the CPU and GPU.
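To put Amdahl's Law in concrete numbers (the 10% serial fraction below is just an assumption for illustration): even a modest sequential portion caps the speedup no matter how many cores, or GPU lanes, you throw at it.

```c
/* Amdahl's Law: speedup on n cores with serial fraction s is
 *   S(n) = 1 / (s + (1 - s) / n)
 * With 10% serial code the speedup can never exceed 10x, which is why
 * single-threaded performance keeps mattering. */
#include <stdio.h>

static double amdahl(double s, double n) { return 1.0 / (s + (1.0 - s) / n); }

int main(void) {
    const double s = 0.10;  /* assumed 10% inherently sequential code */
    for (int n = 1; n <= 64; n *= 2)
        printf("%2d cores -> %5.2fx speedup\n", n, amdahl(s, (double)n));
    return 0;  /* 64 cores still only reach ~8.8x */
}
```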
I wonder if we would be having this discussion had the developer relations team at AMD done a better job helping optimize software.
Help them how? By cheering? The only thing that would have helped is giving developers cold hard cash to cover the cost of optimizing for a less developer-friendly architecture. Not exactly a great idea to spend their money that way, especially given their position. And due to the inherent inefficiency of heterogeneous computing and the very small difference in computing density, there's no guarantee at all that the software would actually have been faster.
Given the hardware they had to deal with, I think the developer relations team did everything in their power. It just was, and still is, a shitty architecture. Mark my words: continuing to pursue HSA while ignoring AVX2+/TSX would be the death of AMD. Even if they change their mind, the damage done since buying ATI and chasing that pipe dream might already be irreparable.