they probably tried but the Cell is either (1) not much more powerful than the 3.2 GHz tri-core PowerPC Xenon in the Xbox 360, or (2) extremely difficult and expensive to code for, making it hard to actually achieve that theoretical performance.
It's both, but #2 is more true. You can't just throw a huge team at it and get better code; it takes more expertise and a ton more time instead, and it would be hard to make something generally applicable in an engine, versus per-game. The small local memory and high latencies are performance killers, and they require much more time and specialized work to fix, but a whole lot less to just ignore. When a developer hits on a suitable test case, the Cell can help make things better. But that's as much luck as anything else.
Partly because of that, most console physics is done using the main core's AltiVec (VMX), which is infinitely easier to program, can make good use of the high clock speeds, and has been superior to x86's vector extensions per core, per clock, for a very long time (one of several good reasons not to use x86, though AVX2 has some serious promise). It's probably also only minimally affected by the small caches, though I'm not 100% on that.
For GFX, there is also (3): Xenos was generally superior to RSX, and its eDRAM, despite its limited uses, can do a lot to make up for the limited memory bandwidth.
i am really under the impression that the APUs were inspired by cell...
Nah. The Cell is a miniaturized mainframe all over again, with multiple statically allocated SPUs and no hardware multitasking (S as in special-function, not synergistic). Like narrow in-order cores and unusual memory schemes, ideas that let the hardware be stupid and fast keep surfacing and getting tested in the market. Then somebody ships one, learns why it hadn't been done before, and it doesn't happen again for another 10-15 years (by the same group of management, anyway). I think it could have evolved into something much better (smaller process nodes would have allowed a lot to be added, while still shrinking total die size), but the bad PR from the PS3 pretty much killed it.
Fusion came about because AMD did not have the high-efficiency vector-processing expertise to do it in their CPUs, so it is more efficient to get there by merging CPU and GPU over time, rather than building out from their CPU. Ultimately, the idea is that regular processing and graphics will differ only in that running graphics code powers on raster-only units on the CPU, and probably uses a different algorithm for prioritizing DRAM accesses.
Not long after researchers started hacking GPUs for parallel computing, it was a foregone conclusion that the GPU itself would eventually become an artifact of large processors, not unlike outboard FPUs, MMUs, caches, etc. The question is how to do it with limited knowledge and resources.
maybe, with AVX-2, haswell will have the same power of those SPEs (or better)...
In practice, it absolutely will. Any C/C++ programmer who knows loop X can run fully in parallel can make that happen, and there are none of the downsides the Cell had: local memory, shared high-latency buses, and low-bandwidth unit-to-unit communication. I've argued that for existing code, incidental performance improvements won't be much higher than in the past, because compilers are really stupid. But for any new code written with AVX2 support in mind, it'll be duck soup...or at least worth the effort, if it's more difficult.