Bulldozer's shared floating-point cluster with FMA support indeed offers some advantages with legacy 128-bit instructions.
But that advantage will be short-lived once Intel adds FMA support. It would give Intel the same flexibility, and double the peak throughput at the same time. And once applications use 256-bit instructions, it doubles again!
Even if AMD doubles the width of each ALU to 256-bit, it still wouldn't match that peak throughput per core. Also note that if the IGP is removed from Sandy Bridge, it creates room for two more cores at no extra cost (to Intel). So they can launch such an enthusiast part soon after Bulldozer, and steal AMD's thunder.
BD has double the FP peak throughput of SB when legacy SSE and
AVX128 code is used, and is as powerful with AVX256.
Once Intel adds FMA? On the fly?
That will only be the case with Haswell, which is two years away...
By that time, AMD can double their FPU cluster, which will
then be 2x256-bit, since it's already a 256-bit/one-cycle FPU.
Also, it's not possible to run code using only AVX256, whatever
the compiler's efficiency, so I guess that with code being mixed
and interleaved, BD will be on par with a 6C SBE even on FP.
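
For what it's worth, the peak figures being argued over here can be put into a back-of-envelope sketch. The function name and the unit counts below are assumptions for illustration (two 128-bit FMA pipes per Bulldozer module, one add pipe plus one multiply pipe per Sandy Bridge core), not vendor-verified numbers, and it only counts single-precision operations per cycle, ignoring clock speed and core count.

```python
def peak_flops_per_cycle(units, width_bits, fma, element_bits=32):
    """Peak floating-point operations per cycle for a group of SIMD pipes.

    units        -- number of SIMD pipes that can issue every cycle
    width_bits   -- vector width each pipe handles per cycle
    fma          -- True if each pipe does a fused multiply-add (2 ops per lane)
    element_bits -- 32 for single precision, 64 for double precision
    """
    lanes = width_bits // element_bits
    ops_per_lane = 2 if fma else 1
    return units * lanes * ops_per_lane


# Assumed configurations, for illustration only.
# Bulldozer module: two 128-bit FMA pipes shared by the module's two cores.
bd_module_fma128 = peak_flops_per_cycle(units=2, width_bits=128, fma=True)
# Sandy Bridge core: one add pipe + one multiply pipe, no FMA.
sb_core_128 = peak_flops_per_cycle(units=2, width_bits=128, fma=False)
sb_core_256 = peak_flops_per_cycle(units=2, width_bits=256, fma=False)
# Hypothetical future Intel core with FMA on both 256-bit pipes.
intel_core_fma256 = peak_flops_per_cycle(units=2, width_bits=256, fma=True)

print("BD module, 128-bit FMA:   ", bd_module_fma128)
print("SB core, 128-bit code:    ", sb_core_128)
print("SB core, AVX256:          ", sb_core_256)
print("Intel core, 256-bit FMA:  ", intel_core_fma256)
```

Under these assumptions the Bulldozer number is per module (two cores sharing the FPU), and it only reaches that figure when the 128-bit code actually uses FMA.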
