BTW, what is the base of the 40% increase? Average, lowest, or highest XV IPC in many or some benchmarks? Or SPEC CPU as usual?
Doing two 256b AVX ops per cycle vs. 128b doesn't increase IPC (technically).
Oh, I wish we knew!
I've been working entirely on the assumption that IPC is approximately equal to nominal instructions retired per cycle, which is the easiest way for a CPU engineer to envision performance.
Hopefully the design doesn't require SMT to be at play to realize that 40% boost. If so, single threaded performance could see a mere 10% or so boost... but, we'd get that just from dropping the module pipeline overhead... and we'd get more still from going with wider cores... So it would only make sense that AMD was aiming at a true 40% per core, single threaded, per cycle improvement.
On the FPU front, doubling AMD's FPU itself allows more operations to be executed at once.
It's really all about the addressable execution units, rather than an "FPU." Bulldozer's FPU is said to be 128x2, but is, in reality, 4x64 FMAC + 4x64 FADD...
Each 64-bit unit can perform one operation on two 32-bit floating point values, or on one 64-bit float. But it really all comes down to what the FPU scheduler can support, as well as the capabilities of the FPU's result storage mechanism. Those have long been able to handle their execution units pretty optimally, though.
So, Bulldozer can, in theory, do the following with the same complexity:
8x 32-bit FLOPs
4x 64-bit FLOPs
2x 128-bit FLOPs
1x 256-bit FLOPs
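The list above is just the total datapath width divided by the operand width. A minimal sketch of that arithmetic, assuming the 256-bit total width described in this post (the name `lanes_per_cycle` is made up for illustration, and this counts idealized lanes, not measured throughput):

```python
def lanes_per_cycle(total_bits, op_bits):
    """Idealized number of op_bits-wide operations that fit in total_bits per cycle."""
    if op_bits <= 0 or total_bits % op_bits != 0:
        raise ValueError("operand width must evenly divide the datapath width")
    return total_bits // op_bits


# Assumption taken from the post: Bulldozer's FPU treated as 256 bits wide in total.
BULLDOZER_FPU_BITS = 256

for width in (32, 64, 128, 256):
    print(f"{lanes_per_cycle(BULLDOZER_FPU_BITS, width)}x {width}-bit")
```

Plugging in a different total width gives the matching list for any other hypothetical FPU.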
Due to the expense of ganging the two halves of the 256-bit FPU, which might be split between two threads in a CMT design, I would suspect there would be another cycle lost to lock the other half for execution. The rest should have the same cost.
Of course, not all FLOPs are created equal; FADD vs FMAC units are a whole other discussion.
Zen, if it is 512-bit, will likely not have the ganging overhead, as I'd suspect the SMT overhead would be handled with a flag in the pipeline. It should be capable of the following:
16x 32-bit FLOPs
8x 64-bit FLOPs
4x 128-bit FLOPs
2x 256-bit FLOPs
1x 512-bit FLOPs
This would put it on par with Intel, capability-wise. Bulldozer's FPU is on par with Sandy Bridge's, believe it or not, but it has so much module overhead (more than I talk about here) that things can get quite ugly, quickly.
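To put the two lane lists side by side, here is a toy cycle-count sketch. Everything in it is a guess carried over from this post rather than a measurement: a 256-bit Bulldozer FPU that loses one extra cycle per full-width op to ganging, and a hypothetical 512-bit Zen FPU with no such penalty. The function name `issue_cycles` and the `gang_penalty` parameter are invented for illustration, and charging the penalty once per op is just one way to read "another cycle lost."

```python
from math import ceil


def issue_cycles(n_ops, op_bits, fpu_bits, gang_penalty=0):
    """Toy estimate of cycles to issue n_ops independent ops of op_bits each.

    gang_penalty is a hypothetical extra cycle count charged once per
    full-width op (op_bits == fpu_bits), modelling the guessed lock cost.
    """
    if fpu_bits % op_bits != 0:
        raise ValueError("operand width must evenly divide the FPU width")
    per_cycle = fpu_bits // op_bits
    cycles = ceil(n_ops / per_cycle)
    if op_bits == fpu_bits:
        cycles += gang_penalty * n_ops
    return cycles


# Assumed figures from the post, not measurements.
bulldozer = dict(fpu_bits=256, gang_penalty=1)
zen_guess = dict(fpu_bits=512, gang_penalty=0)

for width in (32, 64, 128, 256):
    bd = issue_cycles(16, width, **bulldozer)
    zn = issue_cycles(16, width, **zen_guess)
    print(f"16 x {width}-bit ops: Bulldozer {bd} cycles, Zen (guess) {zn} cycles")
```

It ignores the FADD vs FMAC split, scheduler limits, and module overhead entirely, so treat the numbers as a way to see the shape of the argument and nothing more.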