And to further clarify directly:
Also, in my opinion, there are some features AMD has downplayed so far. AMD obviously doesn't just have 2 FPU pipes and 2 MMX pipes: those "MMX" pipes don't do MMX, they are full 128-bit integer SSE pipelines
(true).
So all register moves and loads/stores can also be executed in those two pipelines
(not really. Reg-reg moves for SSE and AVX-128 can be done with mov-elimination.
Load: doesn't actually require an execution pipe in the FP at all, but is limited to a max throughput of 2 128b loads/cycle.
Store: does take an execution pipe, but can only execute down 1 of the pipes. That and LS (load/store) restrictions limit it to a throughput of 1 128b store/cycle)
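To put those load/store limits into numbers, here is a minimal back-of-the-envelope sketch in Python. It only uses the two figures quoted in the reply (2 128b loads/cycle, 1 128b store/cycle). The function names are made up for illustration, and it assumes an ideal case with no cache misses or other bottlenecks, so treat the results as upper bounds rather than measured behavior.

```python
# Per-cycle limits taken from the reply above; everything else is an
# illustrative assumption (ideal case, no cache misses, no other stalls).
VECTOR_BYTES = 128 // 8   # one 128b operand = 16 bytes
LOADS_PER_CYCLE = 2       # "2 128b loads/cycle max throughput"
STORES_PER_CYCLE = 1      # "1 128b store/c throughput"


def peak_bytes_per_cycle():
    """Ideal load and store bandwidth in bytes per cycle."""
    return {
        "load": LOADS_PER_CYCLE * VECTOR_BYTES,
        "store": STORES_PER_CYCLE * VECTOR_BYTES,
    }


def min_cycles_for_copy(n_bytes):
    """Lower bound on cycles to copy n_bytes using 128b loads and stores.

    Every 128b vector is loaded once and stored once, so the slower of
    the two streams (here: stores) sets the bound.
    """
    vectors = -(-n_bytes // VECTOR_BYTES)              # ceiling division
    load_cycles = -(-vectors // LOADS_PER_CYCLE)
    store_cycles = -(-vectors // STORES_PER_CYCLE)
    return max(load_cycles, store_cycles)


if __name__ == "__main__":
    print(peak_bytes_per_cycle())
    print(min_cycles_for_copy(4096))
```

In this simplified model a plain copy is store-bound: the second load per cycle only helps when a kernel reads more vectors than it writes.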
I recently read a source saying that those two don't do 64-bit MMX but 128-bit SSE! I really don't know why AMD has been so quiet about that so far and obfuscated it by using the wrong term "MMX". Therefore AMD can do 4 * 128-bit SSE/cycle!
(yes, “MMX” is likely a bad name to use in describing the Bulldozer micro-architecture and is somewhat misleading. Yes, we can do 4 128b arithmetic operations/cycle: 2 “floating-point” and 2 “SSE/AVX-128 integer”. Or, instead or in combination, we can also do 2 x87 “floating point” and 2 mmx “integer” per cycle, and by mmx I really mean the architected “mmx”).
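As a rough illustration of what "4 128b arithmetic operations/cycle" means per element, here is a small Python sketch. The pipe counts come straight from the reply. The lane arithmetic (128 bits divided by the element width) is just standard SIMD bookkeeping, and the function names are hypothetical. It assumes fully independent instructions and no other limits, so it gives a theoretical peak, not a benchmark.

```python
# Pipe counts as stated in the reply; names and structure are illustrative.
VECTOR_BITS = 128
PIPES = {
    "floating-point": 2,        # 2 "floating-point" pipes
    "SSE/AVX-128 integer": 2,   # 2 "SSE/AVX-128 integer" pipes
}


def peak_vector_ops_per_cycle():
    """Total 128b arithmetic operations that can start per cycle."""
    return sum(PIPES.values())


def peak_lane_ops_per_cycle(pipe_kind, element_bits):
    """Element-wise operations per cycle for one kind of pipe.

    A 128b vector holds VECTOR_BITS // element_bits elements (lanes),
    and each pipe of that kind can work on one vector per cycle.
    """
    lanes = VECTOR_BITS // element_bits
    return PIPES[pipe_kind] * lanes


if __name__ == "__main__":
    print(peak_vector_ops_per_cycle())
    print(peak_lane_ops_per_cycle("floating-point", 32))
    print(peak_lane_ops_per_cycle("SSE/AVX-128 integer", 8))
```

Real code will rarely hit these numbers, since it needs an even mix of floating-point and integer SIMD work with no dependencies between the operations.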
And that is the sound of me clapping my hands like a blackjack dealer and saying "all done", can't get any further into this topic.