Okay, nevermind then
AFAIK, a fastpath double instruction was issued, and there was nothing stopping it from executing on two SIMD ports in parallel, so long as they were available.
But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.
I had a reread of Agner's microarch PDF, and it seems you're right that the mops can go to any available unit; for some reason I was sure they both went to the same unit back to back. The entire FPU section is actually a really good read in relation to what we might see get better in Zen, stuff like:
The data cache has two 128-bit ports which can be used for either read or write. This
means that it can do two reads or one read and one write in the same clock cycle.
The measured throughput is two reads or one read and one write per clock cycle when only
one thread is active. We would not expect the throughput to be less when multiple threads
are active because each core has separate load/store units and level-1 data cache. But my
measurements indicate that level-1 cache throughput is several times lower when multiple
threads are running, even if the threads are running in different units that do not share any
level-1 or level-2 cache. This phenomenon is seen on both Bulldozer, Piledriver and
Steamroller. No explanation for this effect has been found. Level-2 cache throughput is
shared between two threads running in the same unit, but not affected by threads running in
the other unit.