Maybe you're relying too much on K7 then.
K8 has three (different) 64-bit wide FP execution pipelines available to the scheduler, which is what counts:
FPMUL, FPADD, FPMISC.
A 128-bit SIMD FP instruction has a throughput of one per 2 cycles: it goes through as 2 X 64-bit halves at 1 cycle each.
The three pipelines are all 4 stages deep.
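
The arithmetic behind that 2-cycle figure can be sketched like this. It is only a rough model, assuming a packed SSE operand is simply cut into pipe-width pieces that issue one per cycle; the function name and the constant are made up for illustration and do not come from any AMD document.

```python
import math

PIPE_WIDTH_BITS = 64  # width of each FP pipe, as stated in the post


def cycles_per_instruction(operand_bits, pipe_bits=PIPE_WIDTH_BITS):
    """Issue cycles for one packed SIMD instruction on a single pipe.

    Assumption: the operand is split into pipe-width pieces and one
    piece issues per cycle. Only meant for packed SSE widths, not x87.
    """
    return math.ceil(operand_bits / pipe_bits)


print(cycles_per_instruction(128))  # 128-bit packed op -> 2
print(cycles_per_instruction(64))   # 64-bit op -> 1
```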
As for actual hardware processing units, connected to the execution pipelines' ports:
FPMUL:
One 64-, 80- and 2X32-bit unit for multiplication, division and square root.
This one handles x87 and SSE2 instructions.
FPADD:
One 64-, 80-bit FP unit for adds and subtracts.
FPMUL and FPADD:
Then there is one 2X32 FPMUL + 2X32 FPADD unit for 3DNow! and SSE. It can execute one mul and one add simultaneously, but not two muls or two adds. So 128-bit throughput is still 2 cycles.
FPMISC:
One 64-, 80-bit FP store and misc unit. This handles stores, holds constants such as pi and e, and performs more complex microcoded operations.
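
To make the "one mul and one add, but not two muls or two adds" point concrete, here is a minimal sketch. It assumes independent packed-single instructions, ignores latency, dependencies and decode limits, and uses a hypothetical function name; treat it as back-of-envelope math, not a simulator.

```python
HALVES_PER_128BIT_OP = 2  # a 128-bit packed op issues as two 2X32 (64-bit) halves


def issue_cycles(n_mul, n_add):
    """Cycles to issue a mix of independent 128-bit packed muls and adds.

    Assumption: the mul unit and the add unit run in parallel, each
    taking one 64-bit half per cycle, so the busier unit sets the total.
    """
    return HALVES_PER_128BIT_OP * max(n_mul, n_add)


print(issue_cycles(1, 1))  # one mul + one add overlap -> 2
print(issue_cycles(2, 0))  # two muls cannot overlap -> 4
print(issue_cycles(4, 4))  # balanced mix -> 8
```

So a balanced mul/add mix gets twice the work through in the same number of cycles, while each instruction type on its own stays at one 128-bit op per 2 cycles.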
Then, as the FP pipes also handle SIMD integer instructions:
FPMUL and FPADD:
Also one 2X64 integer store, add, logic unit for 128-bit SIMD instructions. Again this can serve both FPMUL and FPADD simultaneously. Here, packed 128-bit throughput can be just one cycle, as I believe all instructions can be scheduled into both pipes. But this is integer, not FP.
FPMUL:
One 64-bit integer mul for SIMD instructions. I'm afraid we're back to 2 cycles for 128-bit.
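
The integer case comes down to how many 64-bit pipes can take the instruction. A rough sketch, where the table below just encodes what I described above (including the part I only believe to be true) and all names are made up for illustration:

```python
import math

# Hypothetical table: how many 64-bit pipes can accept each op class.
# "int_add_logic" assumes both FPMUL and FPADD pipes can take it.
ELIGIBLE_PIPES = {"int_add_logic": 2, "int_mul": 1}


def simd_int_cycles(op_class, operand_bits=128, pipe_bits=64):
    """Issue cycles for one packed integer op, ignoring latency."""
    halves = math.ceil(operand_bits / pipe_bits)
    # Halves spread across every pipe that can accept this op class.
    return math.ceil(halves / ELIGIBLE_PIPES[op_class])


print(simd_int_cycles("int_add_logic"))  # both pipes usable -> 1
print(simd_int_cycles("int_mul"))        # single multiplier -> 2
```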