Originally posted by: pm
Doesn't Core 2 Duo achieve something similar to a 3 operand FMAC instruction by using macro-op fusion?
This is a serious question by the way. I am not any kind of microprocessor architect type person and don't pretend to be.
It may, to some extent. As I understand it, though, MCW can only fuse certain kinds of operations (e.g. compare + jump). I'm also not sure whether macro-op fusion just saves slots in the various queues and the reorder buffer, or whether it can actually chain operations and reduce the overall latency. I don't think there are any savings available for integer ops, but for FP ops you might be able to reduce the overhead of aligning and rounding between instructions.
Even if it can combine fadd and fmul or SSE equivalents at reduced latency, there's still another advantage the 3-operand instructions have: fetch bandwidth (and, I guess, I-cache usage). Keeping a fast SIMD backend full requires shoveling huge amounts of instruction data into the frontend of the machine, because a lot of newer instructions have longer opcodes (plus prefixes). As the average instruction size increases, combining instructions becomes a bigger win. In the case of the example here, mulps + addps + shufps + movaps = 8 bytes for the "kernel" in the second half, while fmaddps + permps = 6 bytes (ignoring prefixes, and assuming I'm counting correctly).
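The byte counts above can be reproduced with a small tally. This is only a sketch under an assumption that the post does not state: that the count covers opcode bytes only, at 2 bytes for each SSE instruction and 3 bytes for each of the 3-operand instructions, with ModRM, immediates, and prefixes left out. The `struct insn` table and the per-instruction lengths are illustrative, not taken from an encoding reference.

```c
#include <stddef.h>

/* Hypothetical table entry: a mnemonic plus an ASSUMED opcode length.
   ModRM bytes, immediates, and prefixes are deliberately not counted. */
struct insn {
    const char *mnemonic;
    int opcode_bytes;
};

/* Sum the assumed opcode bytes over a sequence of instructions. */
int total_opcode_bytes(const struct insn *seq, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += seq[i].opcode_bytes;
    }
    return total;
}

/* Two-operand SSE version of the kernel (assumed 2 opcode bytes each). */
static const struct insn sse_kernel[] = {
    {"mulps", 2}, {"addps", 2}, {"shufps", 2}, {"movaps", 2},
};

/* Three-operand version of the kernel (assumed 3 opcode bytes each). */
static const struct insn fused_kernel[] = {
    {"fmaddps", 3}, {"permps", 3},
};
```

Under those assumptions, `total_opcode_bytes(sse_kernel, 4)` gives 8 and `total_opcode_bytes(fused_kernel, 2)` gives 6, matching the figures quoted. A real comparison would have to count every encoded byte of each instruction.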