Originally posted by: Fox5
I think the P4 only does 3 instructions per clock cycle, and the athlon 9, however in most circumstances the athlon is only capable of doing 6, and the P4 only one. In actual performance, athlon 64s do about 1, athlon xps a bit less, and p4s a bit less than that. I think the main thing that limits it is memory and cache performance, thus why the Pentium M's efficient cache helps a lot, and why the athlon 64's integrated memory controller helps a lot.
The P4 has 2 integer units and 2 floating-point units, along with 2 AGUs, giving a total of 6 execution units.
'Double-pumping' the ALUs proved necessary to compete with the integer performance of the P!!!.
The Athlon (K7 & K8) has 3 integer and 3 floating-point units, with 3 AGUs, giving a total of 9 execution units.
It can thus be described as 50% wider than the P4, and is able to execute 50% more instructions per clock, theoretically.
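To make the arithmetic behind that 50% figure explicit, here is a rough sketch in Python. It uses only the unit counts given in this post. The dictionary layout and the function names are my own, purely for illustration, and it assumes every unit counts equally towards "width", which is a simplification.

```python
# Execution-unit counts as stated in this post (not taken from vendor docs).
UNITS = {
    "P4": {"integer": 2, "floating_point": 2, "agu": 2},
    "Athlon (K7/K8)": {"integer": 3, "floating_point": 3, "agu": 3},
}


def total_units(cpu):
    """Sum the integer, floating-point and AGU counts for one CPU."""
    return sum(UNITS[cpu].values())


def relative_width(cpu, baseline):
    """Fraction by which `cpu` has more execution units than `baseline`."""
    return total_units(cpu) / total_units(baseline) - 1.0


p4_total = total_units("P4")
athlon_total = total_units("Athlon (K7/K8)")
extra = relative_width("Athlon (K7/K8)", "P4")

print(f"P4: {p4_total} units, Athlon: {athlon_total} units")
print(f"Athlon is {extra:.0%} wider on paper")
```

This is a peak, on-paper figure only. It says nothing about how often those units can actually be kept busy.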
As you say, it is difficult to keep those execution units full all the time, but the situation isn't quite that bad.
The K8 introduced a 'pick' stage into the Athlon's pipeline that analyses code for dependencies and reschedules instructions accordingly, allowing the Athlon to achieve IPC closer to its theoretical maximum (which, incidentally, is the reason for adding larger caches).
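A toy model shows why rescheduling around dependencies raises IPC. This is only a sketch under heavy assumptions and is not how the K8 actually works: every instruction takes one cycle, any unit can run any instruction, and the six-instruction program and the `simulate` function are made up for illustration.

```python
def simulate(deps, width, reorder):
    """Return the cycles needed to issue every instruction.

    deps[i] lists the earlier instructions that instruction i depends on.
    A result only becomes usable in the cycle after it was issued.
    With reorder=False, issue stops at the first instruction that is not ready.
    With reorder=True, later ready instructions may be issued past a stalled one.
    """
    done = set()
    pending = list(range(len(deps)))
    cycles = 0
    while pending:
        issued = []
        for i in pending:
            if len(issued) == width:
                break
            if all(d in done for d in deps[i]):
                issued.append(i)
            elif not reorder:
                break  # in-order: a stalled instruction blocks everything behind it
        done.update(issued)
        pending = [i for i in pending if i not in issued]
        cycles += 1
    return cycles


# Hypothetical program: two independent chains of three dependent instructions.
DEPS = [[], [0], [1], [], [3], [4]]
WIDTH = 3  # instructions that may issue per cycle in this toy machine

in_order_cycles = simulate(DEPS, WIDTH, reorder=False)
reordered_cycles = simulate(DEPS, WIDTH, reorder=True)

print(f"in-order:  {in_order_cycles} cycles, IPC {len(DEPS) / in_order_cycles:.2f}")
print(f"reordered: {reordered_cycles} cycles, IPC {len(DEPS) / reordered_cycles:.2f}")
```

In this toy case, in-order issue takes 5 cycles (IPC 1.2). Picking ready instructions from both chains takes 3 cycles (IPC 2.0), which is still short of the peak of 3 because each chain is serial.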