I think the P4 can only do 3 instructions per clock cycle at peak, and the Athlon 9; however, in most circumstances the Athlon is only capable of 6, and the P4 only one. In actual performance, Athlon 64s sustain about 1 instruction per clock, Athlon XPs a bit less, and P4s a bit less than that. I think the main thing that limits it is memory and cache performance, which is why the Pentium M's efficient cache helps a lot, and why the Athlon 64's integrated memory controller helps a lot.
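
To put rough numbers on why memory matters more than the peak figure, here is a back-of-the-envelope sketch in Python. It assumes a very simple stall model (every cache miss stalls the whole pipeline for a fixed number of cycles), and the miss rate and penalties are made-up illustrative values, not measurements of any real chip. The function name `effective_ipc` is just something I picked for the example.

```python
def effective_ipc(peak_ipc, misses_per_instruction, miss_penalty_cycles):
    """Sustained instructions per clock under a simple stall model.

    Assumption: cycles per instruction is the ideal cost (1 / peak_ipc)
    plus the average stall time spent waiting on memory.
    """
    cycles_per_instruction = (
        1.0 / peak_ipc + misses_per_instruction * miss_penalty_cycles
    )
    return 1.0 / cycles_per_instruction


if __name__ == "__main__":
    # Hypothetical numbers: 1 miss per 100 instructions, 100-cycle penalty.
    narrow = effective_ipc(peak_ipc=3, misses_per_instruction=0.01,
                           miss_penalty_cycles=100)
    wide = effective_ipc(peak_ipc=9, misses_per_instruction=0.01,
                         miss_penalty_cycles=100)
    # Same narrow core, but memory latency cut in half.
    faster_memory = effective_ipc(peak_ipc=3, misses_per_instruction=0.01,
                                  miss_penalty_cycles=50)

    print(f"peak 3, slow memory: {narrow:.2f}")
    print(f"peak 9, slow memory: {wide:.2f}")
    print(f"peak 3, fast memory: {faster_memory:.2f}")
```

With these assumed inputs, tripling the peak from 3 to 9 only moves the sustained figure from about 0.75 to about 0.90, while halving the miss penalty moves it to about 1.20. That is the same pattern as above: a better cache or a lower-latency memory path buys more than a wider core does.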