Hmmm, I'm not sure I agree with this, since I've seen tables that state point blank that an Athlon XP has an IPC count of 9 while the P4, with its longer pipe, has an IPC count of 6.
This is a common misconception...while the Athlon does issue up to 9 uops/cycle from its reorder buffers, and the P4 six, issue rate (or number of functional units, for that matter) does not uniquely determine IPC. Dynamically scheduled microarchitectures decouple the front-end fetch mechanism from the back-end scheduling, execution, and retirement mechanism; peak IPC is determined by the fetch and retire rate, which is often less than the peak issue rate. The P4 and Athlon, despite their difference in issue rate, are still both essentially 3-way fetch superscalar cores. While the Athlon fetches/decodes up to 3 x86 instructions/cycle into uops (average 1 to 2 uops/x86 instruction) and the P4 fetches 3 uops/cycle from its trace cache, they still both retire 3 uops/cycle despite their difference in issue rates from their reorder buffers. Note that the P3 issues 5 uops/cycle from its issue ports vs. the P4's 6 uops/cycle; the P4 certainly doesn't have a 20% higher IPC than the P3, nor the Athlon 80% higher. In terms of x86 instructions, even the Athlon rarely achieves an IPC higher than 1.2 x86 instructions/cycle.
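To make the point concrete, here is a rough sketch (the names CORES and peak_sustained_uops are made up for illustration, and the numbers are just the peak rates quoted above, not measurements). The sustainable uop throughput is bounded by the narrowest stage of the pipeline, not by the issue rate:

```python
# Peak per-cycle uop rates as quoted in this post; illustrative only.
# "front_end" is the rate at which uops enter the back end.
CORES = {
    "Athlon": {"front_end": 3, "issue": 9, "retire": 3},
    "P4": {"front_end": 3, "issue": 6, "retire": 3},
}

def peak_sustained_uops(rates):
    """Upper bound on sustained uops/cycle: the narrowest pipeline stage."""
    return min(rates.values())

for name, rates in CORES.items():
    print(name, "issue =", rates["issue"], "bound =", peak_sustained_uops(rates))
```

Both cores come out with the same bound of 3 uops/cycle despite the 9 vs. 6 issue rates, and real code lands well below that bound for all the reasons listed next.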
In actuality, IPC is determined by fetch/schedule/issue/retire rate; number and organization of functional units; pipelined instruction latency; reorder window size; number of renaming registers; in-order vs. out-of-order execution; speculative vs. non-speculative execution; pipeline length/branch mispredict penalty; clock rate; branch prediction/branch target buffer organization and accuracy; multilevel cache size, bandwidth, latency, associativity, block size, replacement algorithms, write-through/write-back characteristic; main memory latency and bandwidth; ISA characteristics: number of logical registers, number of operands; the compiler; the software;...and the kitchen sink.
In a similar fashion, it is often misconceived that the Athlon's 3 FP units (FP move, FP add, FP multiply) are responsible for its higher x87 FP performance over the P4, which has 2 FP units (FP move, FP add/multiply). Yet at the same time, the P3 has the same basic FP unit organization as the P4 (and actually can only issue one FP uop/cycle vs. the P4's two), and was still very competitive with the Athlon at the same clock rate. Likewise, the Alpha EV6 and EV7 have two FP units vs. the Athlon's three (though the former are more symmetric IIRC), while the EV7 at 1.2 GHz may come close to doubling the SPECFP 2K performance of the 1.8 GHz Athlon XP. In practice, FP performance is determined more by reorder window size, cache and system bandwidth, pipelined instruction latencies, and ISA characteristics, among other things. Also, the P4 is limited to fetching 1 FXCH instruction/cycle out of its trace cache (vs. 3 for the P3 and Athlon); the FXCH instruction is heavily used in modern x87 software to make x87's FP register stack behave like a flat register file. I've also read that the P4 is more sensitive to memory data alignment than previous x86 cores.
edit:
I'll try to be a little more clear on the width of the P4 and Athlon's pipeline. For the front-end (fetch, decode (for the Athlon), register rename and dispatch), the Athlon can fetch up to 3 x86 instructions/cycle and decode them into 3 to 6 uops...a register-memory x86 arithmetic operation gets decoded into a single register-register arithmetic uop and a load/store uop, while a register-register x86 instruction gets decoded into a single uop. It can then rename (72 total renaming registers) and dispatch up to 3 uops/cycle to the back-end's reorder buffers. In contrast, the P4 fetches up to 3 uops/cycle (actually 6 uops every other cycle) from its trace cache, and renames (126 total renaming registers)/dispatches 3 uops/cycle into the back end.
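As a minimal sketch of the Athlon front-end arithmetic described above (the function names and the two-entry table are hypothetical, and only the two instruction kinds mentioned here are modeled; a real decoder has more cases):

```python
# Simplified decode rule from the paragraph above; not a full decoder model.
UOPS_PER_KIND = {
    "reg-reg": 1,  # single register-register arithmetic uop
    "reg-mem": 2,  # arithmetic uop plus a load/store uop
}

def decoded_uops(group):
    """Total uops for a group of x86 instructions fetched in one cycle."""
    if len(group) > 3:
        raise ValueError("at most 3 x86 instructions are fetched per cycle")
    return sum(UOPS_PER_KIND[kind] for kind in group)

def cycles_to_dispatch(uops, width=3):
    """Cycles to send `uops` to the back end at `width` uops/cycle."""
    return -(-uops // width)  # ceiling division

best = decoded_uops(["reg-reg"] * 3)   # 3 uops
worst = decoded_uops(["reg-mem"] * 3)  # 6 uops
print(best, cycles_to_dispatch(best))
print(worst, cycles_to_dispatch(worst))
```

Under these assumptions, a fetch group made entirely of register-memory operations produces 6 uops and needs two cycles of dispatch bandwidth, which is one reason the front end rather than the issue width sets the ceiling.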
In the back end (schedule, execution and retirement), the P4 has a reorder window size of 128 instructions vs. the Athlon's 72. The Athlon can issue 3 integer execution uops/cycle, 3 address generation uops/cycle (which then issue to a 2 uop/cycle load/store unit), and 3 FP uops/cycle (one FP move, one FP add, one FP multiply). The P4, in contrast, shares some issue ports between its integer and FP units. It can issue 6 uops/cycle from four ports, two of which issue 2 uops/cycle since the two "double-speed" ALUs can each issue two (dependent or non-dependent) arithmetic uops/cycle. The lesser-used non-ALU integer execution unit is "normal speed." Also, the P4's memory units handle both address generation and load/stores (unlike the Athlon's, which are separate). Thus port 0 can issue a single uop in the first half of a clock cycle to either the first fast ALU or the FP-move unit; in the second half of a clock cycle, it can issue another uop to the fast ALU. Port 1 can issue a uop in the first half of a clock cycle to the second fast ALU, the "slow" ALU, or the FP-execute unit; likewise, in the second half of a clock cycle, it can issue another uop to the second fast ALU. Ports 2 and 3, for the load/store queue, can respectively issue a load and a store uop each cycle. Finally, both the Athlon (IIRC) and P4 retire 3 uops/cycle.
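The P4 port arrangement above can be sketched as a small table (all names here are made up for illustration, and the model only encodes what this post describes, not anything from a datasheet):

```python
# Units each ALU/FP port can feed in each half of a clock cycle.
FIRST_HALF = {
    0: {"fast_alu0", "fp_move"},
    1: {"fast_alu1", "slow_alu", "fp_execute"},
}
SECOND_HALF = {
    0: {"fast_alu0"},  # only the double-speed ALU takes a second uop
    1: {"fast_alu1"},
}
# Issue slots per clock: two double-pumped ports plus one load and one store.
SLOTS_PER_CLOCK = {"port0": 2, "port1": 2, "load": 1, "store": 1}

def can_issue(port, first, second=None):
    """True if `port` can issue `first`, then optionally `second`, in one clock."""
    ok = first in FIRST_HALF[port]
    if second is not None:
        ok = ok and second in SECOND_HALF[port]
    return ok

print(sum(SLOTS_PER_CLOCK.values()))         # peak issue slots per clock
print(can_issue(0, "fp_move", "fast_alu0"))  # FP move, then a fast ALU uop
print(can_issue(1, "fast_alu1", "slow_alu")) # slow ALU cannot take the second half
```

The slots add up to the 6 uops/cycle peak, but that peak needs four of the six uops to be simple ALU operations, so it says little about what typical code sustains through a 3 uop/cycle front end and retirement stage.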