Originally posted by: Shanti
Sohcan,
Not disputing your info, I'm just a little confused.
I know the peak IPC is 6 for the P4 and 9 for the Athlon XP, as you stated. But what I don't understand is when you say they are both limited to a peak issue and dispatch rate of 3.
6 and 9 uops/cycle is not their respective peak IPC, it's their peak dispatch/execution rate. Absolute peak IPC is determined by the peak fetch/issue and retire rate, which on any implemented microprocessor is always equal to or lower than the peak dispatch rate.
In a dynamically scheduled microprocessor (like any modern high-performance microprocessor except the Sun US-III and Itanium 1/2), instructions are fetched, decoded, and issued in program order. During the issue stage, register dependencies are checked and register renaming is performed, in which the x86 logical registers used in the instructions are mapped to a larger set of physical registers. From there the instructions are sent to reservation stations, where they sit until their operand values are ready and the instruction can be sent to the execution units out of program order. It is on this path, out of the reservation stations to the execution units, that the peak dispatch rate can be higher, i.e. 6 uops/cycle for the P4 and 9 uops/cycle for the Athlon. After execution, the instructions are buffered in a reorder buffer (a space is reserved for an instruction during issue), from which the results of the instruction are written to the register file or memory in program order. Thus the absolute peak performance is determined by the fetch/issue and retire rate, which for the Athlon and P4 are equal. The Athlon can decode more uops/cycle than the P4, but they both issue and retire 3 uops/cycle.
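To make the bottleneck argument concrete, here is a rough toy model in Python. It is only a sketch under heavy assumptions that are mine, not a description of either real chip: every uop is independent, executes in one cycle, never misses in a cache, and the function name, the `window` size, and the cycle count are all made up for illustration. It just counts how many uops retire per cycle when issue, dispatch, and retire each have a fixed width.

```python
def sustained_uops_per_cycle(issue_width, dispatch_width, retire_width,
                             cycles=1000, window=64):
    """Toy in-order-issue / out-of-order-dispatch / in-order-retire model.

    Hypothetical simplifications: all uops are independent, take one cycle
    to execute, and never stall on memory. `window` stands in for the
    reorder buffer capacity.
    """
    waiting = 0    # uops sitting in reservation stations
    executed = 0   # uops finished, waiting in the reorder buffer to retire
    retired = 0

    for _ in range(cycles):
        # Retire in program order, at most retire_width per cycle.
        r = min(retire_width, executed)
        executed -= r
        retired += r

        # Dispatch from the reservation stations to the execution units.
        d = min(dispatch_width, waiting)
        waiting -= d
        executed += d

        # Issue new uops, limited by issue width and free window entries.
        free = window - (waiting + executed)
        waiting += min(issue_width, free)

    return retired / cycles


# Same 3-wide issue and retire, different peak dispatch widths.
narrow = sustained_uops_per_cycle(issue_width=3, dispatch_width=6, retire_width=3)
wide = sustained_uops_per_cycle(issue_width=3, dispatch_width=9, retire_width=3)
print(narrow, wide)
```

In this model both configurations settle at the same throughput, just under 3 uops/cycle, because the dispatch stage can never drain more than the issue stage feeds it or the retire stage removes. Real cores are far messier, but the min-of-the-widths bound is the point.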
So using peak dispatch rate to judge IPC is rather fruitless, since the other microarchitectural parameters have a much greater effect. Having a higher peak dispatch rate can help dispatch more instructions/cycle out of the reservation stations after the reservation stations fill up due to a cache miss that goes to main memory, but in the grand scheme of things this buys very little extra performance since you are still limited by the lower retire rate.
I assumed the main reason that the XP significantly outperforms the P4 at equal clock rates was the higher IPC. If this is not the case, why is there such a discrepancy in clock-for-clock performance?
This is true: the XP tends to yield a higher IPC than the P4 on most workloads. What I was clarifying is that IPC is a function of a large number of parameters and cannot be universally defined by a single number, i.e. "the P4 has X IPC and the Athlon has Y IPC". The IPC can vary widely from one program to the next (between 0.5 and 1.5 on most common CPU-intensive desktop workloads), since software characteristics affect miss rates in the caches, TLBs, branch predictor, and branch target buffer (the buffer that predicts the instruction address from which to fetch in the following clock cycle). On top of that, the relative IPC between two microprocessors varies as well.
I have read lots of articles basically stating that performance can be roughly measured by clock speed x IPC. They point to the P4 and XP as an example of this. A rough generalization: 1400 MHz x 9 IPC = 12600 and 2000 MHz x 6 IPC = 12000 would be a good reason that a 1400 MHz XP is close to the same performance as a 2000 MHz P4. This seems to be about right with the Willamette, but I guess it doesn't really fit with Northwood. A 1400 MHz XP would match up well with a 2000 MHz Willamette but not with the 2000 MHz Northwood.
Can you explain a little further why there is a clock-for-clock performance discrepancy if it's not because of a difference in IPC?
Well, the accurate equation for performance is execution time = # instructions * CPI (inverse of IPC) * clock cycle time. Assuming that two processors use the same compiler, compiler optimizations, and instruction set, you can ignore # of instructions. To compare performance between the P4 and XP using IPC and clock rate, you can't use peak dispatch rate, especially since their respective dispatch rates of 9 and 6 uops/cycle are FAR higher than the average IPC for most programs, which is between 0.5 and 1.5 x86 instructions/cycle on both the P4 and Athlon. I can go over more of this later (I have to head off now), but the generally lower IPC of the P4 is due to other factors, including a longer branch misprediction penalty, a higher clock rate (because the P4 is clocked higher, memory latency measured in CPU clock cycles is higher), peculiarities in how it handles FP instructions, and a number of other reasons.
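As a quick worked example of that equation, here is a small Python sketch. The IPC figures in it (1.0 and 0.7) are hypothetical values I picked from inside the 0.5 to 1.5 range mentioned above purely for illustration; they are not measurements of any real Athlon XP or P4, and the function and variable names are likewise made up.

```python
def execution_time(instructions, ipc, clock_hz):
    """execution time = # instructions * CPI * clock cycle time."""
    cpi = 1.0 / ipc              # cycles per instruction (inverse of IPC)
    cycle_time = 1.0 / clock_hz  # seconds per clock cycle
    return instructions * cpi * cycle_time


# Same program, same compiler, same instruction set: instruction count is equal.
instructions = 1_000_000_000

# Hypothetical average IPC values, NOT measured data.
t_xp = execution_time(instructions, ipc=1.0, clock_hz=1.4e9)  # 1400 MHz part
t_p4 = execution_time(instructions, ipc=0.7, clock_hz=2.0e9)  # 2000 MHz part

print(f"1400 MHz, IPC 1.0: {t_xp:.3f} s")
print(f"2000 MHz, IPC 0.7: {t_p4:.3f} s")
```

With those assumed numbers the two run times come out the same, which shows how a modest difference in average IPC, not the 9 vs. 6 peak dispatch figures, is enough to offset a 600 MHz clock advantage. Plug in a different IPC for a different workload and the comparison shifts accordingly.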