Snipped from a post of mine about a month ago - it's directly cut'n'pasted from a discussion about the Pentium 4, so there are a few phrases that don't make complete sense in this context:
Let's take the example of a one-stage pipeline CPU. Making the numbers easy to work with (but unrealistically slow, so stick with me here), you might find that it takes 1s to complete the instruction decode, the add operation and the write of the result back to memory. Since the clock needs to wait for the data to be ready, you would find that you could clock this theoretical CPU at 1Hz. Now, if we could chop the logic neatly in half, we could complete the instruction decode and the first half of the add in 0.5s, then finish the add and write the result back to memory in another 0.5s. Nothing has really changed - it still takes 1s to complete one add operation, but now we can clock the design at 2Hz. So, with back-to-back instructions filling the pipeline, we can now complete them twice as fast. In theory, we have doubled the performance of this theoretical CPU. This is pipelining.
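To make the arithmetic concrete, here's a minimal sketch of that toy example (mine, not from the original post - the function and numbers are just illustrative). With a full pipeline, N back-to-back instructions take (stages + N - 1) clock cycles:

```python
# Toy pipeline timing model: N back-to-back instructions through a
# pipeline take (stages + N - 1) cycles once you count the fill time.

def total_time(n_instructions, stages, period_s):
    """Time to finish n back-to-back instructions, assuming a full pipeline."""
    return (stages + n_instructions - 1) * period_s

N = 1000
print(total_time(N, stages=1, period_s=1.0))  # 1000.0 s at a 1Hz clock
print(total_time(N, stages=2, period_s=0.5))  # 500.5 s at a 2Hz clock -> ~2x
print(total_time(1, stages=2, period_s=0.5))  # 1.0 s: single-add latency unchanged
```

The latency of one add is still 1s either way; only the rate at which adds complete has doubled.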
But if you don't take advantage of the ability to clock this new design at 2Hz instead of 1Hz, then what have you accomplished? Basically nothing. You are still finishing one instruction per clock (assuming the pipeline is full), but you are clocking the thing at exactly the same speed as before. And since there are plenty of things that make pipelining a CPU less than 100% efficient, in reality you have actually managed to cripple your design slightly by pipelining. At the lower clock frequency, it's actually slower than the original one.
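One of those efficiency losses is the fixed overhead each stage adds (the pipeline latch/register delay, part of the CK -> Q stuff mentioned below). A quick back-of-the-envelope sketch, with an assumed 0.05s of latch overhead per stage, shows why the speedup is always less than the stage count and why latency only gets worse:

```python
# Assumed numbers: 1.0s of total logic, 0.05s of latch overhead per stage.
LOGIC_S = 1.0
LATCH_S = 0.05

def min_period(stages):
    """Shortest clock period: one stage's share of the logic plus its latch."""
    return LOGIC_S / stages + LATCH_S

for stages in (1, 2, 4, 10):
    period = min_period(stages)
    print(stages, round(1 / period, 2), "Hz max clock,",
          round(stages * period, 2), "s latency per instruction")
# 1  -> 0.95 Hz max clock, 1.05 s latency
# 2  -> 1.82 Hz max clock, 1.1 s latency
# 4  -> 3.33 Hz max clock, 1.2 s latency
# 10 -> 6.67 Hz max clock, 1.5 s latency
```

And if you run the 2-stage design at the original 1Hz anyway, each add now takes 2s instead of 1s - worse than the design you started with.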
This is why it's crazy to do clock-for-clock comparisons of CPUs with different pipelines. The Athlon has a 10-stage pipeline; the Pentium 4 has one that's 20 stages. If you clock the Pentium 4 at really slow frequencies for comparison, of course it's going to look bad. It's not supposed to run that slowly. The pipeline stage increase allows you to clock it faster, so it should be run faster for comparison; otherwise you are purposely defeating the point of having a lot of pipeline stages.
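As a purely hypothetical illustration (the stage counts are from above, the clocks and penalties are made up, and the misprediction penalty is explained in the next paragraph), compare the two styles of design at a forced common clock versus at their own design clocks:

```python
# All numbers are assumed, for illustration only. Throughput = clock / CPI,
# where the deeper pipeline pays a longer misprediction penalty but can be
# clocked roughly twice as fast.

def throughput_gips(clock_ghz, penalty_cycles,
                    branch_frac=0.2, mispredict_rate=0.1):
    cpi = 1 + branch_frac * mispredict_rate * penalty_cycles
    return clock_ghz / cpi

# (name, assumed design clock in GHz, assumed misprediction penalty in cycles)
for name, clock, penalty in (("10-stage", 1.4, 8), ("20-stage", 2.8, 18)):
    print(name,
          "clock-for-clock:", round(throughput_gips(1.4, penalty), 2),
          "| at design clock:", round(throughput_gips(clock, penalty), 2))
# 10-stage clock-for-clock: 1.21 | at design clock: 1.21
# 20-stage clock-for-clock: 1.03 | at design clock: 2.06
```

Held to the shallow design's clock, the deep pipeline loses; run at the frequency it was built for, it wins big. That's the whole point of the extra stages.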
But back to pipelining, you might think, "well, what's the limit? Why can't we put 100+ pipeline stages into a CPU and make it 100x faster?" Aside from the obvious one, there are plenty of reasons, and I'm not going to go into clock skew/uncertainty, CK -> Q vs. logic delays and other really in-depth stuff. The obvious reason is what everyone has mentioned: branch prediction.

Let's say we have all of these instructions in the pipeline and one of them is a branch instruction... say we are comparing two numbers, and if they are equal we will execute one section of code, and if they aren't equal we will execute another. We want to keep the pipeline full, but we won't know the outcome of the branch until later. What do we do? We make an educated guess, which in CPU terms is "branch prediction". If we get it right, the pipeline stays full and everything continues on like before. If we get it wrong, then we need to dump all the instructions that we started after the branch and load in the instructions from the other path.

This is the big downside of pipelining. There are others, but this is the biggie. Since we can't always get branches right, we will take a misprediction penalty when we screw up. So you definitely don't want a 40-stage pipeline, because then you may have to wait 38-39 cycles before everything is back to normal after a misprediction. Devices that don't really have branches to worry about (DSPs spring to mind immediately) tend to have really long pipelines. Theoretical studies back in the early to mid 90's said that the practical limit for a CPU, based on the branch prediction methods of the time, was approx. 16 stages (Computer Architecture: A Quantitative Approach, Hennessy and Patterson). But the branch predictor on the Pentium 4 is pretty good, so it could push past this a little.
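The same effective-CPI arithmetic as above answers the "why not 100 stages?" question directly. A sketch with assumed numbers (20% branches, 10% mispredict rate, and the optimistic assumption that the clock scales perfectly with depth):

```python
# Toy model, assumed numbers: clock scales linearly with stage count
# (ignoring latch overhead entirely), but a misprediction flushes roughly
# the whole pipe, so the penalty grows with depth too.

def relative_perf(stages, branch_frac=0.2, mispredict_rate=0.1):
    clock = stages                                          # relative clock
    cpi = 1 + branch_frac * mispredict_rate * (stages - 1)  # flush cost
    return clock / cpi

for stages in (5, 10, 16, 20, 40, 100):
    print(stages, "stages ->", round(relative_perf(stages), 2), "x")
# 5 -> 4.63x, 10 -> 8.47x, 16 -> 12.31x, 20 -> 14.49x,
# 40 -> 22.47x, 100 -> 33.56x
```

Even in this best case, 100 stages buys you roughly 34x, not 100x, and every extra point of mispredict rate (or any latch overhead at all) pushes the curve down further.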
Patrick Mahoney
IPF Microprocessor Design
Intel Corp.