Take the example of a one-stage pipeline CPU. Making the numbers easy to work with (but unrealistically slow, so stick with me here), you might find that it takes 1s to complete the instruction decode, the add operation, and the write of the result back to memory. Since the clock needs to wait for the data to be ready, you would find that you could clock this theoretical CPU at 1Hz. Now, if we could chop the logic neatly in half, we could complete the instruction decode and the first half of the add in 0.5s, then finish the add and write the result back to memory in another 0.5s. Nothing has really changed - it still takes 1s to complete one add operation - but now we can clock the design at 2Hz. So, if you have back-to-back instructions filling up the pipeline, we can now complete them twice as fast. In theory, we have doubled the performance of this theoretical CPU. This is pipelining.
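To see the latency-vs-throughput trade more concretely, here is a minimal sketch of the arithmetic above. The function name and numbers are just illustrative (an ideal pipeline with no stalls), not from any real CPU:

```python
def pipeline_stats(total_work_s, n_stages, n_instructions):
    """Ideal pipeline: clock period = total work / number of stages,
    back-to-back instructions, no stalls of any kind."""
    period = total_work_s / n_stages   # each stage gets an equal slice
    latency = period * n_stages        # time for ONE instruction: unchanged
    # The first instruction takes n_stages cycles to drain out;
    # every instruction after that completes one cycle later.
    total_time = period * (n_stages + n_instructions - 1)
    return latency, total_time

# 100 back-to-back adds, 1s of total logic per add:
print(pipeline_stats(1.0, 1, 100))  # 1 stage:  latency 1.0s, total 100.0s
print(pipeline_stats(1.0, 2, 100))  # 2 stages: latency 1.0s, total 50.5s
```

Each add still takes 1s start to finish, but the two-stage version finishes the whole batch in roughly half the time - that's the throughput win.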
You might think, "Well, what's the limit? Why can't we put 100+ pipeline stages into a CPU and make it 100x faster?" Aside from the obvious one, there are plenty of other reasons, but I'm not going to go into clock skew/uncertainty, CK -> Q vs. logic delays and other really in-depth stuff. The obvious reason is branch misprediction. Let's say we have all of these instructions in the pipeline and one of them is a branch instruction... say we are comparing two numbers, and if they are equal we will execute one section of code, and if they aren't equal then we will execute another. We want to keep the pipeline full, but we won't know the outcome of the branch until later. What do we do? We make an educated guess, which in CPU terms is "branch prediction". If we get it right, the pipeline stays full and everything continues on like before. If we get it wrong, then we need to dump all the instructions that we started after the branch and load in the other path. This is the big downside of pipelining. There are others, but this is the biggie. Since we can't always get branches right, we take a misprediction penalty whenever we guess wrong. So you definitely don't want a 60-stage pipeline, because then you may have to wait 59 cycles before everything is back to normal after a misprediction.
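A rough back-of-the-envelope sketch of why deep pipelines hurt on mispredictions. The model is simplified and the numbers (branch frequency, predictor accuracy, full-pipeline flush) are assumptions for illustration only:

```python
def effective_cpi(pipeline_depth, branch_fraction, predict_accuracy):
    """Average cycles per instruction, assuming an ideal CPI of 1 and a
    full pipeline refill (depth - 1 wasted cycles) on each misprediction."""
    miss_rate = branch_fraction * (1 - predict_accuracy)
    return 1.0 + miss_rate * (pipeline_depth - 1)

# Assume 20% of instructions are branches and the predictor is 95% accurate:
print(effective_cpi(5, 0.2, 0.95))   # shallow pipeline: small penalty
print(effective_cpi(60, 0.2, 0.95))  # 60 stages: the 59-cycle flush adds up
```

Even with a good predictor, the deep pipeline pays far more per miss, which is part of why stage counts don't just keep climbing.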
Patrick Mahoney
IPF Microprocessor Design
Intel Corp.