Pipeline Question

Lvsheng

Member
Mar 9, 2001
54
0
0
Anyone know why a longer pipeline will allow a chip to operate at a higher frequency?
 

AndyHui

Administrator Emeritus, Elite Member, AT FAQ M
Oct 9, 1999
13,141
16
81
You really want to ask all the tough questions huh?

What do you need to do when processing an instruction? In other words, what happens in a pipeline? The pipeline breaks up the processing of an instruction into smaller stages. In each clock cycle the instruction will move through the pipeline to the next stage. A typical pipeline consists of something like Fetch, Decode, Rename, Dispatch, Execute, Retire.

Doing one of these stages takes time: time for the processor to do its work. If you have more stages, the amount of work needed to be done in one clock cycle is less. (This is related to, but not the same as, a lower Instructions Per Clock figure, or IPC, which counts how many instructions finish per cycle.) With less work per stage, each stage has fewer gate delays for the signals to pass through, so each stage finishes sooner. That means the clock speed can be increased without compromising stability.

Hope that answers your question.
 

Lvsheng

If you have more stages, the amount of work needed to be done in one clock cycle is less

I can't understand this phrase. I have seen many P4 reviews saying that the P4 actually does less work than the P3 at the same clock speed. E.g., if a P3 and a P4 are both running at 1.3GHz, the P3 will outperform the P4 because it actually does more work. The P4 does less work because of its longer pipeline, but I don't understand this.

What does doing more or less work mean?

So take the P4 as an example: it has a 20-stage pipeline. So, as you mentioned earlier, the P4 breaks an instruction into various stages, right? Then it executes them simultaneously, right? After that, how does the process continue?

Please explain the more work, less work part; that is the important point.
 

AndyHui

You appear to have misunderstood the term "pipeline". What happens in a pipe? It flows from one part to another. The various stages are NOT executed simultaneously on a single instruction.

In one "tick" of the clock, an instruction will move through one part of the pipeline and then progress to the next stage in the next "tick" of the clock. Take my very short 6 stage pipeline example that I gave above: Fetch, Decode, Rename, Dispatch, Execute, Retire.

Clock cycle 1: Fetch
Clock cycle 2: Decode
Clock cycle 3: Rename
Clock cycle 4: Dispatch
Clock cycle 5: Execute
Clock cycle 6: Retire

As you can see, it takes six clock cycles to complete an instruction.

Of course, it would be silly to have just one instruction going through the pipeline at a time. The processor can line up another instruction right behind the first, and have a total of 6 instructions going all at once through the pipeline at various stages of execution.

If I had a longer, 20 stage pipeline, it would take 20 clock cycles to completely finish an instruction. With 20 stages in the pipeline, the Pentium 4 does approximately 1/3 (6/20) of the work on each instruction in a single clock cycle that my 6 stage processor does.
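
To put rough numbers on that, here is a minimal Python sketch. It assumes every stage does an equal share of the work, and the function names and clock speeds are made up for illustration; they are not measurements of any real chip.

```python
def stage_share(stages):
    """Fraction of one instruction's work done per clock, assuming equal stages."""
    return 1.0 / stages

def latency_ns(stages, clock_hz):
    """Time for one instruction to pass through every stage, in nanoseconds."""
    return stages * 1e9 / clock_hz

# Work per clock of the 20-stage design relative to the 6-stage one (6/20).
ratio = stage_share(20) / stage_share(6)
print(f"20-stage does about {ratio:.2f} of the 6-stage work per clock")

# Hypothetical clock speeds, chosen only to show the arithmetic.
for name, stages, clock_hz in [("6-stage", 6, 0.5e9), ("20-stage", 20, 1.5e9)]:
    print(f"{name}: {stages} cycles, {latency_ns(stages, clock_hz):.2f} ns per instruction")
```

The thing to notice is that the number of cycles per instruction and the time per instruction are different quantities: more cycles at a shorter clock period need not mean more nanoseconds.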
 

Lvsheng

OK, I'm getting the picture now, but I need to clarify first.

In the first clock cycle:
In your 6-stage pipeline, Fetch will enter the first pipeline stage.
Then pipeline stages 2, 3, 4, 5 and 6 will be reserved for other instructions.

In the second clock cycle:
Decode will come in and occupy one stage, and the remaining 5 will be used by other instructions.

And so on...

Is this how the pipeline works?
 

AndyHui

Instruction A will be in the Fetch stage in clock 1.
Instruction A will be in Decode in clock 2; Instruction B can move into Fetch.
Instruction A will be in Rename in clock 3; Instruction B will be in Decode; Instruction C can go into Fetch.

And so on.
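
If it helps to see the whole table, the pattern above can be sketched in a few lines of Python. This is an idealized model (one new instruction enters Fetch every clock, no stalls), and the `occupancy` helper is just a name made up for this example.

```python
STAGES = ["Fetch", "Decode", "Rename", "Dispatch", "Execute", "Retire"]

def occupancy(instructions, cycles):
    """Return one dict per clock cycle mapping stage name -> instruction in it."""
    table = []
    for clock in range(1, cycles + 1):
        row = {}
        for index, name in enumerate(instructions):
            # Instruction number `index` enters Fetch at clock index + 1
            # and advances one stage per clock after that.
            stage = clock - 1 - index
            if 0 <= stage < len(STAGES):
                row[STAGES[stage]] = name
        table.append(row)
    return table

# Print which instruction sits in each stage, one line per clock.
print("         " + " ".join(f"{s:^8}" for s in STAGES))
for clock, row in enumerate(occupancy("ABCDEF", 8), start=1):
    cells = " ".join(f"{row.get(s, '-'):^8}" for s in STAGES)
    print(f"clock {clock}: {cells}")
```

By clock 6 every stage holds a different instruction, which is the "pipeline full" condition.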

Makes sense?
 

Lvsheng

Yeah, that makes very good sense; now I have a good picture of how this pipeline works. So won't the P4 be very slow if the same instruction takes 20 clocks to complete on it, whereas the P3 only takes 10 clocks to complete?

Then that means we can't always make the pipeline longer to increase clock speed, right? Otherwise those chips would actually get slower and slower (if the clock speed stays the same).
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Snipped from a post of mine about a month ago - it's directly cut'n'pasted from a discussion about the Pentium 4 so there are a few phrases that don't make complete sense in this context:

Let's take the example of a one-stage pipeline CPU. Making the numbers easy to work with (but unrealistically slow, so stick with me here), you might find that it takes 1s to complete the instruction decode, the add operation and then write the result back to memory. Since the clock needs to wait for the data to be ready, you would find that you could clock this theoretical CPU at 1Hz. Now, if we could chop the logic neatly in half, we would find that we can complete the instruction decode and one half of the add in 0.5s, and then finish the add and write the result back to memory in another 0.5s. Nothing has really changed - it still takes 1s to complete one add operation, but now we can clock the design at 2Hz. So, if you have back to back instructions filling up the pipeline, we can now complete them twice as fast. In theory, we have doubled the performance of this theoretical CPU. This is pipelining.
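
The 1Hz vs. 2Hz arithmetic above can be checked with a minimal Python sketch. It assumes an ideal pipeline with no stalls, where the first instruction needs one cycle per stage and every later one finishes one cycle after its predecessor; `total_time` is a name invented for this example.

```python
def total_time(stages, clock_hz, instructions):
    """Seconds to finish `instructions` back-to-back on an ideal pipeline."""
    cycles = stages + instructions - 1   # fill the pipe once, then one result per cycle
    return cycles / clock_hz

one_stage = total_time(stages=1, clock_hz=1.0, instructions=100)   # unpipelined at 1Hz
two_stage = total_time(stages=2, clock_hz=2.0, instructions=100)   # split in half at 2Hz

print(f"one stage at 1Hz:  {one_stage} s")
print(f"two stages at 2Hz: {two_stage} s")
print(f"speedup: {one_stage / two_stage:.2f}x")
```

The speedup comes out just under 2x because of the one extra cycle spent filling the pipeline, and it gets closer to 2x the longer the run of back-to-back instructions.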

But, if you don't take advantage of the ability to clock this new design at 2Hz instead of 1Hz, then what have you accomplished? Basically nothing. You are still finishing one thing per clock (assuming the pipeline is full), but you are clocking the thing at the exact same speed as before. Since there are plenty of things which make pipelining a CPU less than 100% efficient, in reality you have managed to cripple your design slightly by pipelining. At the lower (1Hz) clock frequency, it's actually slower than the original one.

This is why it's crazy to do clock for clock comparisons of CPUs with different pipelines. The Athlon has a 10-stage pipeline; the Pentium 4 has one that's 20 stages. If you clock the Pentium 4 at really slow frequencies for comparison, of course it's going to look bad. It's not supposed to run that slowly. The pipeline stage increase allows you to clock it faster, so it should be run faster for comparison; otherwise you are purposely defeating the point of having a lot of pipeline stages.

But back to pipelining, you might think, "well, what's the limit? why can't we put in 100+ pipeline stages into a CPU and make it 100x faster?". Aside from the obvious one, there are plenty of reasons, and I'm not going to go into clock skew/uncertainty, clock-to-Q (CK -> Q) vs. logic delays and other really in-depth stuff.

The obvious reason is what everyone has mentioned: branch prediction. Let's say we have all of these instructions in the pipeline and one of them is a branch instruction... say we are comparing two numbers and if they are equal we will execute one section of code, and if they aren't equal then we will execute another. We want to fill the pipeline, but we won't know the outcome of the branch until later. What do we do? We make an educated guess, which in CPU terms is "branch prediction". If we get it right, the pipeline stays full and everything continues on like before. If we get it wrong, then we need to dump all the instructions that we started after the branch, and then load in the other branch. This is the big downside of pipelining. There are others, but this is the biggie.

Since we can't always get branches right, we will take a misprediction penalty when we screw up. So you definitely don't want a 40 stage pipeline, because then you may have to wait 38-39 cycles before everything is back to normal on a misprediction. Devices that don't really have branches to worry about (DSPs spring to my mind immediately) tend to have really long pipelines. Theoretical studies back in the early to mid 90s said that the practical limit for a CPU based on current branch prediction methods at the time was approx. 16 stages (Computer Architecture: A Quantitative Approach, Hennessy and Patterson). But the branch predictor on the Pentium 4 is pretty good, so it could push past this a little.
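
To make the tradeoff concrete, here is a toy Python model. Everything in it is an assumption for illustration only: the fixed per-stage latch overhead, the branch and misprediction rates, a misprediction costing `stages - 1` cycles, and the function name itself. It is not how any real CPU was sized, and the depth it picks depends entirely on the made-up parameters.

```python
def time_per_instruction(stages, logic_delay=10.0, latch_overhead=0.5,
                         branch_fraction=0.2, mispredict_rate=0.1):
    """Average time per instruction (arbitrary units) in a toy pipeline model."""
    # Clock period: an equal slice of the logic plus a fixed per-stage overhead.
    period = logic_delay / stages + latch_overhead
    # Assumed cost of a wrong guess: refill everything behind the branch.
    penalty_cycles = stages - 1
    # One cycle per instruction when full, plus the average misprediction cost.
    cycles = 1.0 + branch_fraction * mispredict_rate * penalty_cycles
    return period * cycles

best = min(range(1, 101), key=time_per_instruction)
print(f"best depth under these made-up numbers: {best} stages")

# With a perfect predictor, the model always prefers the deepest pipeline.
perfect = min(range(1, 101), key=lambda k: time_per_instruction(k, mispredict_rate=0.0))
print(f"with no mispredictions: {perfect} stages")
```

The shape is what matters: each extra stage shortens the clock period by less and less, while the cost of a misprediction keeps growing, so beyond some depth adding stages makes things slower.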

Patrick Mahoney
IPF Microprocessor Design
Intel Corp.
 

Lvsheng

Okay, I get a good picture now. Thanks for the explanation, pm. Wow, you work for Intel? You must be a rocket scientist.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76


<< You must be a rocket scientist >>


More likely an MPU designer; I don't think they make a whole lot of rocket designs at Intel ;)

Just a tip: if you want serious in-depth articles, have a look at AcesHardware.

They have a lot of really good articles about stuff like this, and the technical board is solely for discussions about it.
Probably the best place to go for this type of stuff (which is of course not to say that pm can't explain it very well as well :)).