Quick question about pipelines

LeftSide

Member
Nov 17, 2003
129
0
0
OK, all this talk about pipelines has me confused...
When the Prescott has a 31-stage pipeline, does it take 31 clock cycles to get 1 instruction through? Or is it just 1 clock cycle broken into 31 stages? I have heard both, but I think that it takes 31 clock cycles. This would explain why AMD gets so much more done with only a 15-stage pipeline.

Please feed me information, and solve the Pipeline mystery once and for all!!!
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
It does take 31 clocks to get one instruction through the pipeline, though in an ideal world a 15-stage pipeline would perform identically to a 31-stage pipeline. If you fill the pipeline you have (in Prescott's case) 31 instructions in flight at once, so one instruction finishes per clock. It is only when you cannot keep the pipeline full that performance begins to drop.
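
To put rough numbers on that, here is a minimal sketch of an idealized pipeline. It assumes one instruction enters per clock and nothing ever stalls; the function name `cycles_to_finish` is made up for illustration and is not a model of any real CPU.

```python
def cycles_to_finish(depth: int, n_instructions: int) -> int:
    """Clock cycles an idealized pipeline needs to complete n instructions.

    Assumes one instruction enters per cycle and there are no stalls.
    """
    if n_instructions == 0:
        return 0
    # The first instruction takes `depth` cycles to come out the far end;
    # every later instruction finishes one cycle after the one before it.
    return depth + n_instructions - 1


for depth in (15, 31):
    for n in (1, 1000, 1_000_000):
        total = cycles_to_finish(depth, n)
        print(f"{depth}-stage, {n} instructions: {total} cycles "
              f"({total / n:.4f} cycles per instruction)")
```

With a single instruction the deeper pipeline needs 31 cycles against 15, but over a long run both settle at very nearly one cycle per instruction, because the fill cost is only paid once.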
 

sao123

Lifer
May 27, 2002
12,653
205
106
Branches must be predicted... this is the true nemesis of long pipelines.
It is very difficult to make branch prediction algorithms highly accurate. The longer the pipeline, the longer the stall when a misprediction happens.
 

aka1nas

Diamond Member
Aug 30, 2001
4,335
1
0
Or to expand a bit, the longer pipeline is only "slower" for the first 30 clock cycles while the pipeline is still partially empty; after that it cranks out instructions more or less as fast as the shorter pipeline, provided that it doesn't hit a branch misprediction.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Lynx516
It does take 31 clocks to get one instruction through the pipeline, though in an ideal world a 15-stage pipeline would perform identically to a 31-stage pipeline. If you fill the pipeline you have (in Prescott's case) 31 instructions in flight at once, so one instruction finishes per clock. It is only when you cannot keep the pipeline full that performance begins to drop.

No. In an ideal world, the 31 stage pipeline would be faster. Both pipelines would be retiring one instruction per cycle, but the 31 stage pipeline would be clocked about twice as fast.
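
A rough sketch of that comparison, still in the ideal no-stall world. The 2.0 GHz and 4.0 GHz clock figures are placeholders picked only to mean "about twice as fast"; they are not measured speeds of any actual chip, and `run_time_ns` is a made-up helper.

```python
def run_time_ns(depth: int, n_instructions: int, clock_ghz: float) -> float:
    """Wall-clock time in nanoseconds for an ideal, stall-free pipeline."""
    cycles = depth + n_instructions - 1  # fill the pipeline, then 1 per cycle
    return cycles / clock_ghz            # at 1 GHz one cycle lasts 1 ns


n = 1_000_000
# Assumed clocks: the deeper pipeline is taken to run at twice the frequency.
short_pipe = run_time_ns(depth=15, n_instructions=n, clock_ghz=2.0)
long_pipe = run_time_ns(depth=31, n_instructions=n, clock_ghz=4.0)

print(f"15-stage at 2.0 GHz: {short_pipe:.1f} ns")
print(f"31-stage at 4.0 GHz: {long_pipe:.1f} ns")
print(f"speedup of the deeper pipeline: {short_pipe / long_pipe:.3f}x")
```

Both retire one instruction per cycle, so under these assumptions the deeper pipeline finishes in about half the time simply because its cycles are shorter.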

The problem is that the world isn't ideal, and branches exist (something like 1 in every 5 instructions, according to various professors), so if you can predict 95% of them correctly, and a 30 stage pipeline has 6 branches in flight, you're going to be wasting a lot of time following incorrect paths.
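
Working through those numbers as a back-of-the-envelope sketch. Two things here are assumptions and not from the post: each prediction is treated as independent, and a misprediction is taken to cost roughly one full pipeline's worth of cycles (the real penalty depends on where in the pipeline the branch is resolved).

```python
branch_fraction = 1 / 5   # about 1 in every 5 instructions is a branch
accuracy = 0.95           # 95% of branches predicted correctly
branches_in_flight = 6    # branches inside a 30-stage pipeline at once

# Chance that at least one of the in-flight branches was mispredicted,
# assuming the predictions are independent of each other.
p_at_least_one_miss = 1 - accuracy ** branches_in_flight
print(f"P(at least one of 6 mispredicted) = {p_at_least_one_miss:.3f}")

# Mispredictions per instruction: a branch 20% of the time, wrong 5% of those.
mispredicts_per_instruction = branch_fraction * (1 - accuracy)


def effective_cpi(flush_penalty_cycles: float) -> float:
    """Average cycles per instruction, starting from an ideal CPI of 1.

    flush_penalty_cycles is the assumed cost of one misprediction.
    """
    return 1 + mispredicts_per_instruction * flush_penalty_cycles


for depth in (15, 30):
    print(f"{depth}-cycle penalty: CPI of about {effective_cpi(depth):.2f}")
```

Under these assumptions there is roughly a one in four chance that the pipeline is holding work from a wrong path at any given moment, and the average cost per instruction grows in direct proportion to the flush penalty.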
 

Brucmack

Junior Member
Oct 4, 2002
21
0
0
Originally posted by: Lynx516
I was using the ideal world where transistor propagation delay didn't exist.

If everything happens in 0 seconds, then you've just simplified to the point where it doesn't matter how many stages you have anymore... At an infinite clock speed everything's going to get done pretty quickly :)
 

Matthew Daws

Member
Oct 19, 1999
31
0
0
It's not just branches that cause a problem, but also pipeline stalls. These occur when an instruction in the pipeline needs the result of an instruction further up the pipeline which hasn't finished yet. Think about multiplying two numbers and adding a third: you need the result of the multiply before you can do the add (and many instructions alter the CPU flags, which may or may not affect how subsequent instructions behave). If the pipeline is shorter, then it takes fewer cycles to get the result out and allow the rest of the pipeline to continue. This is what out-of-order execution is all about: the scheduling part of the CPU tries to rearrange the order in which instructions are put into the pipeline. This is also why RISC processors have a lot of registers, to allow programmers (or compilers) to not overuse the same register (and why both the Athlon and P4 have a lot of hidden registers and on-the-fly register renaming).
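
A toy model of the multiply-then-add case, just to show the shape of the stall. It assumes strict in-order issue of at most one instruction per cycle and a single made-up parameter, `result_latency`, for the number of cycles between an instruction entering the pipeline and its result being usable. Real CPUs forward results between stages, so the actual wait is normally shorter than the full pipeline depth.

```python
def total_cycles(deps, result_latency):
    """Cycles to run a short in-order program on a toy pipeline.

    deps[i] is the index of the earlier instruction whose result
    instruction i needs, or None if it depends on nothing.
    """
    if not deps:
        return 0
    issue = []  # cycle at which each instruction enters the pipeline
    for dep in deps:
        # In-order issue: no earlier than one cycle after the previous one.
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            # Stall until the producing instruction's result is available.
            earliest = max(earliest, issue[dep] + result_latency)
        issue.append(earliest)
    return issue[-1] + result_latency


mul_then_add = [None, 0]     # the add needs the multiply's result
independent = [None, None]   # two unrelated instructions

for latency in (4, 8):
    print(f"latency {latency}: dependent pair takes "
          f"{total_cycles(mul_then_add, latency)} cycles, independent pair "
          f"{total_cycles(independent, latency)} cycles")
```

In this model the dependent pair costs twice the latency, while the independent pair costs only one cycle more than a single instruction. Finding independent work to slot into that gap is what the out-of-order scheduler is for.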

Of course, as CTho9305 points out, the longer pipeline of the P4 allows it to hit much higher clock speeds, and in an ideal case (e.g. multimedia streaming using SSE2 stuff) it is a lot faster.

--Matt