Why do longer pipelines allow higher clock speeds?

Woodchuck2000

Golden Member
Jan 20, 2002
Having just read the Banias technology article, I realised that I haven't got a clue why increasing the length of a pipeline allows higher frequencies. Anyone care to explain?
 

rimshaker

Senior member
Dec 7, 2001
In a nutshell, in order for the CPU core to run efficiently, you always want the pipeline full. With a longer pipe, it takes a lot more clock cycles to fill it up. Of course the big downside is if there's a miss in the data stream. The entire pipeline has to be flushed out and repacked. It's a big hit when the pipeline is 20 stages. But 20 stages allows for a lot of clock speed headroom.
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: rimshaker
In a nutshell, in order for the CPU core to run efficiently, you always want the pipeline full. With a longer pipe, it takes a lot more clock cycles to fill it up. Of course the big downside is if there's a miss in the data stream. The entire pipeline has to be flushed out and repacked. It's a big hit when the pipeline is 20 stages. But 20 stages allows for a lot of clock speed headroom.

That doesn't really explain why a longer pipeline lets you clock higher. I gave an explanation in the quantum computing thread...

Let's take a VERY simple processor. It is just a programmable calculator: the instructions available are add a, b, c and subtract a, b, c. (a, b, c are numbers in memory; there is no way to load these numbers from constants.) One way to do it would be to do the following all in one clock cycle:

1. read the instruction and figure out what we're going to do
2. read memory location a
3. read memory location b
4. perform the add or subtract
5. write the result to location c

With this setup, the IPC is exactly 1, because one instruction takes one (VERY long) clock cycle. Now, let's improve this design. We're going to have 5 clock cycles per instruction, and each doing one of the 5 things above. So, on cycle 1, we decide what to do, on cycle 2, we read a, on cycle 3, we read b, and so on. Note that the IPC will be 1/5th. The thing you have to remember is, ideally each of those steps takes 1/5th of the time, so the end result is the SAME performance.

A more advanced implementation is a pipelined processor - multicycle like the one described, but we do more than one thing at a time:
1. read instruction i
2. read a (for instruction i), and read instruction ii
3. read b (for instruction i), a (for instruction ii), and instruction iii
4. do the op for instruction i, read b for instruction ii, read a for instruction iii, and read instruction iv
5. write c for instruction i, operate for ii, read b for iii, read a for iv, and read the instruction v
6. store c for ii, operate for iii, read b for iv, read a for v, and read vi

(note that this requires the ability to do 3 or 4 memory accesses in a cycle, which I didn't have in the other 2, but for the sake of understanding the concepts this can be ignored)

A picture would really help, but I don't have one offhand. To see how this performs, note that a given instruction takes 5 cycles from start to finish, but at any time, multiple instructions are being processed. Also, every single cycle, one instruction is completed (well, from the 5th cycle forward). So, the IPC is 1, even though each individual instruction takes a bunch of cycles, and the actual performance of the machine is 5 times the performance of the original, since the clock is 5 times faster.
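The three designs above can be compared with a rough timing sketch. This is only a toy model under the "ideal" assumption stated earlier, that each of the 5 steps takes exactly the same time; the function names and the time unit are made up for illustration, and stalls and memory-port limits are ignored.

```python
# Toy timing model for the three designs described above.
# Assumption: each of the five steps takes exactly 1 time unit (the ideal case).
STEPS = 5          # decode, read a, read b, operate, write c
STEP_TIME = 1.0    # hypothetical time units per step


def single_cycle_time(n):
    """One very long clock cycle per instruction (cycle = all 5 steps)."""
    return n * STEPS * STEP_TIME


def multicycle_time(n):
    """Five short cycles per instruction, only one instruction in flight."""
    return n * STEPS * STEP_TIME


def pipelined_time(n):
    """Five short cycles to fill the pipe, then one instruction per cycle."""
    if n == 0:
        return 0.0
    return (STEPS + (n - 1)) * STEP_TIME


if __name__ == "__main__":
    n = 1000
    print("single-cycle:", single_cycle_time(n))
    print("multicycle:  ", multicycle_time(n))
    print("pipelined:   ", pipelined_time(n))
    print("speedup:     ", single_cycle_time(n) / pipelined_time(n))
```

For a long run of instructions the speedup approaches 5, which matches the "5 times the performance" figure; the few extra cycles are the cost of filling the pipeline at the start.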

Now, a modern processor is MUCH more advanced than this - there are multiple pipelines working on multiple instructions, instructions are executed out of order, etc., so you can't just do a simple analysis like this to see how an Athlon will perform vs. a P4. In general, a longer pipeline lets you do less in each stage, so you can clock the design faster. The P4's 20-stage pipeline lets it run at up to 3 GHz currently, whereas the shorter pipeline of the Athlon results in more work per clock, and therefore a slower max clock speed. (There are other factors that play into this - Intel may have better chip fabs than AMD, so their transistors are better, and the P4 and Athlon both execute instructions VERY VERY differently, but again that is beyond the scope of this course - other than the fact that transistor speed doesn't really affect IPC, just the clock rate. I'll shut up before I go over my own head.)

Fanboys will often say that AMD is "more efficient" because more work is done per clock, but the goal of just high IPC is stupid if you look at the above examples - you have to consider how fast you can clock the machine as well. A good processor is a fast processor - if Via came out with a chip powered by sewage running through pipes that gave you 6000FPS in Doom III, does it really matter that the implementation is ugly and smelly?

edit: ok here is a simple pipeline example. One machine washes AND dries your clothes. It takes 1 hour per cycle. Another setup has separate washers and driers. Each cycle takes half an hour, but it takes 2 cycles to do a load of clothes. The absurd extreme would be to have Soak, agitate, rinse, spin all in separate machines, and a bunch of drying stages, with each stage maybe 5 minutes.
 

isaacmacdonald

Platinum Member
Jun 7, 2002
umm...that didn't really explain it either.

the quote explained the concept of instructions well, but it simply said that the P4 has a longer pipeline, thereby allowing for greater MHz. It didn't explain WHY a longer pipeline = more MHz. In fact, aside from better manufacturing methods, the description of instructions per cycle versus MHz capacity sounded very much like a zero-sum situation.

can anyone explain specifically why a longer pipeline gives you more headroom in the MHz dept? I'm also very interested after reading that AnandTech article.
 

f95toli

Golden Member
Nov 21, 2002
I read about a 100 GHz chip made by IBM recently. If I understand what you wrote here correctly, that only means that one instruction is completed every 1/100e9 s; it doesn't really mean that one instruction will go through the WHOLE chip in 1/100e9 s, right?

So if I wanted to add two numbers (coming from outside the chip) it wouldn't really take 1/100e9 s (assuming the ADD could be done in one clock cycle)? The minimum amount of time would be (the length of the pipeline)*1/100e9 s.
Is this correct?
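The formula in the question above separates two quantities: how long one instruction takes end to end (latency) and how often a result comes out (throughput). A minimal sketch, assuming an ideal pipeline with no stalls where every instruction passes through every stage; the 20-stage depth used in the example is a hypothetical number, not a property of the IBM chip.

```python
def latency_seconds(stages, clock_hz):
    """Time for one instruction to pass through every stage of the pipeline."""
    return stages / clock_hz


def results_per_second(clock_hz):
    """Ideal throughput: one instruction completes every clock cycle."""
    return clock_hz


if __name__ == "__main__":
    clock_hz = 100e9   # the 100 GHz figure from the post
    stages = 20        # hypothetical pipeline depth, for illustration only
    print("latency (s):      ", latency_seconds(stages, clock_hz))
    print("results per second:", results_per_second(clock_hz))
```

So under these assumptions a single add takes (pipeline length) clock periods to come out, while a steady stream of adds still finishes one per clock period.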


 
Jun 26, 2002

About the longer pipeline allowing for more MHz, let's try this.

Let's say you have to run 20 instructions in the pipeline, taking 20 clock cycles. 10 of these instructions take half as long to run as the other 10. But since each instruction is allowed 1 clock cycle, everything runs at the same speed. If you then allow the slower 10 instructions 2 clock cycles each before getting the data, and allow the faster 10 still only 1 clock cycle, you make the pipeline 30 clock cycles long, but still run the same 20 instructions. This is what Intel did to try to optimize the P4, but they didn't account for other side effects very well.

In other words, if you run a chip at 100 MHz and have 1 cycle per instruction, you can change it to allow 2 cycles per instruction and change the clock speed to 200 MHz. You get the same thing done, but run at a faster clock speed.
 

Venix

Golden Member
Aug 22, 2002
Let's ignore the lower-level transistor stuff and assume that we're building a processor with logic gates. Logic gates have a propagation delay between when the signal enters and exits the gate. Assume that all the gates we use in our processor have a 10 ns delay, and assume that one section of the processor has 50 of these gates in series, meaning we have to wait 500 ns from the time when the data is input to when the result is available. This means instructions can be executed at most 2,000,000 times per second (1/500E-9), or at a frequency of 2 MHz.

To pipeline the processor, we can put a latch after the 25th gate. What the latch does is grab whatever signal is on its input when the clock ticks, and set its output to hold that signal until the next clock tick. Now, after 25 gates (250 ns), the signal is latched and held constant for the remaining 25 gates, meaning we can increase the clock to 1/250E-9, or 4 MHz. The instruction will now take two clock cycles to complete, but the benefit of pipelining is that after the signal is latched, the first half of the pipeline is free to be used by another instruction, so we'll effectively have one instruction coming out of the pipeline every clock cycle.
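The arithmetic above generalizes to any number of latches. A rough sketch, using the 10 ns gate delay and 50 gates from the example; the function name is made up, the gates are assumed to split evenly between stages, and the delay of the latches themselves is ignored.

```python
GATE_DELAY_NS = 10     # per-gate propagation delay, from the example above
GATES_IN_SERIES = 50   # total gates on the path, from the example above


def max_clock_mhz(gates_per_stage, gate_delay_ns=GATE_DELAY_NS):
    """Highest clock rate allowed by the slowest stage, in MHz.

    Assumption: latch delay is ignored, so only gate delays count.
    """
    stage_delay_ns = gates_per_stage * gate_delay_ns
    return 1000.0 / stage_delay_ns   # 1 / ns is GHz; times 1000 gives MHz


if __name__ == "__main__":
    for stages in (1, 2, 5):
        gates_per_stage = GATES_IN_SERIES // stages
        print(stages, "stage(s):", gates_per_stage, "gates per stage,",
              max_clock_mhz(gates_per_stage), "MHz")
```

In a real design each latch adds some delay of its own, so doubling the number of stages gives somewhat less than double the clock.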
 

f95toli

Golden Member
Nov 21, 2002
So how long does it take to add two numbers x and y using a 3 GHz P4?
Assuming you start counting the time from when the data "enters" the processor. Not 1/3e9 s, right?

 

yodayoda

Platinum Member
Jan 8, 2001
as i understood it, having longer pipelines does not per se give you higher clockspeeds, but rather because of the architecture of the chip, you are doing less work per clock tick. by dividing a given instruction into smaller steps, you may more efficiently package the architecture of a chip, which would allow you to run at higher clockspeeds. the downside is that if you have too long of a pipeline, your IPC (instructions per clock cycle) rate decreases because of large penalties for branch instructions. that's why a 3GHz intel chip (with 20 stage pipeline) works like a 2GHz amd chip (with 12 stage pipeline) or like a 1GHz apple/motorola chip (with 4 stage pipeline).
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: yodayoda
as i understood it, having longer pipelines does not per se give you higher clockspeeds, but rather because of the architecture of the chip, you are doing less work per clock tick. by dividing a given instruction into smaller steps, you may more efficiently package the architecture of a chip, which would allow you to run at higher clockspeeds. the downside is that if you have too long of a pipeline, your IPC (instructions per clock cycle) rate decreases because of large penalties for branch instructions. that's why a 3GHz intel chip (with 20 stage pipeline) works like a 2GHz amd chip (with 12 stage pipeline) or like a 1GHz apple/motorola chip (with 4 stage pipeline).

That isn't necessarily true... first off, read any benchmark - the Motorola chips get destroyed by the x86 competitors. If IPC were linearly affected by pipeline stages, the 1 GHz chip wouldn't be getting smacked. Also, I'm pretty sure it takes the Athlon more than 2 GHz for the 3000+ rating.