- Jan 20, 2002
Having just read the Banias technology article, I realised that I haven't got a clue why increasing the length of a pipeline allows higher frequencies. Anyone care to explain?
Originally posted by: rimshaker
In a nutshell, for the CPU core to run efficiently, you always want the pipeline full. With a longer pipe, it takes a lot more clock cycles to fill it up. Of course, the big downside comes when there's a miss in the instruction stream, such as a mispredicted branch: the entire pipeline has to be flushed out and refilled. That's a big hit when the pipeline is 20 stages. But 20 stages allows for a lot of clock speed headroom.
Let's take a VERY simple processor. It is just a programmable calculator: the instructions available are add a, b, c and subtract a, b, c (a, b, and c are numbers in memory; there is no way to load these numbers from constants). One way to do it would be to do all of the following in one clock cycle:
1. read the instruction and figure out what we're going to do
2. read memory location a
3. read memory location b
4. perform the add or subtract
5. write the result to location c
With this setup, the IPC is exactly 1, because one instruction takes one (VERY long) clock cycle. Now, let's improve this design. We're going to have 5 clock cycles per instruction, and each doing one of the 5 things above. So, on cycle 1, we decide what to do, on cycle 2, we read a, on cycle 3, we read b, and so on. Note that the IPC will be 1/5th. The thing you have to remember is, ideally each of those steps takes 1/5th of the time, so the end result is the SAME performance.
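To put numbers on that, here is a small sketch. The 5 ns of total work per instruction is a made-up figure for illustration, and the even split across steps is the ideal case described above, not a measurement of any real chip.

```python
# Hypothetical figure: assume all five steps together need 5 ns of logic time.
TOTAL_WORK_NS = 5.0
STEPS = 5

def time_per_instruction_ns(cycles_per_instruction, cycle_time_ns):
    """Wall-clock time to finish one instruction."""
    return cycles_per_instruction * cycle_time_ns

# Single-cycle design: one long cycle that does everything (IPC = 1).
single_cycle = time_per_instruction_ns(1, TOTAL_WORK_NS)

# Multicycle design: five short cycles, each ideally 1/5th as long (IPC = 1/5).
multi_cycle = time_per_instruction_ns(STEPS, TOTAL_WORK_NS / STEPS)

print(f"single-cycle: {single_cycle} ns per instruction")
print(f"multicycle:   {multi_cycle} ns per instruction")
```

Both designs come out to the same time per instruction: the clock is 5 times faster, but each instruction needs 5 of those cycles.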
A more advanced implementation is a pipelined processor - multicycle like the one described, but we do more than one thing at a time:
1. read instruction i
2. read a (for instruction i), and read instruction ii
3. read b (for instruction i), a (for instruction ii), and instruction iii
4. do the op for instruction i, read b for instruction ii, read a for instruction iii, and read instruction iv
5. write c for instruction i, operate for ii, read b for iii, read a for iv, and read the instruction v
6. store c for ii, operate for iii, read b for iv, read a for v, and read vi
(note that this requires the ability to do 3 or 4 memory accesses in a cycle, which I didn't have in the other 2, but for the sake of understanding the concepts this can be ignored)
A picture would really help, but I don't have one offhand. To see how this performs, note that a given instruction takes 5 cycles from start to finish, but at any time, multiple instructions are being processed. Also, every single cycle, one instruction is completed (well, from the 5th cycle forward). So, the IPC is 1, even though each individual instruction takes a bunch of cycles, and the actual performance of the machine is 5 times the performance of the original, since the clock is 5 times faster.
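In place of a picture, here is a rough sketch that prints which instruction is in which stage on each cycle for the idealized 5-stage pipeline above. The stage names and helper functions are my own labels for illustration, and it assumes no stalls of any kind.

```python
# Stage names are illustrative labels for the five steps of the toy processor.
STAGES = ["fetch", "read a", "read b", "op", "write c"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each cycle (1-based) to a list of (instruction number, stage name)."""
    schedule = {}
    for i in range(num_instructions):
        for s, name in enumerate(stages):
            cycle = i + s + 1  # instruction i+1 reaches stage s on this cycle
            schedule.setdefault(cycle, []).append((i + 1, name))
    return schedule

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish everything: fill the pipe once, then one result per cycle."""
    return num_stages + num_instructions - 1

def print_diagram(num_instructions):
    schedule = pipeline_schedule(num_instructions)
    for cycle in sorted(schedule):
        work = ", ".join(f"{name} (instr {i})" for i, name in schedule[cycle])
        print(f"cycle {cycle}: {work}")

print_diagram(6)

n = 100
# Compare against the unpipelined multicycle design, which needs 5 cycles each.
speedup = (n * len(STAGES)) / total_cycles(n)
print(f"{n} instructions: {total_cycles(n)} cycles, speedup {speedup:.2f}x")
```

The speedup only approaches 5 for long runs of instructions, because the first few cycles are spent filling the pipe.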
Now, a modern processor is MUCH more advanced than this - there are multiple pipelines working on multiple instructions, instructions are executed out of order, etc., so you can't just do a simple analysis like this to see how an Athlon will perform vs. a P4. In general, a longer pipeline lets you do less in each stage, so you can clock the design faster. The P4's 20-stage pipeline lets it run at up to 3GHz currently, whereas the shorter pipeline of the Athlon results in more work per clock, and therefore a slower max clock speed. (There are other factors that play into this - Intel may have better chip fabs than AMD, so their transistors are better, and the P4 and Athlon both execute instructions VERY VERY differently, but again that is beyond the scope of this course - other than the fact that transistor speed doesn't really affect IPC, just the clock rate. I'll shut up before I go over my own head.)
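The "less work per stage, so faster clock" idea can be sketched with a toy model. Everything here is an assumption for illustration: the 10 ns of total logic delay, the 0.1 ns of fixed per-stage overhead (for the latches between stages), and the idea that the work splits perfectly evenly. None of these numbers describe a real P4 or Athlon.

```python
def max_clock_ghz(stages, logic_ns=10.0, overhead_ns=0.1):
    """Toy model: cycle time = even share of the logic + fixed per-stage overhead.

    logic_ns and overhead_ns are hypothetical values, not real chip data.
    """
    cycle_time_ns = logic_ns / stages + overhead_ns
    return 1.0 / cycle_time_ns  # 1 / ns is GHz

for depth in (1, 5, 10, 20, 40):
    print(f"{depth:2d} stages -> about {max_clock_ghz(depth):.2f} GHz")
```

In this model, the fixed overhead is why doubling the number of stages gives less than double the clock: each extra stage adds its own latch delay on top of a shrinking slice of real work.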
Fanboys will often say that AMD is "more efficient" because more work is done per clock, but the goal of just high IPC is stupid if you look at the above examples - you have to consider how fast you can clock the machine as well. A good processor is a fast processor - if Via came out with a chip powered by sewage running through pipes that gave you 6000FPS in Doom III, does it really matter that the implementation is ugly and smelly?
Originally posted by: yodayoda
as i understood it, having longer pipelines does not per se give you higher clockspeeds, but rather because of the architecture of the chip, you are doing less work per clock tick. by dividing a given instruction into smaller steps, you may more efficiently package the architecture of a chip, which would allow you to run at higher clockspeeds. the downside is that if you have too long of a pipeline, your IPC (instructions per clock cycle) rate decreases because of large penalties for mispredicted branch instructions. that's why a 3GHz Intel chip (with 20-stage pipeline) works like a 2GHz AMD chip (with 12-stage pipeline) or like a 1GHz Apple/Motorola chip (with 4-stage pipeline).
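A back-of-the-envelope sketch of that tradeoff: it assumes an otherwise ideal IPC of 1, that a mispredicted branch costs roughly one full pipeline's worth of cycles, and made-up values for how often branches occur (20%) and how often they are mispredicted (10%). The clock speeds and depths are just the ones quoted above, so the output is illustrative and not a benchmark of any real chip.

```python
def effective_ipc(pipeline_depth, branch_fraction=0.2, mispredict_rate=0.1):
    """Toy model: average cycles per instruction = 1 + expected flush penalty.

    Assumes the flush penalty equals the pipeline depth; the branch numbers
    are hypothetical defaults, not measurements.
    """
    penalty_cycles = pipeline_depth
    cpi = 1.0 + branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / cpi

def instructions_per_ns(clock_ghz, pipeline_depth):
    """Throughput = clock rate times effective IPC."""
    return clock_ghz * effective_ipc(pipeline_depth)

for clock_ghz, depth in ((3.0, 20), (2.0, 12), (1.0, 4)):
    print(f"{clock_ghz} GHz, {depth} stages: IPC {effective_ipc(depth):.2f}, "
          f"{instructions_per_ns(clock_ghz, depth):.2f} instructions/ns")
```

The deeper pipeline loses IPC to flushes, but whether it loses overall depends on how much clock speed it gained in exchange, which is why both numbers have to be looked at together.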