6 stage pipeline vs 20 stage pipeline

popo

Member
Jan 2, 2002
46
0
0
Howdy crew,

In summary: How does a longer pipeline increase IPC?

How does a 6-stage pipeline slow down the instructions per clock cycle while a 20-stage pipeline speeds the IPC up?

I thought that having a longer pipeline would slow things down, i.e. more to do.

Pointers ??

Thanks
- Alex
 

Chesebert

Golden Member
Oct 16, 2001
1,012
13
81
IPC is not determined by the pipeline length

Unless the processor is an out-of-order execution machine where the extra pipeline stages are added in the execution part of the processor. That is the equivalent of adding more execution engines, and of course in that case your IPC would increase.

But if you're just adding stages to increase the MHz, then your IPC would not change (...hmm..I could be wrong)

The fact is that modern processors are so complex that without all the information I can't really tell you how IPC is affected by the pipeline length.

correct me if I am wrong :)

 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Actually, the effect is the opposite of what you've stated. In theory, any pipeline that is always full, regardless of its length (assuming a single scalar MPU), has an IPC of 1: it completes one instruction per cycle. Given two MPUs of similar architecture, the one with the longer pipeline, which is likely to have a faster clock rate, will thus be faster. In practice the clock-rate gain is limited by latch overhead and clock skew, so, for example, the 20-stage P4 demonstrates about a 1.8x increase in clock rate over the 10-stage P3 (comparing both at .18u).
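The latch-overhead point above can be sketched with a toy model. All numbers here are illustrative assumptions, not measurements of any real chip: a fixed amount of logic delay per instruction is sliced across more stages, but each stage pays a fixed latch/skew cost that doesn't shrink.

```python
# Back-of-the-envelope model of clock rate vs. pipeline depth.
# The logic delay and latch overhead below are made-up numbers
# chosen only to illustrate the effect described in the post.

def cycle_time(total_logic_ns, stages, latch_overhead_ns):
    """Cycle time = logic delay split across stages + per-stage latch overhead."""
    return total_logic_ns / stages + latch_overhead_ns

logic = 10.0    # ns of total logic delay per instruction (assumed)
overhead = 0.1  # ns of latch/skew overhead per stage (assumed)

t10 = cycle_time(logic, 10, overhead)  # 10-stage pipeline (P3-like)
t20 = cycle_time(logic, 20, overhead)  # 20-stage pipeline (P4-like)

# Doubling the stage count does NOT double the clock rate, because
# the per-stage latch overhead is not divided.
print(f"clock-rate gain: {t10 / t20:.2f}x")  # ~1.83x, not 2x
```

With these particular numbers the model lands near the ~1.8x P4-over-P3 figure quoted above, which is a coincidence of the chosen constants, not a derivation of the real design.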

In practice, modern superscalar MPUs exhibit both dynamic branch prediction and speculative execution. When a branch is encountered, its outcome is predicted (statically or dynamically) and execution follows along that path. If the prediction from the branch-target buffer turns out to be incorrect, the reorder buffers must be flushed and processing begins along the correct path. For an MPU with a longer pipeline, it takes more clock cycles to refill the pipeline; thus, for a given branch prediction rate, the longer pipeline may exhibit a lower IPC. In addition, while not directly related to pipeline length, the MPU with the faster clock rate will experience a greater penalty due to memory stalls that arise from cache misses. From the perspective of the MPU, missing the cache and going to main memory takes over 100 cycles, so any other dependent instructions that exhibit data hazards from that memory load will stall (though the effect is ameliorated by the presence of the dynamically scheduled pipeline and reorder buffers).
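The branch-mispredict argument above can be put into a rough formula: every mispredicted branch costs a pipeline refill, and a deeper pipe pays more refill cycles. The branch frequency and prediction rate below are assumptions for illustration only.

```python
# Effective IPC of a scalar pipeline with branch mispredictions.
# All rates and penalties are illustrative assumptions.

def effective_ipc(branch_freq, mispredict_rate, flush_penalty_cycles):
    """IPC = 1 / CPI, where each instruction costs 1 base cycle plus
    the expected pipeline-refill cycles from mispredicted branches."""
    cpi = 1.0 + branch_freq * mispredict_rate * flush_penalty_cycles
    return 1.0 / cpi

branch_freq = 0.20  # ~1 in 5 instructions is a branch (assumed)
mispredict = 0.05   # 95% prediction accuracy (assumed)

short_pipe = effective_ipc(branch_freq, mispredict, 5)   # ~6-stage refill
long_pipe = effective_ipc(branch_freq, mispredict, 19)   # ~20-stage refill

print(f"short pipe IPC: {short_pipe:.3f}")  # 0.952
print(f"long pipe  IPC: {long_pipe:.3f}")   # 0.840
```

Same prediction accuracy, lower IPC on the deeper pipe — exactly the trade the post describes.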

There has been a bit of debate over this issue in the scholarly world, but "superpipelining" beyond the basic 5-stage MIPS pipeline has proven quite effective, given the added complexity of the dynamically scheduled, superscalar MPUs since the mid-90s. In many cases, more pipeline stages may be absolutely necessary; for example, the single instruction-decode stage may need to be split into separate issue and read-operand stages for a dynamically scheduled pipeline. Obviously the P4's 20-stage integer pipeline from fetch through retirement is longer than the older 7- to 10-stage superpipelines, but other designs have begun to show longer pipes. The IBM Power4 has a 14-stage pipeline, and the now-defunct Alpha EV8 had a planned 18-stage pipeline...though in the latter case not necessarily to reach stellar clock speeds: it had to deal with the complexity of an SMT core with 8 thread contexts, a huge 512x64-bit register file, and 16 functional units, among other things. I did read a paper a while back (published by a researcher at Intel, but still worth discussing) that explored the ideal pipeline depth of a dynamically scheduled, speculative MPU; given a range of pipeline latch overheads, pipelines of up to 40 to 70 stages may be possible. I'll see if I can find a link to that paper....
 

AndyHui

Administrator Emeritus / Elite Member / AT FAQ M
Oct 9, 1999
13,141
16
81
Check out the answer PM and I put together in this thread.

It should give you a pretty good answer (especially Patrick's).
 

Chesebert

Golden Member
Oct 16, 2001
1,012
13
81
Just a minor point of note:

a single pipeline does not have an IPC of 1, for the following reasons:

1. branch mispredictions (mitigated with a branch prediction unit and branch history table)
2. load stalls (speculative loading...hmm..that's complex to implement)
3. fetch stalls (prefetching)

And these are just from the CPU's point of view (wait till you go down the memory lane - L2, L3, MEM, HD :)

Your IPC would go down the toilet if you found that the data you need is actually in the virtual memory that's residing on the HD...ouch!

 

popo

Member
Jan 2, 2002
46
0
0
Originally posted by: Sohcan
In theory, any pipeline that is always full, regardless of its length (assuming a single scalar MPU)....


What is an MPU ?

- Alex
 

blawson

Junior Member
May 20, 2002
12
0
0
The basic gist of it is that even though any pipeline will be finishing one instruction per cycle, a 20-stage pipeline will (ideally) be working on 20 instructions at once, vs. 6 in the 6-stage pipeline.

Another thing to keep in mind: by increasing the length of the pipeline, you are (I believe) basically breaking up the instruction execution into smaller chunks, i.e. you are still finishing an instruction in about the same time, it just happens to take 20 stages as opposed to 6. One cycle then becomes a smaller unit of time (in this case roughly 1/3 of the previous cycle time), and you become able to pump out more instructions -- roughly 20 now in the time it took you to do 6 before. This of course is ideal, and is subject to your usual branch hazards and data dependency hazards.
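This "same latency, higher throughput" idea can be sketched numerically. The latencies below are made-up round numbers: ~6 ns of total work, sliced either into 6 stages of 1.0 ns or 20 stages of 0.3 ns.

```python
# Ideal always-full pipeline: instruction i completes after the pipe
# fills (one cycle per stage) plus one additional cycle per prior
# instruction. All timings are illustrative assumptions.

def completion_times(n_instructions, stages, cycle_ns):
    """Time at which each of n instructions leaves an ideal pipeline."""
    return [(stages + i) * cycle_ns for i in range(n_instructions)]

# Same total logic latency (~6 ns) sliced two ways:
six_stage = completion_times(100, 6, 1.0)      # 6 stages x 1.0 ns
twenty_stage = completion_times(100, 20, 0.3)  # 20 stages x 0.3 ns

print(f"first result:  {six_stage[0]:.1f} ns vs {twenty_stage[0]:.1f} ns")
print(f"100th result: {six_stage[-1]:.1f} ns vs {twenty_stage[-1]:.1f} ns")
```

The first instruction takes the same 6 ns either way, but the deep pipe finishes 100 instructions in about a third of the time — the ideal case, before any hazards flush it.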

Sohcan is obviously very intelligent, but I thought I'd add my two cents (got to work up the posts somehow) :).
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
What is an MPU ?
Microprocessor unit, a more generalized term for CPU....I guess I use it out of force of habit.

I didn't mean to clutter my previous post with the dynamic scheduling terms (if they were confusing), I was about to go to bed and I rambled a bit. :)
 

thorin

Diamond Member
Oct 9, 1999
7,573
0
0
"Another thing to keep in mind: by increasing the length of the pipeline, you are (I believe) basically breaking up the instruction execution into smaller chunks, i.e. you are still finishing an instruction in about the same time, it just happens to take 20 stages as opposed to 6. One cycle then becomes a smaller unit of time (in this case roughly 1/3 of the previous cycle time), and you become able to pump out more instructions -- roughly 20 now in the time it took you to do 6 before. This of course is ideal, and is subject to your usual branch hazards and data dependency hazards."

Hmmm, I'm not sure I agree with this, since I've seen tables that state point blank that an Athlon XP has an IPC count of 9 while the P4, with its longer pipe, has an IPC count of 6.

As explained here:
Still, even with that in mind, it's obvious that clock-frequency isn't the sole deciding factor in system performance. If it was, the P4 would have crushed the XP in the marketplace long ago. In reality, the number of instructions a CPU can actually complete per cycle (expressed as "IPC") is just as important as the number of cycles it goes through in a second. The Pentium 4, with its ultra-long hyperpipeline, is able to achieve astronomic clock frequencies, but at the price of lower IPC performance. The Athlon XP, on the other hand, goes through fewer cycles per second, but manages to get more work done on each pass -- 9 instructions per cycle, as opposed to the P4's 6 -- giving it a 150% advantage in IPC.
Thorin
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Hmmm, I'm not sure I agree with this, since I've seen tables that state point blank that an Athlon XP has an IPC count of 9 while the P4, with its longer pipe, has an IPC count of 6.
This is a common misconception...while the Athlon does issue up to 9 uops/cycle from its reorder buffers, and the P4 six, issue rate (or number of functional units, for that matter) does not uniquely determine IPC. Dynamically scheduled microarchitectures decouple the front-end fetch mechanism from the back-end scheduling, execution, and retirement mechanism; peak IPC is determined by the fetch and retire rate, which is often less than the peak issue rate. The P4 and Athlon, despite their difference in issue rate, are still both essentially 3-way fetch superscalar cores. While the Athlon fetches/decodes up to 3 x86 instructions/cycle into uops (average 1 to 2 uops/x86 instruction) and the P4 fetches 3 uops/cycle from its trace cache, they still both retire 3 uops/cycle despite their difference in issue rates from their reorder buffers. Note that the P3 issues 5 uops/cycle from its issue ports vs. the P4's 6 uops/cycle; the P4 certainly doesn't have a 20% higher IPC than the P3, nor the Athlon 80% higher. In terms of x86 instructions, even the Athlon rarely achieves an IPC higher than 1.2 x86 instructions/cycle.
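The bottleneck argument above boils down to a min() over pipeline widths: a wide issue stage can't lift sustained throughput past the narrowest stage. The widths below are the uops/cycle figures from the post itself.

```python
# Sustained uop throughput is capped by the narrowest pipeline stage,
# not the widest. Widths (uops/cycle) are those quoted in the post.

def sustained_ipc_cap(**stage_widths):
    """Upper bound on sustained uops/cycle: the minimum stage width."""
    return min(stage_widths.values())

athlon = sustained_ipc_cap(fetch=3, issue=9, retire=3)
p4 = sustained_ipc_cap(fetch=3, issue=6, retire=3)

print(athlon, p4)  # both capped at 3 uops/cycle, despite 9-wide vs 6-wide issue
```

So the "9 vs 6" table compares issue widths, while both cores fetch and retire 3 uops/cycle — hence the nearly identical ceilings.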

In actuality, IPC is determined by fetch/schedule/issue/retire rate; number and organization of functional units; pipelined instruction latency; reorder window size; number of renaming registers; in-order vs. out-of-order execution; speculative vs. non-speculative execution; pipeline length/branch mispredict penalty; clock rate; branch prediction/branch target buffer organization and accuracy; multilevel cache size, bandwidth, latency, associativity, block size, replacement algorithms, write-through/write-back characteristic; main memory latency and bandwidth; ISA characteristics: number of logical registers, number of operands; the compiler; the software;...and the kitchen sink.

In a similar fashion, it is often misconceived that the Athlon's 3 FP units (FP move, FP add, FP multiply) are responsible for its higher x87 FP performance over the P4, which has 2 FP units (FP move, FP add/multiply). Yet at the same time, the P3 has the same basic FP unit organization as the P4 (and actually can only issue one FP uop/cycle vs. the P4's two), and still was very competitive with the Athlon at the same clock rate. Likewise, the Alpha EV6 and EV7 have two FP units vs. the Athlon's three (though the former are more symmetric IIRC), while the EV7 at 1.2GHz may come close to doubling the SPECFP 2K performance of the 1.8 GHz Athlon XP. In practice, FP performance is determined more by reorder window size, cache and system bandwidth, pipelined instruction latencies, and ISA characteristics, among other things. Also, in the case of the P4, it is limited to fetching 1 FXCH instruction/cycle out of its trace cache (vs. the P3 and Athlon's 3); the FXCH instruction is heavily used in modern x87 software to make x87's FP stack behave like a flat register file. I've also read that the P4 is more sensitive to memory data alignment than previous x86 cores.

edit:
I'll try to be a little more clear on the width of the P4 and Athlon's pipeline. For the front-end (fetch, decode (for the Athlon), register rename and dispatch), the Athlon can fetch up to 3 x86 instructions/cycle and decode them into 3 to 6 uops...a register-memory x86 arithmetic operation gets decoded into a single register-register arithmetic uop and a load/store uop, while a register-register x86 instruction gets decoded into a single uop. It can then rename (72 total renaming registers) and dispatch up to 3 uops/cycle to the back-end's reorder buffers. In contrast, the P4 fetches up to 3 uops/cycle (actually 6 uops every other cycle) from its trace cache, and renames (126 total renaming registers)/dispatches 3 uops/cycle into the back end.

In the back end (schedule, execution, and retirement), the P4 has a reorder window of 128 instructions vs. the Athlon's 72. The Athlon can issue 3 integer execution uops/cycle, 3 address-generation uops/cycle (which then issue to a 2 uop/cycle load/store unit), and 3 FP uops/cycle (one FP move, one FP add, one FP multiply). The P4, in contrast, shares some issue ports between its integer and FP units. It can issue 6 uops/cycle from four ports, two of which issue 2 uops/cycle, since the two "double-speed" ALUs can each issue two (dependent or non-dependent) arithmetic uops/cycle. The lesser-used non-ALU integer execution unit is "normal speed." Also, the P4's memory units handle both address generation and loads/stores (unlike the Athlon's, which are separate). Thus port 0 can issue a single uop in the first half of a clock cycle to either the first fast ALU or the FP-move unit; in the second half of the clock cycle, it can issue another uop to the fast ALU. Port 1 can issue a uop in the first half of a clock cycle to the second fast ALU, the "slow" ALU, or the FP-execute unit; likewise, in the second half of the clock cycle, it can issue another uop to the second fast ALU. The remaining two ports for the load/store queue can respectively issue a load and a store uop each cycle. Finally, both the Athlon (IIRC) and the P4 retire 3 uops/cycle.