IPC of processors

imgod2u

Senior member
Sep 16, 2000
Ok, the more I look at the IPC of a processor, the more confused I get. My thinking was that an instruction went into the processor, went through the stages of the pipeline, and came out. That would make a CPU take in 1 instruction per clock and finish 1 instruction per clock, provided the pipeline was filled and there were no branch mispredicts. Then what is this I hear about 3-way superscalar designs finishing 3 instructions per clock? I mean, if a processor's pipeline completes 1 instruction per clock under perfect conditions, how does it do 3?
 

rimshaker

Senior member
Dec 7, 2001
Superscalar design and out-of-order execution. Instructions don't have to be fully completed before the next one is processed. There's no 'standing in line', in other words. Just for the record, the P4 can issue up to 6 uops per clock, the XP up to 9.
 

Sohcan

Platinum Member
Oct 10, 1999
The goal of most modern microprocessors is to extract parallelism from the code by executing multiple instructions at a time. Superscalar architecture, the approach most common in current microprocessors, fetches, issues, and executes multiple independent instructions at a time. You might conceptually think of superscalar as having multiple pipelines, but this is a bit of a misnomer.

A modern out-of-order superscalar MPU fetches and decodes multiple instructions per clock cycle, then stores the instructions in a reorder buffer. The reorder "window" is determined by the size of the buffer, which may be split into independent buffers for integer and FP instructions. The instructions are then dynamically scheduled and issued to execution units out-of-order, and the issue width may be larger than the fetch width: for example, the Athlon can fetch and decode 3 x86 instructions/cycle (on the P3, P4, and Athlon, x86 instructions are decoded into smaller "micro-ops", averaging 1 - 2 micro-ops per x86 instruction), yet it can issue 9 micro-ops out of the reorder buffers (to 3 integer, 3 address-generation, and 3 FP units). After execution, the results are retired in-order to preserve the context of the code.
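
To make the idea concrete, here's a toy issue model in Python (the instruction tuples and the 3-wide width are invented for illustration, not any real microarchitecture): each cycle, up to three instructions whose inputs aren't produced by an older, still-pending instruction leave the window together, so four instructions can finish in two cycles for an IPC of 2.

# Toy issue model (illustrative only, not any real microarchitecture):
# each cycle, up to WIDTH instructions may issue together, provided no
# older, still-pending instruction writes one of their source registers.
WIDTH = 3

# (dest, src1, src2) register numbers, in program order
program = [
    (1, 2, 3),    # R1  = R2 + R3
    (4, 5, 6),    # R4  = R5 + R6   (independent)
    (7, 8, 9),    # R7  = R8 + R9   (independent)
    (10, 1, 4),   # R10 = R1 + R4   (needs the first two results)
]

def simulate(program, width=WIDTH):
    pending = set(range(len(program)))
    cycles = 0
    while pending:
        cycles += 1
        issued = []
        for i in sorted(pending):
            if len(issued) == width:
                break
            # registers written by older instructions that have not yet
            # completed (same-cycle issues still count as not completed)
            busy = {program[j][0] for j in pending if j < i}
            if program[i][1] not in busy and program[i][2] not in busy:
                issued.append(i)
        pending -= set(issued)
    return len(program) / cycles

print("IPC =", simulate(program))   # 4 instructions in 2 cycles -> 2.0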

There are three main hazards that prevent the instruction throughput from reaching the maximum: data, memory, and control/branch hazards. Data hazards involve the use of data before it may be ready, and are divided into three categories: read-after-write, write-after-read, and write-after-write. For example, if (in pseudo-code, with RX representing register X) the following two instructions are presented:

R1 = R2 + R3
R4 = R5 + R1

...the use of register R1 presents a read-after-write conflict. The latter two data hazards are solved by register renaming: although an instruction set may only specify 32 (or 8 in the case of x86) architectural registers, many more physical registers may exist. Every left-hand-side use of a register aliases the architectural register onto a fresh physical register, and every right-hand-side use reads the current architectural -> physical mapping. This, unfortunately, does not solve read-after-write dependencies, so such instructions must be issued sequentially rather than simultaneously or out-of-order (not exactly true in the case of the P4's "double-pumped" ALUs, which can effectively issue a read-after-write-dependent pair of instructions in the same cycle). Note also that the number of physical renaming registers limits the reorder window.
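
A rough sketch of the renaming mechanism in Python (register names and the mapping scheme are invented for illustration): every write allocates a fresh physical register and every read goes through the current architectural-to-physical map, which is why write-after-read and write-after-write conflicts simply disappear.

# Minimal register-renaming sketch (invented example, not a real design)
phys_counter = 0
rename_map = {}               # architectural register -> physical register

def alloc():
    global phys_counter
    phys_counter += 1
    return "P%d" % phys_counter

def read(arch_reg):
    # a read uses whatever physical register currently backs arch_reg
    return rename_map.setdefault(arch_reg, alloc())

def write(arch_reg):
    # every write gets a brand-new physical register, so older readers
    # of the same architectural register keep their old copy intact:
    # no write-after-read or write-after-write hazard remains
    rename_map[arch_reg] = alloc()
    return rename_map[arch_reg]

# R1 = R2 + R3 ; R2 = R4 + R5  (a write-after-read conflict on R2)
s1, s2, d1 = read("R2"), read("R3"), write("R1")
s3, s4, d2 = read("R4"), read("R5"), write("R2")
print(d1, "=", s1, "+", s2)   # P3 = P1 + P2
print(d2, "=", s3, "+", s4)   # P6 = P4 + P5 (R2 remapped; P1 untouched)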

The second hazard, memory, involves the memory latency to execute a load or store instruction. If an arithmetic instruction uses a register after data in memory is loaded into said register, the arithmetic instruction must wait until the data is loaded. Fast caches reduce the latency, but a cache miss may mean the arithmetic instruction has to stall for 100 clock cycles or more. The out-of-order execution afforded by the reorder window ameliorates this effect, but this is still a major limiting factor in performance.
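
To put rough numbers on that (the figures below are assumptions for illustration, not measurements), even a 1% miss rate is devastating to the average throughput of a 3-way core:

# Back-of-envelope CPI impact of cache misses (all numbers assumed)
base_cpi     = 1 / 3.0   # ideal 3-way core: a third of a cycle per instr
mem_fraction = 0.5       # assume half the instructions touch memory
miss_rate    = 0.01      # assume 1% of those accesses miss
penalty      = 100       # assume a 100-cycle miss penalty

cpi = base_cpi + mem_fraction * miss_rate * penalty
print("CPI = %.2f -> IPC = %.2f" % (cpi, 1 / cpi))
# CPI = 0.83 -> IPC = 1.20: misses alone drag a 3-way core toward the
# 1 - 1.2 range quoted below; out-of-order execution claws back part
# of this by overlapping the stall with independent work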

The last hazard, control/branch, sounds like one you're familiar with....a mispredicted branch requires the pipeline(s) to be flushed and refilled, increasing execution latency.

In addition, available memory bandwidth is a factor. Assuming that 1/2 of the instructions are load/stores, the entire cache hierarchy has a 99% hitrate (with a 64-byte block size), and instructions and data are 4 bytes, a 2GHz 3-way fetch OOOE superscalar core needs around 30 GB/sec of cache bandwidth and around 5 GB/sec of main memory bandwidth to sustain the maximum IPC.
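
The arithmetic behind those figures runs roughly as follows (a sketch of the stated assumptions, counting GB as 10^9 bytes; the exact cache number depends on how you count accesses, hence "around"):

# Bandwidth estimate from the assumptions above (illustrative only)
clock      = 2e9    # 2 GHz
width      = 3      # 3-way fetch
instr_size = 4      # bytes per instruction
ls_frac    = 0.5    # half the instructions are loads/stores
miss_rate  = 0.01   # 99% hit rate across the cache hierarchy
block      = 64     # bytes per cache block

instr_bw = clock * width * instr_size             # instruction fetch
data_bw  = clock * width * ls_frac * instr_size   # load/store traffic
accesses = clock * width * (1 + ls_frac)          # fetches + data refs
mem_bw   = accesses * miss_rate * block           # each miss refills a block

print("cache  %.0f GB/s" % ((instr_bw + data_bw) / 1e9))   # ~36 GB/s
print("memory %.2f GB/s" % (mem_bw / 1e9))                 # ~5.76 GB/s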

Thus even though an MPU like the Athlon can fetch 3 x86 instructions/cycle and issue 9 micro-ops/cycle, average throughputs are closer to 1 - 1.2 x86 instructions/cycle. Aside from the fact that desktop MPUs have to make design cuts that high-end MPUs can afford to avoid, x86's parallelism is limited by its 8 architectural registers, its two-operand format, and too many flag-register dependencies.
 

imgod2u

Senior member
Sep 16, 2000
This is very interesting. I know that the pipeline works on multiple things at once (stage 1 works on one instruction while stage 2 works on another that entered earlier), but multiple pipelines? Also, since 1 operation doesn't necessarily mean 1 instruction, and an instruction may take as many as 3 FP micro-ops (the max on an Athlon), would that mean that if 1 instruction at the end of "one" of the pipelines took up all 3 FPUs during a clock, all the others would have to wait? I'm familiar with the concept of the pipeline and working on different instructions in different stages of execution, I just didn't know modern MPUs could execute multiple instructions at once.
 

bizmark

Banned
Feb 4, 2002
With regard to multiple pipelines, the AMD website says the Athlon XP has 3 integer pipelines and 3 floating-point pipelines, while the P4 has 4 integer pipelines and 2 FP pipelines. I am curious as to how this is actually implemented.

Are these true 'pipelines' like you guys are discussing here, or are these somehow a watered-down "layman's pipeline"? Are they completely run in parallel, or are they somehow sequential? When we say that the P4 has a 20-stage pipeline, which pipeline are we talking about? The longest one? The sum of all of them?
 

Sohcan

Platinum Member
Oct 10, 1999
Also, since 1 operation doesn't necessarily mean 1 instruction, and an instruction may take as many as 3 FP micro-ops (the max on an Athlon), would that mean that if 1 instruction at the end of "one" of the pipelines took up all 3 FPUs during a clock, all the others would have to wait?
Potentially; issuing of uops depends on the number of issue ports and execution units. But keep in mind that the x86 -> uop decoding often decouples the execution...the reason that, on average, an x86 instruction decodes to 1 to 2 uops is that x86 allows both registers and memory addresses to be specified as operands in an int or FP arithmetic instruction (whereas classic RISC only allows registers). In the case where a simple arithmetic instruction uses both a register and a memory address, it gets decoded into two uops: one arithmetic uop using only renaming registers, and a load/store uop that loads/stores the data to/from the specific renaming register. Once, for example, the load uop is completed, the single corresponding arithmetic uop can be issued. There are rarer cases where legacy x86 instructions may be decoded into a longer series of arithmetic uops, such as those for the transcendental functions (sine, cosine, tangent, etc). Few x86 instructions decode to more than 4 uops, though, and the average is around 1.5.
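
As an illustration of that decode split (the mnemonics and uop format below are invented for clarity; real P4/Athlon uop encodings differ), a reg/mem x86-style add becomes a load uop feeding a register-only arithmetic uop:

# Hypothetical reg/mem decode into uops (formats invented for clarity)
def decode(instr):
    op, dest, src = instr
    if src.startswith("["):              # memory operand?
        tmp = "Ptmp"                     # a renaming register
        return [("load", tmp, src),      # uop 1: bring the data in
                (op, dest, tmp)]         # uop 2: pure register arithmetic
    return [(op, dest, src)]             # register-only: a single uop

print(decode(("add", "eax", "[esi+8]")))
# [('load', 'Ptmp', '[esi+8]'), ('add', 'eax', 'Ptmp')]
print(decode(("add", "eax", "ebx")))
# [('add', 'eax', 'ebx')]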

Are these true 'pipelines' like you guys are discussing here, or are these somehow a watered-down "layman's pipeline"?
More like the latter....there aren't necessarily nine independent pipelines in the Athlon. It's best to conceive of an OOOE superscalar MPU as having two decoupled parts: the front end and the back end. The front end fetches, decodes, and (under some definitions) does the register renaming and instruction queuing; it typically supports a width of up to 3 to 4 instructions/cycle. The back end schedules and issues the instructions to multiple execution units (which may be thought of as independent pipelines), then retires instructions in-order.

Take, for example, the P4 (here's a simple block diagram). In the front end, the P4 can fetch up to 3 uops/cycle from its trace cache (actually 6 uops every other cycle); register renaming is then performed on all uops, and they are put into one of two queues: the load/store queue or the int/FP queue. Each queue has two scheduling/issuing ports that issue instructions to one of seven execution units: two "double-speed" ALUs for common int functions, one "normal-speed" ALU for shifts and rotates, a load unit, a store unit, an FP-move unit for loading/storing FP data, and an FP-execute unit. In the int/FP queue, port 0 can issue a single uop in the first half of a clock cycle to either the first fast ALU or the FP-move unit; in the second half of the clock cycle, it can issue another uop to the fast ALU. Port 1 can issue a uop in the first half of a clock cycle to the second fast ALU, the "slow" ALU, or the FP-execute unit; likewise, in the second half of the clock cycle, it can issue another uop to the second fast ALU. Ports 3 and 4 for the load/store queue can respectively issue a load and a store uop each cycle. After execution, results are written back at the rate of (IIRC) 3 uops/cycle.
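
Condensing that port layout into a small table-like structure (my summary of the description above; treat the exact assignments as approximate):

# P4 issue ports and execution units as described above (approximate)
ports = {
    "port 0 (int/FP queue)": ["fast ALU #1 (one uop each half-cycle)",
                              "FP move"],
    "port 1 (int/FP queue)": ["fast ALU #2 (one uop each half-cycle)",
                              "slow ALU (shifts/rotates)", "FP execute"],
    "port 3 (load/store queue)": ["load unit (one load uop/cycle)"],
    "port 4 (load/store queue)": ["store unit (one store uop/cycle)"],
}
for port, units in ports.items():
    print(port, "->", ", ".join(units))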

When we say that the P4 has a 20-stage pipeline, which pipeline are we talking about? The longest one? The sum of all of them?
The "length" of an architecture's pipeline is generally defined as the number of stages for integer instructions, from instruction fetch to retire. So integer instructions generally go through 20 stages in the P4, assuming a single cycle for the execution stage.

Arstechnica has some good introductory articles on the microarchitectures of the P4, Athlon, G4, and others.
 

imgod2u

Senior member
Sep 16, 2000
Isn't there a fundamental limitation to how many x86 instructions can be fetched every clock cycle? Also, if the 20-stage pipeline is the length of the integer pipeline, what about the FP/SSE/SSE2 operations? Also, if a hit is found in the trace cache, could micro-ops still be fetched from the trace cache and from the decoding unit at the same time? (Meaning, could it issue more than 3 micro-ops per clock?)
 

Sohcan

Platinum Member
Oct 10, 1999
Isn't there a fundamental limitation to how many x86 instructions can be fetched every clock cycle?
There are a number of factors that hurt parallelism. The variable instruction length and large number of prefixes make decoding complex. In addition, the two-operand format and 8 architectural registers force a larger reliance on loads/stores and potentially introduce more data and memory hazards.

Also, if the 20-stage pipeline is the length of the integer pipeline, what about the FP/SSE/SSE2 operations?
FP pipelines are generally a few stages longer; I don't recall any numbers off-hand.

Also, if a hit is found in the trace cache, could micro-ops still be fetched from the trace cache and from the decoding unit at the same time? (Meaning, could it issue more than 3 micro-ops per clock?)
Nope, AFAIK the P4 only fetches uops from the trace cache.
 

imgod2u

Senior member
Sep 16, 2000
So how many uops can the Athlon fetch from its decoding unit? What about the Motorola G4? And if the P4 can only fetch 3 uops per clock, why so many ALU units?
 

Sohcan

Platinum Member
Oct 10, 1999
And if the P4 can only fetch 3 uops per clock, why so many ALU units?
Well, there are two issues here (no pun intended): the number of execution units, and the issue width. The reason there are typically more execution units than the fetch width in a superscalar design is that execution units are usually not symmetric; they have specific functions, ie integer execution, integer address gen & load/store, floating-point. So if you have a 3-way fetch core, ideally you want three of each type so that if the possibility exists of issuing three integer-execution instructions in a cycle, you won't be starved for execution resources. The probable reason for the P4's two double-speed ALUs is that they can sustain the same throughput as four normal ALUs, but with less die area and power consumption (the same is true vs. three normal ALUs). So even though the P4 can sustain 33% more integer uops/cycle than the Athlon, this bursty behavior may only come into play a small percentage of the time. The other distinct advantage of the double-speed ALUs is that they can issue two read-after-write-dependent instructions in the same cycle (one at the beginning of the cycle, one half-way through the cycle), something no number of normal ALUs can do (again, this only comes into play a small percentage of the time).
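
The throughput arithmetic behind that (using only the numbers already mentioned): a double-speed ALU completes two uops per core cycle, so two of them match four normal ALUs, and the half-cycle latency is what lets a dependent pair finish inside a single cycle.

# Double-speed ALU throughput (arithmetic from the figures above)
fast_alus, ops_per_fast = 2, 2       # each fast ALU runs at 2x clock
print(fast_alus * ops_per_fast)      # 4 uops/cycle = four normal ALUs

# A dependent pair on one fast ALU within a single core cycle:
#   t = 0.0: R1 = R2 + R3           (first half-cycle)
#   t = 0.5: R4 = R1 + R5           (second half-cycle, consumes R1)
# Normal ALUs, however many, could not start the second op before t = 1.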

The reason a superscalar core may issue more instructions/cycle than it can fetch (ie, 9 for the Athlon and 6 for the P4) is to allow for bursty behavior. Even though the sustained throughput may be significantly less, it might be advantageous to be able to briefly issue, for example, 9 instructions/cycle if the reorder buffers have filled due to a memory stall. If you already have, for example, 9 execution units, it probably costs little in terms of die area and critical path delay to allow for the 9 instruction issue even though it may only add a few percent in overall performance.

So how many uops can the Athlon fetch from its decoding unit?
The Athlon has three parallel decoders, so it can probably decode 3 to 6 uops/cycle (longer uop decodes use the microcode controller rather than the hardware decoders). Though this only represents burst behavior: besides the hazards that prevent a sustained throughput of 3 x86 IPC, the main memory bandwidth required would be well over 5 GB/sec at 2 GHz, assuming likely cache sizes.

What about the Motorola G4?
The G4+ can fetch (and issue, I think) 3 instructions/cycle. Some documentation says four, but that includes one branch instruction/cycle....branches are generally handled in the front end, though, so I don't consider them part of the execution core. IIRC, the G4+ has three normal IEUs (for adds, subs, and logical instructions), one slow IEU (for multiplies, divides, etc), one integer load/store unit, one FPU (a bit weak), and four vector units for Altivec (normal integer, complex integer, FP, and permute).