Also, since 1 operation doesn't necessarily mean 1 instruction, and an instruction may take as many as 3 FP micro-ops (the max on an Athlon), would that mean that if 1 instruction at the end of "one" of the pipelines took up all 3 FPUs during a clock, all the others would have to wait?
Potentially; issuing of uops depends on the number of issue ports and execution units. But keep in mind that the x86 -> uop decoding often decouples the execution...the reason that, on average, an x86 instruction decodes to 1 to 2 uops is that x86 allows both registers and memory addresses to be specified as operands in an int or FP arithmetic instruction (whereas classic RISC allows only registers). When a simple arithmetic instruction uses both a register and a memory address, it gets decoded into two uops: an arithmetic uop that uses only renaming registers, and a load/store uop that moves the data to or from the specified renaming register. Once, for example, the load uop completes, the corresponding arithmetic uop can be issued. In rarer cases, legacy x86 instructions may decode into a longer series of arithmetic uops, such as those for the transcendental functions (sine, cosine, tangent, etc.). Still, few x86 instructions decode into more than 4 uops, and the average is around 1.5.
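To make the register+memory case concrete, here's a toy sketch (not a real decoder; the instruction tuples and `tmp0` renaming-register name are made up for illustration) of cracking one such instruction into a load uop plus a register-only arithmetic uop:

```python
# Hypothetical sketch: cracking a simplified x86 instruction into uops.
def decode(insn):
    """Return the list of uops for one (op, dst, src) instruction."""
    op, dst, src = insn
    if src.startswith("["):             # memory operand -> needs a load uop
        tmp = "tmp0"                    # stands in for a renaming register
        return [("load", tmp, src),     # load uop fills the renaming register
                (op, dst, tmp)]         # arithmetic uop uses registers only
    return [insn]                       # register-register: a single uop

# 'add eax, [ebx]' becomes a load uop plus a register-only add uop:
print(decode(("add", "eax", "[ebx]")))
# -> [('load', 'tmp0', '[ebx]'), ('add', 'eax', 'tmp0')]
```

The arithmetic uop can only issue once the load uop it depends on has completed, which is the decoupling described above.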
Are these true 'pipelines' like you guys are discussing here, or are these somehow a watered-down "layman's pipeline"?
More like the latter....there aren't necessarily nine independent pipelines in the Athlon. It's best to think of an OOOE superscalar MPU as having two decoupled parts: the front end and the back end. The front end fetches, decodes, and (under some definitions) performs register renaming and instruction queuing; it typically supports a width of 3 to 4 instructions/cycle. The back end schedules and issues the instructions to multiple execution units (which may be thought of as independent pipelines), then retires them in order.
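A very rough toy model of that decoupling (all widths and names are assumptions, not any real MPU) might look like this: the front end fills a queue at one rate while the back end drains it into its execution units.

```python
# Toy model of a decoupled front end and back end (assumed widths).
from collections import deque

FETCH_WIDTH = 3    # front end: instructions fetched/decoded/renamed per cycle
EXEC_UNITS = 3     # back end: instructions issued (and here, retired) per cycle

def run(program):
    queue, retired, cycle, pc = deque(), [], 0, 0
    while pc < len(program) or queue:
        cycle += 1
        # Front end: fetch/decode/rename up to FETCH_WIDTH instructions.
        for _ in range(FETCH_WIDTH):
            if pc < len(program):
                queue.append(program[pc]); pc += 1
        # Back end: issue up to EXEC_UNITS queued instructions, retire in order.
        for _ in range(EXEC_UNITS):
            if queue:
                retired.append(queue.popleft())
    return cycle, retired

cycles, done = run(["i%d" % n for n in range(9)])
print(cycles, done)   # 9 instructions at 3/cycle -> 3 cycles
```

This ignores dependences and latencies entirely; the point is only that the queue lets the two halves run at their own rates.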
Take, for example, the P4 (here's a simple block diagram). In the front end, the P4 can fetch up to 3 uops/cycle from its trace cache (actually 6 uops every other cycle); register renaming is then performed on all uops, and they are placed into one of two queues: the load/store queue or the int/FP queue. Each queue has two scheduling/issuing ports that issue uops to one of seven execution units: two "double-speed" ALUs for common int operations, one "normal-speed" ALU for shifts and rotates, a load unit, a store unit, an FP-move unit for loading/storing FP data, and an FP-execute unit. In the int/FP queue, port 0 can issue a single uop in the first half of a clock cycle to either the first fast ALU or the FP-move unit; in the second half of the cycle, it can issue another uop to the first fast ALU. Port 1 can issue a uop in the first half of a cycle to the second fast ALU, the "slow" ALU, or the FP-execute unit; likewise, in the second half of the cycle, it can issue another uop to the second fast ALU. Ports 3 and 4 for the load/store queue can issue one load and one store uop each cycle, respectively. After execution, results are written back at the rate of (IIRC) 3 uops/cycle.
When we say that the P4 has a 20-stage pipeline, which pipeline are we talking about? The longest one? The sum of all of them?
The "length" of an architecture's pipeline is generally defined as the number of stages for integer instructions, from instruction fetch to retire. So integer instructions generally go through 20 stages in the P4, assuming a single cycle for the execution stage.
Ars Technica has some good introductory articles on the microarchitectures of the P4, Athlon, G4, and others.