The goal of most modern microprocessors is to extract parallelism from the code by executing multiple instructions at a time. Superscalar architectures, the kind most common among current microprocessors, fetch, issue, and execute multiple independent instructions per cycle. You might conceptually think of a superscalar core as having multiple pipelines, but that is a bit of a misnomer.
A modern out-of-order superscalar MPU fetches and decodes multiple instructions per clock cycle, then stores them in a reorder buffer. The reorder "window" is determined by the size of that buffer, which may be split into independent buffers for integer and FP instructions. The instructions are then dynamically scheduled and issued to the execution units out of order, and the issue width may exceed the fetch width. For example, the Athlon can fetch and decode 3 x86 instructions/cycle (on the P3, P4, and Athlon, x86 instructions are decoded into smaller "micro-ops", averaging 1 - 2 micro-ops per x86 instruction) yet issue 9 micro-ops out of its reorder buffers (to 3 integer, 3 address-generation, and 3 FP units). After execution, the results are retired in order to preserve the context of the code.
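A toy model may make the in-order retirement step concrete. The sketch below is purely illustrative (the names and the two-field encoding are assumptions, not how any real MPU represents its buffer): younger instructions may finish executing early, but nothing retires past an older instruction that is still in flight.

```python
from collections import deque

def retire_in_order(rob):
    """Retire completed instructions from the head of the reorder buffer.

    rob: deque of (name, done) pairs in program order. Retirement stops at
    the first instruction that has not finished executing, even if younger
    instructions behind it are already done.
    """
    retired = []
    while rob and rob[0][1]:          # head of the buffer is complete
        name, _ = rob.popleft()
        retired.append(name)
    return retired

# Three instructions in program order; i1 and i2 finished out of order,
# but nothing retires past the still-executing i0.
rob = deque([("i0", False), ("i1", True), ("i2", True)])
print(retire_in_order(rob))  # -> []
rob[0] = ("i0", True)        # i0 finally completes
print(retire_in_order(rob))  # -> ['i0', 'i1', 'i2']
```

Retiring strictly from the head is what preserves precise state: if i0 turns out to be a faulting instruction or a mispredicted branch, i1 and i2 can still be discarded.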
There are three main hazards that prevent instruction throughput from reaching that maximum: data, memory, and control/branch hazards. Data hazards involve the use of data before it is ready, and are divided into three categories: read-after-write, write-after-read, and write-after-write. For example, if (in pseudo-code, with RX representing register X) the following two instructions are presented:
R1 = R2 + R3
R4 = R5 + R1
...the use of register R1 presents a read-after-write conflict. The latter two data hazards are solved by register renaming; although an instruction set may only specify 32 (or 8, in the case of x86) architectural registers, many more physical registers may exist. Every left-hand-side use of a register aliases the architectural register onto a fresh physical register, and every right-hand-side use reads through the current architectural -> physical mapping. This, unfortunately, does not resolve read-after-write dependencies, so such instructions must be issued sequentially rather than simultaneously or out of order (not strictly true of the P4's "double-pumped" ALUs, which can effectively issue a read-after-write dependent pair of instructions in the same cycle). Note also that the number of physical renaming registers limits the reorder window.
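The renaming scheme can be sketched in a few lines. The function below is a simplified illustration (the tuple encoding is an assumption, and a real renamer also tracks a free list and recycles physical registers at retirement): each destination gets a fresh physical register, removing write-after-read and write-after-write hazards, while each read goes through the current mapping so true read-after-write dependencies survive.

```python
def rename(instructions, num_arch_regs=8):
    """Rename architectural registers to fresh physical registers.

    instructions: list of (dest, src1, src2) architectural register numbers,
    in program order. Returns the same instructions rewritten over physical
    registers. Writes never reuse a live physical register, so WAR and WAW
    hazards disappear; reads use the latest mapping, so RAW remains.
    """
    mapping = {r: r for r in range(num_arch_regs)}  # identity map to start
    next_phys = num_arch_regs
    renamed = []
    for dest, s1, s2 in instructions:
        p1, p2 = mapping[s1], mapping[s2]   # reads use the current mapping
        mapping[dest] = next_phys           # write aliases a fresh register
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed

# R1 = R2 + R3 ; R4 = R5 + R1 ; R2 = R6 + R7
prog = [(1, 2, 3), (4, 5, 1), (2, 6, 7)]
print(rename(prog))  # -> [(8, 2, 3), (9, 5, 8), (10, 6, 7)]
```

Note the second instruction still reads physical register 8 (the RAW dependence on R1 is preserved), while the third instruction's write to R2 lands in a new register and can no longer conflict with the first instruction's read of R2.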
The second hazard, memory, involves the latency of executing a load or store instruction. If an arithmetic instruction uses a register after data in memory is loaded into that register, the arithmetic instruction must wait until the load completes. Fast caches reduce the latency, but a cache miss may mean the arithmetic instruction stalls for 100 clock cycles or more. The out-of-order execution afforded by the reorder window ameliorates this effect, but it remains a major limiting factor in performance.
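The usual way to quantify this is average memory access time (AMAT): the hit latency plus the amortized cost of misses. The latencies below are illustrative, not taken from any particular MPU.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit cost plus amortized miss cost."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 3-cycle L1 hit, 1% miss rate, 100-cycle miss penalty.
print(amat(3, 0.01, 100))  # -> 4.0
```

Even a 1% miss rate adds a full cycle to the average access, which is why the reorder window's ability to keep executing independent instructions around a miss matters so much.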
The last hazard, control/branch, is probably one you're familiar with: a mispredicted branch requires the pipeline(s) to be flushed and refilled, increasing execution latency.
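A rough way to see the cost: the extra cycles per instruction lost to branches are the product of branch frequency, misprediction rate, and flush penalty. The numbers below are illustrative assumptions, not measurements.

```python
def branch_cpi_penalty(branch_freq, mispredict_rate, flush_penalty):
    """Extra cycles per instruction lost to mispredicted branches."""
    return branch_freq * mispredict_rate * flush_penalty

# Illustrative: 20% of instructions are branches, 5% of those are
# mispredicted, and a flush/refill costs 15 cycles.
penalty = branch_cpi_penalty(0.20, 0.05, 15)
print(round(penalty, 3))  # -> 0.15
```

An extra 0.15 cycles per instruction is a large tax on a core trying to sustain several instructions per cycle, and deeper pipelines (a larger flush penalty) make it worse.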
In addition, available memory bandwidth is a factor. Assuming that 1/2 of the instructions are loads/stores, the entire cache hierarchy has a 99% hit rate (with a 64-byte block size), and instructions and data are 4 bytes each, a 2GHz 3-way-fetch out-of-order superscalar core needs around 30 GB/sec of cache bandwidth and around 5 GB/sec of main memory bandwidth to sustain the maximum IPC.
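That arithmetic can be checked directly. Under one reasonable reading of those assumptions (every fetched instruction is read from cache, every miss fills a full block), the cache figure comes out around 36 GB/sec rather than 30, with the memory figure landing near the quoted 5 GB/sec:

```python
clock_hz    = 2e9    # 2 GHz
fetch_width = 3      # instructions fetched per cycle
instr_bytes = 4      # bytes per instruction fetched
data_bytes  = 4      # bytes per load/store
ls_fraction = 0.5    # half the instructions are loads/stores
miss_rate   = 0.01   # 99% hit rate across the cache hierarchy
block_bytes = 64     # bytes transferred from memory per miss

instrs_per_sec = clock_hz * fetch_width               # 6e9 instructions/sec
fetch_bw = instrs_per_sec * instr_bytes               # instruction fetch traffic
data_bw  = instrs_per_sec * ls_fraction * data_bytes  # load/store traffic
cache_bw = fetch_bw + data_bw

accesses_per_sec = instrs_per_sec * (1 + ls_fraction)  # fetches + data refs
mem_bw = accesses_per_sec * miss_rate * block_bytes    # each miss moves a block

print(cache_bw / 1e9)  # -> 36.0 (GB/sec of cache bandwidth)
print(mem_bw / 1e9)    # -> 5.76 (GB/sec of main memory bandwidth)
```

The exact totals depend on what you count (prefetches, writebacks, and store traffic to memory are all ignored here), but the order of magnitude is the point: the cache hierarchy must move tens of gigabytes per second to keep a 3-wide core fed.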
Thus even though an MPU like the Athlon can fetch 3 x86 instructions/cycle and issue 9 micro-ops/cycle, average throughput is closer to 1 - 1.2 x86 instructions/cycle. Aside from the cost-driven cuts that desktop MPUs must make and high-end MPUs can avoid, x86's parallelism is limited by its 8 architectural registers, its two-operand instruction format, and its many flag-register dependencies.