Back in the days before pipelining, a clock cycle was the entire progression of an instruction through the four main stages:
Fetch -> Decode -> Execute -> Store.
There was only ever one instruction being worked on at any one time, meaning that in the case of very early x86 processors, only a quarter of the processor's resources were in use at any given moment.
Because the processor could only work on one instruction at a time, increasing instruction latency by adding stages to the processor core would have had a direct impact on performance, since performance in non-pipelined processors was determined entirely by the time it took to execute each instruction.
The longer an instruction took to execute, the longer the delay before the next one could begin.
Pipelining completely changed all of this.
Now, instruction latency does not have a direct impact on performance, and it can be said that there is no direct relationship between IPC and pipeline depth. With pipelining, an instruction doesn't have to wait for the previous instruction to clear the pipeline; instructions enter the pipeline one after another, which allows a far more efficient use of the execution resources.
Previously, with non-pipelined processors, a single clock cycle was the progression of an instruction through all the processor's execution stages, so an instruction was completed every clock cycle.
With pipelined processors, an instruction is completed every clock pulse once the pipeline is full, which is why a clock pulse is now referred to as a clock cycle.
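To put some illustrative numbers on this, here is a minimal sketch - hypothetical figures, assuming the classic four-stage design and one instruction issued per clock - comparing the cycle counts of the two approaches:

```python
# Cycles needed to complete N instructions, ignoring stalls.
# Assumes four stages (fetch, decode, execute, store) and a one-wide pipeline.
N = 1000        # instructions to run (arbitrary figure)
STAGES = 4

non_pipelined = N * STAGES        # each instruction passes through all stages alone
pipelined = STAGES + (N - 1)      # first result after 4 cycles, then one per cycle

print(non_pipelined)   # 4000 cycles
print(pipelined)       # 1003 cycles -> throughput approaches 1 instruction per clock
```

Notice that making the pipeline deeper only changes the small fixed cost at the start; the steady-state rate of one instruction per clock is unaffected.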
As I've stated above, one of the biggest misconceptions is that pipeline depth directly affects a processor's IPC - the number of instructions it can execute per clock.
It doesn't. Looked at from a theoretical perspective, increasing the number of pipeline stages has absolutely no effect on the maximum throughput a processor is capable of, thanks to pipelining.
Doubling the number of pipeline stages does not halve IPC, but it does increase instruction latency - and with it, the time it takes to flush and refill the pipeline. If there are lots of pipeline refills (usually the result of mispredicted branches), performance will suffer. However, improvements to the branch predictor can minimise this impact.
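As a rough sketch of that trade-off, the toy model below shows how refills erode throughput and how a better predictor claws it back. Every figure here - branch frequency, predictor accuracy, the assumption that a mispredict costs one full refill - is an illustrative assumption, not measured data:

```python
# Toy model: ideal IPC of 1, each mispredicted branch costs a full pipeline refill.
# All numbers are illustrative assumptions.
def effective_ipc(pipeline_depth, branch_rate=0.20, predictor_accuracy=0.95):
    mispredicts_per_insn = branch_rate * (1 - predictor_accuracy)
    refill_cycles_per_insn = mispredicts_per_insn * pipeline_depth
    return 1.0 / (1.0 + refill_cycles_per_insn)

print(round(effective_ipc(20), 2))                           # 0.83
print(round(effective_ipc(31), 2))                           # 0.76 - deeper pipe, same predictor
print(round(effective_ipc(31, predictor_accuracy=0.97), 2))  # 0.84 - better predictor recovers it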
If we take a look at Prescott vs. Northwood benchmarks, we see that the two are always within a few percent of each other, with Prescott occasionally ahead, despite it having a 55% deeper pipeline (and a 55% higher instruction latency at the same clock speed).
What should be mentioned here is that the whole point of implementing more pipeline stages is to allow each stage to be executed more quickly. If we can halve the time it takes an instruction to complete each pipeline stage while doubling the number of stages, we can double the clock speed, which doubles our maximum theoretical throughput while keeping instruction latency exactly the same (!).
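A quick worked example with made-up numbers - a 10-stage design at 2 GHz versus a 20-stage design with half-length stages at 4 GHz - shows why:

```python
# Instruction latency = number of stages * cycle time.
# Figures are invented purely to illustrate the trade-off described above.
def latency_ns(stages, clock_ghz):
    cycle_time_ns = 1.0 / clock_ghz
    return stages * cycle_time_ns

print(latency_ns(10, 2.0))   # 5.0 ns latency, peak of 2 billion instructions/s
print(latency_ns(20, 4.0))   # 5.0 ns latency too, but a peak of 4 billion instructions/s
```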
What affects IPC directly is execution width - the number of instructions that can be executed in parallel.
If we double the number of execution units, we double the theoretical IPC maximum.
However, adding execution units is no simple matter. It requires extra logic to find instructions the execution core can execute in parallel, since x86 code itself is seldom written in a way that makes life easy for multiple execution units.
This is especially true of integer code: some instructions need the results of previous instructions and cannot be executed in parallel.
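The snippet below - plain Python rather than real x86, purely for illustration - shows the difference between a dependency chain and independent work:

```python
# A serial dependency chain: each line needs the result of the one before it,
# so no matter how many execution units there are, these run one at a time.
a = 1
b = a + 2     # waits on a
c = b * 3     # waits on b
d = c - 4     # waits on c -> effective IPC of 1 through this chain

# Four independent operations: nothing here depends on anything else,
# so a four-wide core could, in principle, execute them all in one clock.
w = 1 + 2
x = 3 * 4
y = 5 - 6
z = 7 + 8
```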
Intel and AMD took two different approaches to processor design.
Intel took the serial approach (a deeper pipeline), AMD the parallel approach (a wider execution core). The serial method allows higher clock speeds, while the parallel method allows higher IPC.
Thus, the clock speeds of these two architectures cannot be meaningfully compared.