I'm gonna get pretty basic here.
Computer architecture courses usually teach that a basic processor has 5 stages (this is a simplified model):
Code:
1) Instruction Fetch
2) Instruction Decode
3) Execution
4) Memory access
5) Register Write-back.
Let's assume you built a processor that somehow completed all of these steps in 2 cycles. I'm not fully sure how you would do it, but that's ok.
Here's the hidden piece of information that might not be immediately obvious: every one of those stages has a certain latency. This latency comes about because electric charge cannot instantly rush in and fill the wires and transistors. There is a delay because of capacitance, inductance, and plain old resistance.
So if you were somehow able to compress those 5 stages into 2 stages, you still haven't gotten rid of the latencies. Let's pretend these were the latencies of the 5-stage design:
Code:
1) Inst Fet 1ns
2) Inst Dec 1ns
3) Exec 1ns
4) Mem 1ns
5) WriteBack 1ns
You still have the same latencies to deal with, but now they are spread over 2 cycles. Here's your new architecture:
Code:
1) Cycle1 2.5ns
2) Cycle2 2.5ns
You can't simply get rid of these latencies without creating some radical new design or using some new technology.
However, let's also assume you saved some latency because you don't need those pesky registers in between stages:
Code:
1) Cycle1 2ns
2) Cycle2 2ns
Alright. Now you think to yourself, "Great! I can complete a single instruction in 4ns instead of 5!" I will also point out, at this stage, that your processor actually has to run at HALF the frequency of the original architecture to accommodate the new longest stage delay (1ns -> 2ns). Your dream of reducing pipelining while increasing frequency seems sort of ridiculous.
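To make that arithmetic concrete, here is a rough Python sketch using the made-up stage latencies from this post. The function names are just mine for illustration, and it assumes the clock period is set purely by the slowest stage, which ignores plenty of real-world details.

```python
# Stage latencies in nanoseconds, taken from the example in this post.
five_stage = [1, 1, 1, 1, 1]
two_stage = [2, 2]  # after dropping the inter-stage registers

def clock_period_ns(stages):
    # Assumption: the clock only has to be slow enough for the slowest stage.
    return max(stages)

def max_frequency_mhz(stages):
    # 1 / (period in ns) is the frequency in GHz; times 1000 gives MHz.
    return 1000 / clock_period_ns(stages)

def instruction_latency_ns(stages):
    # One instruction spends one full clock period in every stage.
    return len(stages) * clock_period_ns(stages)

for name, stages in [("5-stage", five_stage), ("2-stage", two_stage)]:
    print(name,
          "period:", clock_period_ns(stages), "ns,",
          "max clock:", max_frequency_mhz(stages), "MHz,",
          "latency:", instruction_latency_ns(stages), "ns")
```

The 2-stage design wins on single-instruction latency (4ns vs 5ns) but its maximum clock is half that of the 5-stage design.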
You have gained the ability to complete a single instruction in 4ns, instead of 5ns. What have you lost? Something very important: you cannot pipeline your 2-cycle machine as deeply.
YOU ARE WASTING RESOURCES. Every instruction has to wait for the instruction ahead of it to get through one giant stage, even though, electrically, there are parts of that stage that the instruction ahead is not even using anymore.
In both of these pipelined designs, we should theoretically be able to complete 1 instruction every cycle. However, the 5-stage pipeline will suffer more, as you point out, due to pipeline flushing and other shenanigans. So, let's assume the 5-stage pipeline actually has an IPC of 0.8 for some particular workload with various instructions, while the 2-stage has an IPC of 1.0 for the same workload.
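If you want to see what that waste costs, here is a small sketch of the textbook ideal-pipeline timing. It assumes no stalls or flushes at all, and the instruction count of 1000 is a number I picked purely for illustration.

```python
def total_time_ns(num_instructions, num_stages, period_ns):
    # Ideal pipeline, no stalls: the first instruction needs num_stages
    # cycles, and each later one finishes one cycle after the one before it.
    cycles = num_stages + (num_instructions - 1)
    return cycles * period_ns

n = 1000  # hypothetical instruction count
t5 = total_time_ns(n, num_stages=5, period_ns=1)
t2 = total_time_ns(n, num_stages=2, period_ns=2)
print("5-stage:", t5, "ns")
print("2-stage:", t2, "ns")
```

For a single instruction the 2-stage machine is ahead, but once the pipeline is full, the time per instruction is set by the clock period, not by the end-to-end latency.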
Which do you think is going to be faster?
1) 5-stage processor running at 1 GHz (1ns clock period) with an IPC of 0.8
2) 2-stage processor running at 500 MHz (2ns clock period) with an IPC of 1.0
1 GHz * 0.8 inst/cycle = 0.8 giga-instructions/second = 800 mega-instructions/second
500 MHz * 1.0 inst/cycle = 500 mega-instructions/second
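Here is the same comparison as a few lines of Python, in case you want to plug in your own numbers. The IPC values are the assumed ones from this post, not measurements, and the break-even calculation is my own addition to show how far the 5-stage IPC would have to fall before the two designs tie.

```python
def throughput_mips(freq_mhz, ipc):
    # Millions of instructions per second = clock in MHz * instructions per cycle.
    return freq_mhz * ipc

five_stage_mips = throughput_mips(1000, 0.8)  # 1 GHz, assumed IPC of 0.8
two_stage_mips = throughput_mips(500, 1.0)    # 500 MHz, assumed IPC of 1.0

# IPC at which the 1 GHz 5-stage design would only match the 2-stage design.
break_even_ipc = two_stage_mips / 1000

print("5-stage:", five_stage_mips, "MIPS")
print("2-stage:", two_stage_mips, "MIPS")
print("5-stage break-even IPC:", break_even_ipc)
```

With these numbers, the 5-stage design would have to lose half of its ideal IPC to stalls and flushes before the 2-stage design caught up.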
All I'm saying is that you can't just claim to reduce pipeline depth and get free performance. If that were true, every processor manufacturer would simply reduce their pipeline depth. Obviously it's a balancing act, and you have to choose the right pipeline depth.
Another thing I've completely ignored is power constraints, which, as everyone remembers from the Pentium 4/Prescott days, can make the above analysis even more complex. Typically, power constraints favor lower frequencies, but that still does not give us a simple answer.