Originally posted by: Oreo
CTho9305, so is that a problem already? What would be the result if you had a 100GHz CPU (that did not have heat issues)?
You can't, realistically. The current way to get a signal from one side of the chip to the other is to give it more than one cycle to arrive (you might put flip flops at intervals along a long wire to break the trip into cycle-sized pieces). I believe that's what the Drive stages in the P4 pipeline are for.
At 100GHz, your cycle time is 10 picoseconds, and a fast inverter (the simplest logic gate there is) takes a bit over 10 picoseconds on a modern manufacturing process. You can't do much logic with just inverters, though. On top of that, between every pipeline stage you need flip flops (they store the data), and a flip flop costs you around 2 NAND gate delays (NANDs are slower than inverters - about half the speed). So the flip flop overhead alone is already several times longer than the whole cycle.
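To put rough numbers on that, here's a quick back-of-the-envelope sketch in Python. The delay constants are my assumptions, rounded from the figures above (10 ps inverter, NAND at twice that, flip flop at 2 NAND delays), and the function names are made up for illustration. Real numbers depend on the process and the flip flop design.

```python
# Rough timing budget per pipeline stage. All delays are assumed round numbers.
INVERTER_PS = 10.0              # assumed: "a bit over 10 ps" rounded down
NAND_PS = 2 * INVERTER_PS       # NAND is about half the speed of an inverter
FLOP_OVERHEAD_PS = 2 * NAND_PS  # a flip flop costs around 2 NAND delays


def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def nand_levels_per_stage(freq_ghz):
    """NAND delays left for real logic after paying the flip flop overhead.

    A negative result means the flip flop alone doesn't fit in the cycle.
    """
    budget_ps = cycle_time_ps(freq_ghz) - FLOP_OVERHEAD_PS
    return budget_ps / NAND_PS


if __name__ == "__main__":
    for ghz in (3, 5, 100):
        print(ghz, "GHz:", cycle_time_ps(ghz), "ps cycle,",
              nand_levels_per_stage(ghz), "NAND levels of logic")
```

With these assumed delays, the 100GHz case comes out negative: there's no time left for any logic at all once the flip flop is paid for.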
If you wanted, you could do a CPU where every pipeline stage had one NAND gate, but there are a LOT of reasons this is a bad idea:
1. Flip flops are big - you'd have a giant flip flop, a tiny gate, then another giant flip flop
2. The pipeline would be impossible to fill. Given that adding two 32-bit numbers takes over 8 NAND delays, code like this presents a problem:
ADD a, b, c ;;;a = b+c
ADD d, a, a ;;;d = a+a
You couldn't start the second operation until at least 8 cycles after the first one starts. Out-of-order execution can mitigate some of these dependencies, but there are limits (and in a real CPU, there's a lot more to executing an A+B operation than just an addition).
3. Branches would kill performance. Whenever the CPU hits an "if" instruction (a branch), it has to guess the result (~95% accuracy is near the current max, I think). About 1 in 5 instructions is a branch, so the longer your pipeline, the more branches are "in flight" at once, and any of them could be wrong. If you have, say, 5 branches in flight (maybe a 25-stage pipeline, between Northwood and Prescott?), with 95% accuracy on each, there's only a 77% chance (0.95^5) that you predicted all of them correctly. You'd be spending a LOT of time executing the wrong path of a branch - work which gets thrown away.
4. The clock signal takes time to get routed across the chip, and it doesn't arrive everywhere at exactly the same moment (clock skew). I think in most designs, you sacrifice more than one gate delay to account for skew - basically the flip flop at the start of a pipeline stage could fire as late as time 0+skew and the flip flop at the end of the stage could fire as early as time Tcycle-skew, so you can only use Tcycle-2*skew if you want the design to be robust.
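The arithmetic in points 3 and 4 is easy to play with. Here's a small Python sketch. The function names and the example skew value are mine, not from any real design, and it assumes branch predictions are independent of each other, which real code doesn't guarantee.

```python
# Back-of-the-envelope numbers for branch prediction and clock skew.

def branches_in_flight(pipeline_stages, branch_fraction=0.2):
    """Expected branches in the pipeline, assuming ~1 in 5 instructions
    is a branch and one instruction per stage."""
    return pipeline_stages * branch_fraction


def chance_all_correct(accuracy, num_branches):
    """Chance that every in-flight branch was predicted correctly,
    assuming independent predictions."""
    return accuracy ** num_branches


def usable_cycle_ps(cycle_ps, skew_ps):
    """Time left for logic when both ends of a stage can be off by skew."""
    return cycle_ps - 2 * skew_ps


if __name__ == "__main__":
    n = branches_in_flight(25)           # 25-stage pipeline
    print(n, "branches in flight")
    print(round(chance_all_correct(0.95, n), 2), "chance all are correct")
    # 20 ps of skew is a made-up value, just to show the effect
    print(usable_cycle_ps(333.0, 20.0), "ps usable out of 333 ps")
```

Try bumping the pipeline to 100 stages: with the same 95% accuracy, the chance of having every in-flight branch right drops to roughly a third.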