In order to reduce resistance and current leakage (two physical factors that cause problems when you try to increase a chip's clock frequency), CPU manufacturers are shrinking transistor sizes.
Actually, smaller transistors (as in 65nm vs. 90nm vs. 130nm) leak
more for a given width (though hopefully, when you use a smaller process, you can also shrink device widths).
Overclocking means a higher working frequency, which means more operations per second. During each operation, current (electrons) flows through the gates and metal interconnects, which in turn means more heat generated per second.
To elaborate on that... ideal CMOS circuits (the circuit style used for most of a CPU) only dissipate power when they're switching. If you look at the right box in
this image, you can see a crude side view of a transistor. On each transistor's input (the gate) there is effectively a capacitor. In order to switch the transistor from on to off (or back on), whatever is driving the input of this transistor needs to either charge or discharge that capacitor. This is one source of power. Power from switching capacitances like this follows the equation P = C*V^2*F, or capacitance switched times voltage squared times frequency. As you can see, this depends directly on the clock frequency, so when you overclock by 10%, the frequency component goes up by 10%. Of course, to get the chip to work, you might have to raise the voltage by 10% as well, which adds another factor of 1.1^2 = 1.21... so the total is 1.21*1.1 = 1.331, or about 33% more power than when it's not overclocked.
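The arithmetic above can be sketched in a few lines of Python, using the P = C*V^2*F model directly (the 1.0 baselines are just normalized units, not real chip values):

```python
def dynamic_power(c_switched, voltage, frequency):
    """Dynamic (switching) power of a CMOS circuit: P = C * V^2 * F."""
    return c_switched * voltage**2 * frequency

# Normalize the stock chip to 1.0 on every axis.
stock = dynamic_power(1.0, 1.0, 1.0)

# Overclock by 10%, and raise the voltage 10% to keep the chip stable.
overclocked = dynamic_power(1.0, 1.1, 1.1)

print(overclocked / stock)  # 1.1^2 * 1.1 = 1.331, i.e. ~33% more power
```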
So, why do you have to run at a higher voltage when overclocking? Well, a certain number of gates need to switch within the cycle time of the processor. You could model the gates like resistors and capacitors as in the bottom box of
this image. Recall that V=IR, or alternately, I=V/R. The current a transistor can drive is related to the voltage over its resistance. The capacitances the transistor drives (from other gates' inputs) need to be switched within a certain amount of time, and if you overclock you leave less time for this to happen. To get the capacitances all charged in time, you need to increase the current, and you do this by increasing the voltage. (Note that V=IR is a really poor model for a transistor, but it conveys the point).
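As noted, V=IR is a crude model. A slightly less crude rule of thumb (an assumption on my part, the "alpha-power law", not something from the discussion above) is that gate delay goes roughly as C*Vdd/(Vdd - Vth)^alpha, so raising the supply voltage increases drive current faster than it increases the swing, and the gate switches sooner. The constants below (v_th, alpha, k) are illustrative, not measured values:

```python
def gate_delay(c_load, vdd, v_th=0.4, alpha=1.3, k=1.0):
    """Alpha-power-law delay estimate: delay ~ k * C * Vdd / (Vdd - Vth)^alpha.
    Higher Vdd means more drive current, hence faster switching."""
    return k * c_load * vdd / (vdd - v_th) ** alpha

# Raising the supply voltage shrinks the delay, which is why
# overclocked chips usually need a voltage bump to stay stable.
d_low = gate_delay(1.0, 1.2)
d_high = gate_delay(1.0, 1.4)
print(d_high < d_low)  # True: more voltage, less delay per gate
```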
Now, if you have an inverter (the simplest gate... it's just easiest to describe, but this all applies to other gates), there are two transistors - a pmos device which can connect the output to the high voltage source (vdd), and an nmos device that can connect the output to the low voltage source (ground).
This image shows an inverter at the bottom and the currents at the top. Because there is some capacitance at the gates of the two transistors, their input can't be switched instantaneously; instead it swings over a few picoseconds from high to low (or low to high). Note in the chart that while the voltage is not all the way at high or low, both the nmos and the pmos drive some current, so there is a direct path from vdd to ground. This "short-circuit" current (also called crowbar current) doesn't depend on frequency, but the power is dissipated every time a gate switches, and that
does depend on frequency.
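To make that distinction concrete, here's a tiny sketch (my own illustration, with made-up numbers): the crowbar energy lost per transition is fixed, but the resulting power scales with how often transitions happen, i.e. with clock frequency.

```python
def short_circuit_power(e_per_transition, transitions_per_second):
    """Crowbar (short-circuit) power: a fixed chunk of energy is burned on
    every input swing, so power = energy per transition * transition rate."""
    return e_per_transition * transitions_per_second

# Hypothetical numbers: 1 pJ lost per transition, with a gate
# switching at 1 GHz vs. the same gate overclocked to 2 GHz.
p_1ghz = short_circuit_power(1e-12, 1e9)
p_2ghz = short_circuit_power(1e-12, 2e9)
print(p_2ghz / p_1ghz)  # 2.0: double the frequency, double the power
```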
A huge part of the power dissipated in modern processors comes from the chip's clock... about a third of the power actually results from just switching the clock each cycle. The reason the clock requires so much power is that it's a signal that goes pretty much everywhere across the chip (so there are long wires, which have high capacitance), and it has to switch twice every cycle. You can generate fast clocks that require less power, but the tradeoffs are beyond the scope of this explanation.
Non-ideal (i.e. real) transistors leak, as mentioned above. This adds a static component to power, meaning one that doesn't depend on frequency.
Coming back to your question about why modern processors use so much more power than everything else... there are lots of factors at work. For one thing, your desktop PC is a LOT more complicated than the chips in a cell phone - in a single clock cycle, your desktop can do a huge amount of work, while the chip in your phone might be able to process one instruction (or even take multiple cycles for each instruction). Another thing is that the chip in your desktop operates at a very high frequency, whereas a cell phone chip probably runs much slower. A Pentium 4 might run at 3GHz, with the ability to do a peak of something like 2 integer instructions and 2 floating point each cycle (I forget the exact number) while the cell phone might run at 100MHz and take 3 cycles to finish a single instruction. Additionally, there are different types of transistors - there are transistors that can switch very very fast, but they leak a lot, and transistors that switch slowly but don't leak much. If you need high-performance computing (like a desktop CPU), you're going to use the fast transistors, at the expense of power. However, if you're designing a cell phone, you'd use the slow transistors to improve battery life.
If you look back at the overclocking discussion, you can see there was a 33% power increase for a 10% frequency gain. This also works the other way - if you underclock and undervolt, you can save a LOT of power. If your 100W, 1.4V, 3GHz P4 is run at 1.5GHz and you decrease the voltage to 1V, your dynamic power is going to be 0.5 * 0.51 * 100W = about 25 watts (ignoring leakage). Since your cell phone can be slow, they can run it at a low frequency and voltage to save power.
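That underclocking estimate can be checked with the same P = C*V^2*F scaling (the 100W/1.4V/3GHz figures are the ballpark P4 numbers from above, not a datasheet):

```python
def scaled_power(p_base, f_ratio, v_ratio):
    """Scale dynamic power by frequency ratio and voltage ratio squared,
    per P = C * V^2 * F. Leakage is ignored."""
    return p_base * f_ratio * v_ratio**2

# 100W P4 at 3GHz/1.4V, underclocked to 1.5GHz and undervolted to 1.0V:
p = scaled_power(100.0, 1.5 / 3.0, 1.0 / 1.4)
print(round(p, 1))  # 25.5 -> roughly a quarter of the original power
```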
(How come Prescotts generate so much more heat than Winchesters while having less power?)
This question doesn't really make sense. The temperature should be related to the power, so if you have a lower power chip at a higher temperature, there's something different in the cooling.