Why is the cpu not getting hot when idle?

DrCrap

Senior member
Feb 14, 2005
238
0
0
I know for a fact that the CPU is always kept busy, i.e. even when you see "System Idle Process" in Task Manager, that doesn't mean the CPU actually idles; it means the OS is giving it some dummy process to run while there's nothing else to do.
Given that, why is my CPU temp at, say, 29C when idle but 38C under full load, when in fact it's been under full load the whole time?
A possible explanation could be that the CPU clock speed or power consumption is reduced while it's running the idle process, but I'm not sure.
Hope I explained my question well...
 

smack Down

Diamond Member
Sep 10, 2005
4,507
0
0
The dummy process executes the NOP instruction. Power consumption in CMOS increases with how fast the circuits switch; the NOP instruction causes almost no switching, so the temp goes down.
 

itachi

Senior member
Aug 17, 2004
390
0
0
the system idle process is not a dummy process.. it's actually the zero thread for the virtual memory manager. when no processes are in the "run" state, the kernel gives control to the zero thread, which zeroes out pages that were freed so they can be reused. if there are no pages in the free list, it does as smack down described.. it goes into a loop of no-op instructions until the next timer tick.

the main reason your cpu temp differs by so much between idle and full load is predominantly the cache. when you have multiple processes running, you're constantly switching contexts.. swapping out a process and its current state (the virtual memory space, file descriptors, allowed resources, software interrupt vector, signal masks, etc..) for another process.
 

ForumMaster

Diamond Member
Feb 24, 2005
7,792
1
0
itachi is right. for example, there is a program called CpuIdle which i use. it makes task manager show 100 percent cpu usage. what it does is send an HLT machine instruction (opcode F4). This tells the CPU to halt, and thus the amount of heat generated is decreased. You can download this program for a 30-day trial. it rocks! I don't have a top-of-the-line cooling system. my average used to be 40. i started using CpuIdle and now my average is 32-34. try it out. it works better than Windows alone because it also optimizes the chipset. anyway, good luck!
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: itachi
the system idle process is not a dummy process.. it's actually the zero thread for the virtual memory manager. when no processes are in the "run" state, the kernel gives control to the zero thread, which zeroes out pages that were freed so they can be reused.
On what OSes? Can you provide some sources or more info?

the main reason your cpu temp differs by so much between idle and full load is predominantly the cache. when you have multiple processes running, you're constantly switching contexts.. swapping out a process and its current state (the virtual memory space, file descriptors, allowed resources, software interrupt vector, signal masks, etc..) for another process.
What does the cache have to do with context switching? Caches are very low power structures, despite their large size. Why would context switches be a big contributor to power consumption? I'd think that actual work done by the CPU is the main factor - when it's doing computations, lots of logic switches, so the power goes up. Context switches take on the order of microseconds, and process timeslices are on the order of milliseconds, so the vast majority of the time is definitely not spent doing context switches.
 

ForumMaster

Diamond Member
Feb 24, 2005
7,792
1
0
On what OSes? Can you provide some sources or more info?
The System Idle Process has been in existence ever since Windows NT; in other words, Windows 2000 and XP have it. if you go to Task Manager and kill everything, and i mean this literally, including Explorer, the System Idle Process, which you cannot kill, will start taking "up" the CPU time. This is how it does it.
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Originally posted by: smack Down
The dummy process executes the NOP instruction. Power consumption in CMOS increases with how fast the circuits switch; the NOP instruction causes almost no switching, so the temp goes down.

Close, but not quite.

The idle process is executing the HLT instruction, not NOP. HLT makes newer CPUs (anything from the 486SL onward, actually) power down until woken by an external interrupt. That's why they cool down.
 

velis

Senior member
Jul 28, 2005
600
14
81
Regardless of the actual instruction sent to the proc, it's the inactivity of the execution units that makes the proc run cooler. The biggest heat generators in a proc are the ALUs and FPUs, and maybe some other parts too. By issuing a NOP or HLT or whatever, we cause the proc to do no actual work. The only heat it generates this way comes from decoding and "executing" that instruction, which uses only a very small fraction of the processor's transistors. Intensive FP calculations with lots of memory accesses, on the other hand, cause a large portion of the processor to work hard. That's one of the reasons prime95 is considered a good overclocker's testing tool ;)
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
No, it's different. The HLT instruction actually causes an explicit powerdown into the so-called "suspend-on-halt" state, plus an actual standstill: the HLT instruction does not complete until an external interrupt occurs. This is very, very different from a loop of NOPs.
 

smack Down

Diamond Member
Sep 10, 2005
4,507
0
0
Originally posted by: velis
Regardless of the actual instruction sent to the proc, it's the inactivity of the execution units that makes the proc run cooler. The biggest heat generators in a proc are the ALUs and FPUs, and maybe some other parts too. By issuing a NOP or HLT or whatever, we cause the proc to do no actual work. The only heat it generates this way comes from decoding and "executing" that instruction, which uses only a very small fraction of the processor's transistors. Intensive FP calculations with lots of memory accesses, on the other hand, cause a large portion of the processor to work hard. That's one of the reasons prime95 is considered a good overclocker's testing tool ;)

Just to note that the decode stage is one of the hottest parts of a CPU.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Although decode can be a hot stage on some chips, it's not even in the top 5 on the last couple of chips that I've worked on. On ours, it's the registers, the integer execution unit, and the floating-point execution unit that generate the most heat... when they are active.

There are a number of ways that a CPU can generate less heat when idling, but they mostly rely on some trick to not fire the clocks to a particular section of the CPU, or to not allow a portion of the chip to attempt to change state. The key here is that if you can figure out that nothing is going to happen in the next clock cycle, you can send a signal that prevents the clock from even firing. On the designs that I have worked on, this was often done using "clock gating", in which, essentially, the clock is AND'd (or more commonly, NAND'd) with a signal that says, essentially, "do something this cycle". The hardest part of this technique is generating a signal that knows what will be done in the next cycle early enough that you can shove it into an AND gate to prevent the next clock from firing for a particular block or pipeline stage. This has been, in my experience, a fairly hard thing to do within the timing constraints, so it ends up being a bit of a lossy guess (the clock would occasionally fire when it didn't need to, in corner cases), but in general you can eliminate a lot of activity with this method. Even if clock gating specifically isn't used, the general idea of having a signal which "gates" execution of a section of logic is frequently applied.

Why does preventing the clocks from firing result in less heat? On modern CPUs, the clock delivery system can use as much as 70% of the total power dissipation of the CPU. Also, static CMOS logic, in the idealized case, only generates heat when the transistors are switching. Even in the non-ideal world that we live in on a 90nm process, most of the energy lost is due to switching. So if you stop the clock signal from going anywhere, you save the power of the clock itself and you prevent the logic from trying to update itself.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
On P4, the schedulers and execution stacks were the hottest part of the chip. On P-M it seems to be the rename/schedule sections.
 

itachi

Senior member
Aug 17, 2004
390
0
0
Originally posted by: CTho9305
On what OSes? Can you provide some sources or more info?
windows, freebsd, and linux that i know of.
under linux, kswapd is the thread that handles freeing pages.. it's defined in vmscan.c. if a process requests more frames than are available, the system wakes up kswapd and has it focus on a specific zone.
What does the cache have to do with context switching? Caches are very low power structures, despite their large size. Why would context switches be a big contributor to power consumption? I'd think that actual work done by the CPU is the main factor - when it's doing computations, lots of logic switches, so the power goes up. Context switches take on the order of microseconds, and process timeslices are on the order of milliseconds, so the vast majority of the time is definitely not spent doing context switches.
in my defense.. i was half-asleep when i wrote that. but then again, that doesn't make me any less wrong heh.
i always thought the cache was the primary cause of heat.. getting a cache miss and having to swap out a block, and flushing all the caches on context switches. guess not.
The hardest part of this technique is generating a signal that knows what will be done in the next cycle early enough that you can shove it into an AND gate to prevent the next clock from firing for a particular block or pipeline stage. This has been, in my experience, a fairly hard thing to do within the timing constraints, so it ends up being a bit of a lossy guess (the clock would occasionally fire when it didn't need to, in corner cases).
what makes it so difficult? i wouldn't think that a single gate could contribute that much to the delay.. or is it computing the condition?
wait.. when you say "prevent the next clock from firing", are you talking about stalling?
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
It is hard because you have to compute the powerdown signal in time to meet setup before the next rising clock edge, plus skew and jitter, and there is no transparency, because a clock glitch is a failure. In practice these tend to be "phase paths", where you only have half a clock cycle, minus the clock skew, to compute the signal and get it to all of its receivers.

Clock gating can be used to implement a pipeline stall, but it is not really a stall per se: a stall is where valid data is retained in a pipestage for some reason, whereas a clock gate means the data is irrelevant to begin with, so it does not matter whether the clock toggles to capture new data. That is one way to look at it.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
what makes it so difficult? i wouldn't think that a single gate could contribute that much to a delay.. or would it be computing the condition?
wait.. when you say "prevent the next clock from firing", are you talking about stalling?
Generally you want to gate the clock several stages of clock buffering back, because most of the power in the clock system is burned in the last 2-3 stages of buffering. So, to get technical, this means your setup time is much earlier than for a typical latch: the setup to this clock gate is going to be 10-20% earlier than the setup time to a latch (on a typical multi-gigahertz CPU). But to know whether you want a clock in the next cycle, you need to figure out whether you are actually going to do something next cycle... a stall condition would be one input to this, but there are plenty of other things to consider.

For example, I worked on the store buffer in a high-speed cache a couple of chips ago, and I had a 3-stage pipeline in my section of the unit. So to create the clock gating signal, I needed to figure out if the pipeline would advance in the next cycle. Clearly a stall condition would prevent a pipeline advance - so this is one input - but to truly know if the pipeline would advance, I had to know if the store was going to be allowed into the cache, which required that I know whether there was going to be a load to the same spot, which meant that I had to calculate whether a load would happen and what address it would go to. If I had merely needed to calculate this for my own section of the pipeline, life would have been fine - I had plenty of time left... but to make the setup time to the clock gater, I needed it something like 15% earlier than I needed to make it to the latch - and that's where the fun started. When I realized that I had no hope of making an exact calculation - "am I absolutely certain the pipeline won't advance in the next clock?" - I instead started taking factors out of the equation. So I'd do something like "is the load that's pending even remotely close to where I'm going to do a store?" and I'd err in favor of letting the clock fire without knowing precisely what would happen... which wasted power, but allowed me to remove several sections of logic calculating precisely where the load was going and precisely where the store was going.

I remember when I got back the static timing results for the original implementation and saw how badly I was missing timing to the clock gater, I briefly contemplated switching to higher power dynamic logic for the calculation... but then I slapped my forehead. :)

Even just using the stall signals as inputs to the gaters is hard because stalls are also usually long logic chains - at least on the CPU family that I work on - that require inputs from all over the chip and thus have long flight times. And again, to make setup to a clock gate you need to be a fair bit earlier than you would need to be for a typical latch.
 

velis

Senior member
Jul 28, 2005
600
14
81
Originally posted by: Peter
No, it's different. The HLT instruction actually does cause an explicit powerdown into so-called "suspend-on-halt" state, plus an actual standstill - the HLT instruction does not complete until an external interrupt occurs. This is very very different from a loop of NOPs.

But the final result is still inactivity of a very large portion of the CPU, no? More so with HLT than with NOP, but both cause inactivity. Which is what I was trying to say.

Originally posted by: smack Down
Just to note that the decode stage is one of the hottest parts of a CPU.
That's why the temp doesn't drop to case temp ;)

The method that causes the inactivity is irrelevant to the question.
The correct answer to the originally posted question is: inactivity of parts of the CPU, or the entire CPU, depending on the implementation. But it is inactivity. A NOP surely uses fewer transistors to complete than an FDIV. According to Peter, HLT involves even fewer, but I wouldn't really know, since I never studied processor design and implementation ;)
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
I'll say it once again: inactivity does not equal a powerdown state. Just like an idling car engine consumes more fuel than one that is actually off, making the CPU _know_ it's not needed _at_all_ at the moment (by letting it hit a HLT) produces a very different result from just letting it sit idle and figure out by itself that it's not doing anything useful. pm's explanations should have shown you why this is different, and why the former saves more power.

All NT-based Windows flavors use it, Linux always has, and most embedded OSes do too. Everything built on DOS (up to Windows ME) doesn't.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: ForumMaster
On what OSes? Can you provide some sources or more info?
The System Idle Process has been in existence ever since Windows NT; in other words, Windows 2000 and XP have it. if you go to Task Manager and kill everything, and i mean this literally, including Explorer, the System Idle Process, which you cannot kill, will start taking "up" the CPU time. This is how it does it.
I was questioning the zeroing of memory, not the existence of the process, since I'd never heard that it did that, and couldn't find anything by googling.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
I'll say it once again: Inactivity does not equal powerdown state. Just like an idling car engine consumes more fuel than one that is actually off, making the CPU _know_ it's not needed _at_all_ at the moment (by letting it hit a HLT) produces a very different result to just letting it sit idle and figure out by itself that it's not actually doing something useful.
I absolutely agree; telling the CPU it's not needed will result in much lower power than the CPU trying to guess that it's not needed.
 

itachi

Senior member
Aug 17, 2004
390
0
0
ahh ok. i think i understand.. basically, you're not trying to prevent the clock from switching at the pipeline, but rather from switching at the buffers?

if both the load and store instructions go to the same execution core, couldn't you just use the condition that prevents the load from proceeding to push the pipeline through? or am i talking about the same thing..?
also, if you don't mind me asking.. what exactly was being pipelined? the loading/storing (into the buffer), the decoding, etc.., or just the buffer? i can't see where a pipeline would be beneficial in this scenario (not implying that i could even if it were blatantly obvious hahah).. if a load instruction comes right after a store, i'd assume the load would have to wait an extra 3 cycles while the store propagates through the pipeline.. wouldn't that hurt the overall performance of the cache?

on a different subject.. how come pass-transistor logic isn't used more in cpu designs? from what i've read it requires fewer transistors to implement the same logic, and it can run at lower operating voltages.. the only issues i can think of are that it hasn't been used as widely as cmos, and that it's dynamic (that alone would drive me crazy)..
Originally posted by: CTho9305
I was questioning the zeroing of memory, not the existence of the process, since I'd never heard that it did that, and couldn't find anything by googling.
think about the kinds of security holes that would leave in a system.. if the memory isn't zeroed out, a process that was given a page previously owned by another process could read the other's memory by dumping the page after it gets it.
http://msdn.microsoft.com/library/defau...ary/en-us/dngenlib/html/msdn_ntvmm.asp
http://msdn.microsoft.com/library/defau...dllproc/base/scheduling_priorities.asp
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
on a different subject.. how come pass-transistor logic isn't used more for cpu designs? from what i've read it requires less transistors to implement the same logic, and it can run on lower operating voltages.. the only issues that i can think of is that it hasn't been used as widely as cmos, and that it's dynamic (that alone would drive me crazy)..
Lots of logic styles are used, but the majority is done in CMOS, probably because it's so much more robust than everything else. It gives you rail-to-rail output, and can handle relatively large amounts of noise on the inputs. The allowable noise margins for other logic styles I've seen are significantly lower than for CMOS. Also, because of variability and leakage, various circuit styles that used to work really well work less well now (for ordinary dynamic domino logic, variability and high leakage reduce the advantage due to the extra constraints they add).

If you consider a hypothetical domino logic circuit, you need to size the devices such that the leakage of the pull-down network when it is off does not result in the output dropping, and you need to make sure that the weakest pull-down path is not overpowered by leakage through the p-device. If you solve the first problem with a "keeper" device (a p-fet that can be active during the evaluate phase), you exacerbate the second problem. Figure 1 here shows a dynamic gate with a keeper. If you increase the size of the pull-down devices, you need a bigger keeper.

think about the kinds of security holes that would leave in a system.. if the memory isn't zeroed out, a process that was given the page for another process could access the other's memory space by dumping the page after it gets it.
Of course. But you could use those pages as disk cache instead, and demand-zero them. The first link you gave does say it keeps a few free pages available and slowly zeros them. Is that definitely done as the System Idle Process though, and not as System?
 

itachi

Senior member
Aug 17, 2004
390
0
0
Originally posted by: CTho9305
Lots of logic styles are used, but the majority is done in CMOS, probably because it's so much more robust than everything else. It gives you rail-to-rail output, and can handle relatively large amounts of noise on the inputs. The allowable noise margins for other logic styles I've seen are significantly lower than for CMOS. Also, because of variability and leakage, various circuit styles that used to work really well work less well now (for ordinary dynamic domino logic, variability and high leakage reduce the advantage due to the extra constraints they add).

If you consider a hypothetical domino logic circuit, you need to size the devices such that the leakage of the pull-down network when it is off does not result in the output dropping, and you need to make sure that the weakest pull-down path is not overpowered by leakage through the p-device. If you solve the first problem with a "keeper" device (a p-fet that can be active during the evaluate phase), you exacerbate the second problem. Figure 1 here shows a dynamic gate with a keeper. If you increase the size of the pull-down devices, you need a bigger keeper.
what about mixing and matching? would it be possible to use cmos logic for one part without having to implement the whole thing in it?
Of course. But you could use those pages as disk cache instead, and demand-zero them. The first link you gave does say it keeps a few free pages available and slowly zeros them. Is that definitely done as the System Idle Process though, and not as System?
windows has some physical memory set aside for that kinda stuff.. i don't know too much about it though, only that it exists.
the zero thread runs while the system is idle so that if a process requests a page, it won't have to wait while it's zeroed out. it can be invoked on demand too though.. if the zeroed list is empty, the mm looks to the free list.. and if that list is empty too, it looks for a page to swap out.

yea, it's the system idle process. the system process is used to handle kernel-mode threads (drivers fall under this group). the second link i gave says that the zero-page thread is the only thread that can be scheduled with a priority of 0. if any other thread is placed in the run queue, it'll be prioritized ahead of the zero-page thread.. unless the zero-page thread was specifically invoked. the zero-page thread is a system thread, not a kernel-mode or user-mode thread.. which is why you get 'N/A' under base priority.

this is the best that i can do.
http://www.liutilities.com/products/win...processlibrary/%5Bsystem%20process%5D/
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: itachi
Originally posted by: CTho9305
Lots of logic styles are used, but the majority is done in CMOS, probably because it's so much more robust than everything else. It gives you rail-to-rail output, and can handle relatively large amounts of noise on the inputs. The allowable noise margins for other logic styles I've seen are significantly lower than for CMOS. Also, because of variability and leakage, various circuit styles that used to work really well work less well now (for ordinary dynamic domino logic, variability and high leakage reduce the advantage due to the extra constraints they add).

If you consider a hypothetical domino logic circuit, you need to size the devices such that the leakage of the pull-down network when it is off does not result in the output dropping, and you need to make sure that the weakest pull-down path is not overpowered by leakage through the p-device. If you solve the first problem with a "keeper" device (a p-fet that can be active during the evaluate phase), you exacerbate the second problem. Figure 1 here shows a dynamic gate with a keeper. If you increase the size of the pull-down devices, you need a bigger keeper.
what about mixing and matching? would it be possible to use cmos logic for one part without having to implement the whole thing in it?
One part of a gate? No. One part of a larger circuit? Yes, sometimes. One problem is that dynamic gates cannot handle glitches on their inputs, and CMOS logic loves to create glitches, so you can't feed CMOS outputs to dynamic inputs directly. You could drive CMOS with a dynamic gate though.


Of course. But you could use those pages as disk cache instead, and demand-zero them. The first link you gave does say it keeps a few free pages available and slowly zeros them. Is that definitely done as the System Idle Process though, and not as System?
windows has some physical memory set aside for that kinda stuff.. i don't know too much about it though, only that it exists.
the zero thread runs while the system is idle so that if a process requests a page, it won't have to wait while it's zeroed out. it can be invoked on demand too though.. if the zeroed list is empty, the mm looks to the free list.. and if that list is empty too, it looks for a page to swap out.

yea, it's the system idle process. the system process is used to handle kernel-mode threads (drivers fall under this group). the second link i gave says that the zero-page thread is the only thread that can be scheduled with a priority of 0. if any other thread is placed in the run queue, it'll be prioritized ahead of the zero-page thread.. unless the zero-page thread was specifically invoked. the zero-page thread is a system thread, not a kernel-mode or user-mode thread.. which is why you get 'N/A' under base priority.

this is the best that i can do.
http://www.liutilities.com/products/win...processlibrary/%5Bsystem%20process%5D/
Cool.
 

OCedHrt

Senior member
Oct 4, 2002
613
0
0
Slightly off topic, but for some reason my P-M idles hotter yet gets longer battery life (5 hrs) in WinXP, versus idling cooler with less battery life (< 3 hrs) in a different OS. It ran so hot (50C+) in WinXP that I had to install a program to underclock and undervolt the P-M when idle (~35C). In the other OS I'd say it was doing about 35C as well.