Overvolting damages CPU instead of High Temps?


genec57

Member
Nov 7, 2006
135
0
0
There was a recent thread in another forum asking for reports of CPU failure as a result of overclocking, and the only replies were a number of posts attesting to heavy overclocking with no apparent damage.
It seems to me that if you keep heat under control, especially with water cooling, you can push until you can't take the FSB any further with very little real chance of damage. If there were no risk, that would spoil half the fun of it.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
Considering that the 8GHz OC involving a Cedar Mill P4 chip used 1.9V vcore, and the chip still seems healthy... well, it kind of slants the heat/voltage-is-killer equation more to the heat side for me. Perhaps some of these awful things like electromigration happen not only due to voltage, but also in the presence of heat.
 

SerpentRoyal

Banned
May 20, 2007
3,517
0
0
Originally posted by: CTho9305
Originally posted by: genec57
I would love to see something definitive on the subject.

Manufacturers do extensive analysis to predict lifetimes under all sorts of conditions. The problem is that nobody outside of industry is willing to buy a decent sample size of CPUs and do meaningful experiments and keep them running for a couple years. It's no fun, and requires a considerable investment of time, storage space, and money. And, of course, once your study is done, people will write it off as meaningless since it was done with ancient 2-3 year old CPUs ;). It's much more fun to use unscientific anecdotal evidence and keep propagating what are effectively myths.

Originally posted by: SerpentRoyal
An hour or two at 1.5V is safe as long as the temperature is 25C lower than the Tjunction temperature. A good water-cooled rig can absorb up to about 1.6V with a willing CPU. Again, I would rely on C1E and EIST to lower the voltage when it is not needed (idle or moderate load). CPU should last at least 4 years.

Personally, I think running a 65nm CPU at 1.6V is insane no matter what your cooling is.

I suspect high-end PC vendors are probably pushing at least this much voltage with their water-cooled rigs to break the 3.6GHz barrier. They are probably bin-sorting chips, too. BTW, I remember reading somewhere about these rigs going out the door with a 3-year warranty.

 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
In my experience, small voltage increases are worse for the long-term reliability of a microprocessor (or any other CMOS VLSI integrated circuit) than heat.

If you have a CPU that's running at 1.2V and 40 Celsius (on-die temp) and increase either the temperature or the voltage by 50% - so 1.8V and 60C - and leave everything else equal, the increased 1.8V voltage will have a vastly higher impact on long-term reliability of the CPU than the temperature. As long as you keep the CPU temperature lower than the maximum temperature of the design (usually higher than 70 Celsius, often above 100 Celsius), the long-term reliability impact of a small percentage increased heat is minimal. Even a small amount of increased voltage on the other hand, can have a profound impact on long-term reliability.

To drop into the more esoteric discussion of why this is the case, let's start with what causes failures. There are numerous failure mechanisms that cause CPU's to fail over time. Among these, the most common are:
Electromigration (EM): http://en.wikipedia.org/wiki/Electromigration
Hot-electron gate ("hot-e"): http://siliconfareast.com/hotcarriers.htm
Time dependent dielectric breakdown (TDDB): http://siliconfareast.com/oxidebreakdown2.htm
Bond/solder failures (including fatigue failures): http://siliconfareast.com/relmodels3.htm
Bias Temperature Instability (BTI): http://cobweb.ecn.purdue.edu/~...tutorial-nbti-alam.pdf

There are others, but these are the common ones nowadays. There's a good list at the siliconfareast.com site that I listed above.

This subject is a complex one, but one thing that you can quickly pick up by glancing over the links above is that temperature is not a big lever in causing many of these problems - but voltage is. The equations for failure in hot-e, TDDB and BTI don't include temperature at all - or if they do, it's a 2nd-order effect - while voltage is a huge lever; often the square of the voltage is an input. Hot-e failures are actually worse at lower temperatures than at higher ones.

Which issue of the list above is likely to kill a given CPU depends on the process technology of the company that fabricated the CPU and the microprocessor's circuit and layout design. For one CPU, the most common failure mechanism might be electromigration, for another, the interconnect process used in the CPU manufacturing process might be thicker, or contain more copper atoms, and so it might be something else - like TDDB.

Also, as several posters mentioned above, voltage has a huge impact on temperature. The simplified formula for the dynamic power of a CPU is P = Cf(V^2) - where P is the CPU power, C is the on-die capacitance that needs to be switched, f is the frequency of the clock, and V is voltage. Note that it's the square of the voltage that is calculated in... so increasing the voltage just a little raises the power of the CPU (and thus the temperature, all things being equal) in proportion to the square of the voltage, while increasing the frequency only has a linear effect. Voltage also has a large impact on static power (i.e. leakage).
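
To put rough numbers on that formula, here is a minimal Python sketch. It only evaluates the simplified P = Cf(V^2) relation quoted above, so it ignores leakage and everything else a real chip does. The 3.0 GHz / 1.2V baseline and the function name `power_ratio` are made up for illustration; since C cancels out in a ratio, no capacitance value is needed.

```python
def power_ratio(f_old, v_old, f_new, v_new):
    """Ratio of new to old dynamic power under P = C * f * V^2.

    The switched capacitance C is assumed unchanged, so it cancels out.
    """
    return (f_new / f_old) * (v_new / v_old) ** 2


# Hypothetical baseline (not taken from any real part): 3.0 GHz at 1.2 V.
base_f = 3.0e9
base_v = 1.2

# Raise only the clock by 10%: power scales linearly with frequency.
freq_bump = power_ratio(base_f, base_v, base_f * 1.10, base_v)

# Raise only the voltage by 10%: power scales with the square of voltage.
volt_bump = power_ratio(base_f, base_v, base_f, base_v * 1.10)

# The 50% voltage increase used as an example earlier in this post.
big_volt_bump = power_ratio(base_f, base_v, base_f, 1.8)

print(f"+10% frequency: {freq_bump:.2f}x dynamic power")
print(f"+10% voltage:   {volt_bump:.2f}x dynamic power")
print(f"1.2V -> 1.8V:   {big_volt_bump:.2f}x dynamic power")
```

Under this simplified model, a 10% clock bump costs about 10% more dynamic power, a 10% voltage bump costs about 21% more, and going from 1.2V to 1.8V more than doubles it (2.25x), all before leakage is counted.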

So even for something like electromigration - which depends on the current density (which in turn depends on the voltage) as well as on temperature - increasing the voltage increases both the temperature (due to increased power) and the current density, while increasing the temperature only increases the temperature.

Above a certain temperature, however, some of the organic compounds used in the manufacturing of the CPU start to break down. For example, the polyimide layer used in passivation starts to break down between 110 and 135 Celsius. So if you start to get above 125C, you will essentially start to "burn up" the CPU. I don't know what the breakdown temperature of the resin underfill used on BGA packages is, but based on my knowledge of the composition, it should have a breakdown temperature between 120 and 150 Celsius. The OLGA packages also will have a fairly "low" breakdown temperature (relative to the silicon anyway).

As far as specific examples involving CPUs like a Cedar Mill running at 1.9V go, I find it extremely hard to believe that someone could run a 65nm microprocessor at 1.9V for more than a couple of months of continuous use. The mean-time-to-failure (MTTF) on a 65nm microprocessor at 1.9V should be extremely short based on my experience.

Patrick Mahoney
Senior Design Engineer
Intel Corp.

 

graysky

Senior member
Mar 8, 2007
796
1
81
@PM: Great post, dude. I've been posting that "dynamic power" formula for a while now in the context of heat production. I guess with all you said in mind, the question becomes not IF but WHEN. In other words, if people are on a 2 year replacement cycle, will the hardware last before it fails given overclocked voltage, FSB, etc.? (That's a philosophical question, not one directed at you.)
 

Diogenes2

Platinum Member
Jul 26, 2001
2,151
0
0
Originally posted by: pm
In my experience, small voltage increases are worse for the long-term reliability of a microprocessor (or any other CMOS VLSI integrated circuit) than heat.

-------------------

Patrick Mahoney
Senior Design Engineer
Intel Corp.
What exactly is ' your experience ' with over-volting test beds .

What's ' long term ' ?

In general, the people who ' over-volt ' in the enthusiast communities are not people who keep a CPU very long ..

Intel has been practicing ' overvolting ' themselves.

Many of their higher-clocked CPUs from the P3 onward just had a higher vcore than the slower parts..

They don't seem to be ( and understandably so ) very forthcoming about what an upper safe limit ( for long term reliability ) is for a particular fab ..


I've never really liked the term ' overclocked '..

When a part is really overclocked, it don't work .. :D


I think ' maximized ' is a better term.
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Originally posted by: pm
In my experience, small voltage increases are worse for the long-term reliability of a microprocessor (or any other CMOS VLSI integrated circuit) than heat.

Originally posted by: apoppin
we DID discuss this years ago with PM - an elite member who really knew his stuff working as a CPU engineer for intel and he really did know about OC'ing. Unfortunately this info's details are hidden in my brain and in bookmarks from probably 4 years ago.
speak of the de ... Hi Patrick!
:)

it is really good to have you weigh back in here again ... i remember most of our discussions of 3-5 years ago and i had lost most of my bookmarks ... well, buried away. Glad to have a refresher! :p

--as to "over clocked" ... it is "clocked over" the stock speeds
 

idiotekniQues

Platinum Member
Jan 4, 2007
2,572
0
76
Originally posted by: SerpentRoyal
An hour or two at 1.5V is safe as long as the temperature is 25C lower than the Tjunction temperature. A good water-cooled rig can absorb up to about 1.6V with a willing CPU. Again, I would rely on C1E and EIST to lower the voltage when it is not needed (idle or moderate load). CPU should last at least 4 years.

i enabled c1e and EIST and it did not affect voltages (in speedfan) but i did notice the chip clock itself down via cpuz.

i more wanted it for the voltage decrease though.

any suggestions?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
In other words, if people are on a 2 year replacement cycle, will the hardware last before it fails given overclocked voltage, FSB, etc.? (That's a philosophical question, not one directed at you.)
As long as people understand the risks, then they should do whatever they want with the CPU that they have. I remember people complaining about "Sudden Northwood Death Syndrome" - where their Northwood CPU would just suddenly stop working. So even people on a 2 year replacement plan can run into issues.

What exactly is ' your experience ' with over-volting test beds .
I spent over a year doing electrical marginality testing on several steppings of the original Pentium (P54CS) which included high-temperature and high-voltage marginality debugging - primarily on functional testers, but also some system-based work. And then I've spent 3 years working on electrical marginality testing on high-end server microprocessors - mostly on functional pin-testers. I have run a lot of shmoos on a lot of parts over the years... and nearly all of these shmoos were pushed up into the elevated voltage range.

In a nutshell, I think I can safely say that I have directly destroyed more CPUs by running them at high voltage and high temps than anyone else at AnandTech.

What's ' long term ' ?
Good point. When I say long term, I generally mean more than 5 years. And, since I work in electrical marginality on servers, often more than 10 years. I know this is "really, really long term" to most AT'ers. Still, the farther you stray from the spec, the more likely you are to run into problems sooner. But this is a good point - my "long term" is more long term than most on here would care about.

Intel has been practicing ' overvolting ' themselves. Many of their higher clocked CPU's from the P3 onward, just had a higher vcore than the slower parts. They don't seem to be ( and understandably so ) very forthcoming about what an upper safe limit ( for long term reliability ) is for a particular fab.
Let me see if I can figure out what Intel has publicly disclosed about our multiVID test techniques and then, if I can, I'll comment on this.

They don't seem to be ( and understandably so ) very forthcoming about what an upper safe limit ( for long term reliability ) is for a particular fab ..
Our fabs are "copy-exact" and so there shouldn't be a particular value for a given fab. But if by "fab" you mean process technology, then you are right, reliability vs. voltage numbers are rarely disclosed (although I would think that if you search IEEE Xplore, you might see Intel authors of papers discussing this... but I'm not sure and don't have time to check).

Hey, Apoppin. :) I never really left. I lurk. A lot. :)