Microsoft research on hardware failures.

ShintaiDK · Jul 5, 2012

It's from 2011, but still interesting. Not sure if it has been posted before.

1 million PCs examined. Overclocking the CPU makes an OS crash 4-20x more likely. Underclocking reduces the chance by 40-80%.

http://research.microsoft.com/pubs/144888/eurosys84-nightingale.pdf

philipma1957 · Jul 5, 2012

Nice article. It did not spell out the i5 2500T or the i7 3770T as factory-underclocked chips, but I think of both CPUs as underclocked, super-stable CPUs. My interpretation of this article is that if stability is important, those T chips are where it's at.

I read it quickly, and laptops crash less than desktops (maybe desktops are overclocked a lot more).

The other point was that HDDs die off 2x quicker than their rated MTTF. Other info: 2% of 480,000 CPUs were OC'ed, and one vendor's CPUs had a 20x chance of a crash while the other vendor's had a 4x chance, compared to non-OC'ed CPUs. I am trying to figure out which one is Intel, the 20x or the 4x. This is a very interesting number.

pantsaregood · Jul 5, 2012

philipma1957 said:
Nice article. It did not spell out the i5 2500T or the i7 3770T as factory-underclocked chips, but I think of both CPUs as underclocked, super-stable CPUs. My interpretation of this article is that if stability is important, those T chips are where it's at.

I read it quickly, and laptops crash less than desktops (maybe desktops are overclocked a lot more).

The other point was that HDDs die off 2x quicker than their rated MTTF. Other info: 2% of 480,000 CPUs were OC'ed, and one vendor's CPUs had a 20x chance of a crash while the other vendor's had a 4x chance, compared to non-OC'ed CPUs. I am trying to figure out which one is Intel, the 20x or the 4x. This is a very interesting number.

I wouldn't assume either of them were "super stable," as they run at a lower voltage than the non-T/S chips. Stability gain would likely be most significant on slightly overvolted units that have been underclocked by a relatively significant amount.

There's some extremely good research in here. When I opened this, I didn't expect it to be so in-depth, nor did I expect such a lack of bias or error. Usually "research" is good for little more than pointing you in the ballpark direction of truths, but this pretty well isolates every possible factor within reason.

Zap · Jul 5, 2012

That looks really neat. I just skimmed the first page, but will read it at my leisure later. Thanks!

IGemini · Jul 5, 2012

Good stuff. Marked for later reading.

borisvodofsky · Jul 5, 2012

Why do you guys bother reading this when you KNOW that, no matter what number they present, you're STILL going to overclock?

LOLOLOL

philipma1957 · Jul 5, 2012

borisvodofsky said:
Why do you guys bother reading this when you KNOW that, no matter what number they present, you're STILL going to overclock?

LOLOLOL

Some have two systems, like I do.

Oh, I OC the 2500K to 4.2, and the HD 6870 card inside it by 10%.

But I feel a lot better about the i7 3770T after reading this.

Idontcare · Jul 5, 2012

ShintaiDK said:
It's from 2011, but still interesting. Not sure if it has been posted before.

1 million PCs examined. Overclocking the CPU makes an OS crash 4-20x more likely. Underclocking reduces the chance by 40-80%.

http://research.microsoft.com/pubs/144888/eurosys84-nightingale.pdf

Gotta love those soft-errors and silent corruption. Basically it doesn't matter who is doing the "over" clocking - be it the CPU maker during binning or the end-user while OC'ing. Clocking too high makes an unstable system.

Borealis7 · Jul 5, 2012

Didn't read the article... but I bet it's still a lot less than crashes originating from nVidia and AMD drivers.

KingFatty · Jul 5, 2012

So this study is based on data sets from the Windows Error Reporting (WER) system. I wonder if the study is affected by the kinds of people who click "submit" when presented with the WER prompt vs. the kinds of people who don't? I was concerned that maybe the data is skewed by overclockers who intentionally crash their systems to find overclocking limitations (e.g., an overclocker would fall into the pattern of a machine that suffers a failure repeatedly, which is similar to a conclusion the study found where a system that crashes once is likely to crash again). But, maybe overclockers won't be clicking the submit button on the WER prompt therefore keeping this study's source data relatively unbiased? I didn't read the whole study carefully enough to see how they dealt with these factors.

Homeles · Jul 5, 2012

Borealis7 said:
Didn't read the article... but I bet it's still a lot less than crashes originating from nVidia and AMD drivers.

I've bumped the AMD driver crash count up significantly.

AsusGuy · Jul 5, 2012

Interesting article, although I have rarely experienced a CPU issue that was caused by overclocking, so I don't think this will affect my OC habits much. CPU failures in any system, OC or non-OC, seem so minor that I feel like it's a moot point.

borisvodofsky · Jul 5, 2012

AsusGuy said:
Interesting article, although I have rarely experienced a CPU issue that was caused by overclocking, so I don't think this will affect my OC habits much. CPU failures in any system, OC or non-OC, seem so minor that I feel like it's a moot point.

The reason for the low failure rate is because most people just browse the internet and watch porn.

Overclocking has been proven to be perfectly porn stable.

TuxDave · Jul 5, 2012

Idontcare said:
Gotta love those soft-errors and silent corruption. Basically it doesn't matter who is doing the "over" clocking - be it the CPU maker during binning or the end-user while OC'ing. Clocking too high makes an unstable system.

Soft error rate is the bane of my existence. As we start packing more and more transistors into a smaller area, the spec to protect against SER for each core gets higher and higher.

piasabird · Jul 5, 2012

Must be all that lousy google software.

peonyu · Jul 5, 2012

It just confirms what overclockers have known since... well, since overclocking first started. Run programs to test stability if you overclock, and up the vcore if it's not stable. Test your RAM, check your cooling. Rinse and repeat. Even then an overclocked setup won't be as stable as a non-overclocked unit, but if done properly and tested I would hardly call it unstable.

Of course, in Microsoft's case it's anyone's guess whether they upped the voltage in their tests and properly overclocked. It's really not in their best interest to do so anyway: they sell software, and most [casual] people who overclock don't test their systems properly, so I'm sure they do crash a lot. Microsoft likely receives a lot of tech support calls about OS crashes as though the crash were their fault when it's not. OCing is not something they want to encourage.

Kristijonas · Jul 5, 2012

KingFatty said:
So this study is based on data sets from the Windows Error Reporting (WER) system. I wonder if the study is affected by the kinds of people who click "submit" when presented with the WER prompt vs. the kinds of people who don't? I was concerned that maybe the data is skewed by overclockers who intentionally crash their systems to find overclocking limitations (e.g., an overclocker would fall into the pattern of a machine that suffers a failure repeatedly, which is similar to a conclusion the study found where a system that crashes once is likely to crash again). But, maybe overclockers won't be clicking the submit button on the WER prompt therefore keeping this study's source data relatively unbiased? I didn't read the whole study carefully enough to see how they dealt with these factors.

I think overclockers are less likely to send reports than regular/business users. What adds to the plausibility of a clock/crash connection is the stability of underclocked systems. It shows that underclocked systems (most of which are untampered, unlike overclocked systems) are more stable than regular systems, which explicitly shows a connection between clock speed and crash rate.

Anyway, great find, ShintaiDK!

Subyman · Jul 5, 2012

I think the results may be skewed. If Windows is sending a report with every BSOD recovery, then I sent them well over 10 when adjusting an overclock on a new chip. I'm sure a lot of AnandTech forum goers have sent their fair share of BSOD crashes due to testing OCs and BIOS settings. That doesn't mean we run it like that daily.

soccerballtux · Jul 5, 2012

A lot of people who overclock are stability testing, though. So of course there are going to be failures.
What would be interesting is the number of failures per PC after the PC is "stable", i.e., what's the standard deviation of the failure rate? Obviously one PC fails a lot at first until the dude figures out a good voltage/frequency.
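
A rough sketch of that calculation, assuming you had per-machine crash logs (the study's raw data is not public, so everything here is made up for illustration). The names `crashes_after_tuning`, `summarize`, the 14-day `tuning_window_days` cutoff, and the `fleet` data are all hypothetical, not anything from the paper:

```python
from statistics import mean, pstdev


def crashes_after_tuning(crash_days, tuning_window_days=14):
    """Count crashes that occur after an assumed initial tuning window.

    crash_days: days (since the machine was first seen) on which a crash
    was reported. The 14-day default is an arbitrary assumption.
    """
    return sum(1 for day in crash_days if day > tuning_window_days)


def summarize(fleet, tuning_window_days=14):
    """Return (mean, population standard deviation) of post-tuning crash counts."""
    counts = [
        crashes_after_tuning(days, tuning_window_days)
        for days in fleet.values()
    ]
    return mean(counts), pstdev(counts)


# Hypothetical fleet: pc_a crashes a lot while being tuned, then rarely.
fleet = {
    "pc_a": [0, 1, 1, 2, 3, 40],
    "pc_b": [5, 200],
    "pc_c": [],
}

avg, spread = summarize(fleet)
print(f"mean post-tuning crashes per PC: {avg:.2f}, std dev: {spread:.2f}")
```

Dropping the early window is only one way to separate "still dialing it in" from "daily driver"; a real analysis would also need to know when each machine's overclock actually changed.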

Ferzerp · Jul 5, 2012

I think the ECC-on-the-desktop zealots need to read this.

Fox5 · Jul 6, 2012

Ferzerp said:
I think the ECC-on-the-desktop zealots need to read this.

Doesn't it kind of confirm that ECC would be useful?

Bill Brasky · Jul 6, 2012

I particularly enjoyed this tidbit about creating an OS that is hardware-fault aware. Pretty neat ideas!

"For example, a hardware-fault-tolerant (HWFT) OS might map out faulty memory locations, just as disks map out bad sectors. In a multi-core system, the HWFT OS could map out intermittently bad cores. More interestingly, the HWFT OS might respond to an MCE by migrating to a properly functioning core, or it might minimize susceptibility to MCEs by executing redundantly on multiple cores. A HWFT OS might be structured such that after boot, no disk read is so critical as to warrant a crash on failure. Kernel data structures could be designed to be robust against bit errors. Dynamic frequency scaling, currently used for power and energy management, could be used to improve reliability, running at rated speed only for performance-critical operations. We expect that many other ideas will occur to operating-system researchers who begin to think of hardware failures as commonplace even on single machines."

bononos · Jul 6, 2012

Fox5 said:
Doesn't it kind of confirm that ECC would be useful?

That's probably what he meant: that the zealots (his words) will seize on this interesting Microsoft study and hammer away.

samboy · Jul 6, 2012

soccerballtux said:
A lot of people who overclock are stability testing, though. So of course there are going to be failures.
What would be interesting is the number of failures per PC after the PC is "stable", i.e., what's the standard deviation of the failure rate? Obviously one PC fails a lot at first until the dude figures out a good voltage/frequency.

Agreed. My thought was also that this would significantly bias things.

Ferzerp · Jul 6, 2012

Fox5 said:
Doesn't it kind of confirm that ECC would be useful?

It shows that memory failures are far, far less of an issue than processor and disk issues.

Of course, to the ECC zealots, sure, it will be taken as proof, but it shows just how far down the list their pet problem really is.
