Do CPUs fail stress tests at stock?

Page 2 - AnandTech community forums

JimmiG

Platinum Member
Feb 24, 2005
2,024
112
106
It should cope just fine, e.g. left running for 5 years.

Prime95 is a VERY good example, because some of the people who want to be the first to find a new giant prime number (a Mersenne prime) do exactly what you describe.

Really, that is what Prime95 is for: searching for big (Mersenne) prime numbers. The CPU stress testing is a by-product of that work.

So I assume overclocking is a big NO-NO if you're going to use Prime95 for the intended purpose. I mean, if an overclocked system is stable for 100 days, all you really know is that it's stable for 100 days. Maybe it would have produced errors two minutes into day 101...and all your work up until that point would have been wasted.

24 hours seems to be about the longest most overclockers can be bothered to run it as a stress test. Tying up your main computer for longer than that is really not practical.
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
2,417
75
91
JimmiG said:
So I assume overclocking is a big NO-NO if you're going to use Prime95 for the intended purpose. I mean, if an overclocked system is stable for 100 days, all you really know is that it's stable for 100 days. Maybe it would have produced errors two minutes into day 101...and all your work up until that point would have been wasted.

24 hours seems to be about the longest most overclockers can be bothered to run it as a stress test. Tying up your main computer for longer than that is really not practical.

The "real work" usage of Prime95 copes with potential "soft errors" by double-checking the results (I have forgotten the exact checking scheme Prime95 uses).

E.g. it thinks 123456789... is a very big (Mersenne) prime number.
A different computer can double-check the result before it is accepted.
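For the curious: the "real work" is the Lucas-Lehmer test, the recurrence GIMPS/Prime95 uses to decide whether a Mersenne number 2^p - 1 is prime, and the double-check re-runs the computation independently and compares results. A bare-bones Python sketch of the recurrence (nothing like Prime95's optimized FFT arithmetic):

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test: for an odd prime p, 2**p - 1 is prime iff s ends at 0."""
    m = (1 << p) - 1          # the Mersenne number 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # the Lucas-Lehmer recurrence, reduced mod m
    return s == 0

# Exponents of the first few Mersenne primes: 3, 5, 7, 13, 17, 19, 31
print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer(p)])
# → [3, 5, 7, 13, 17, 19, 31]
```

The double-checking the post describes works because two independent runs of this deterministic recurrence must end at the same value; a mismatch means one of the machines computed something wrong.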

If you want to do mainly "real" prime95 work on a computer, I would recommend NOT overclocking it. But you could still overclock it.

As a rule of thumb, if prime95 has found no problems after 24 hours (exact length of time debatable), you can probably declare your computer fit for purpose.

If I went back to overclocking, I would use Prime95 to find the limit point at which the computer "just" works while overclocked.

E.g. 4.4 GHz gives Prime95 errors every hour;
4.3 GHz gives no errors in Prime95, even after 24-48 hours.

Then I would be tempted to back the overclock off to a safety-margin point, such as 4 GHz (or lower, maybe halfway between the max overclock and the default speed), and leave it at that.
I know that the "safety margin" method is wildly against the philosophy/principles/2nd-amendment-rights/law of MOST keen overclockers.

But I favour stability, lack of silent data corruption/errors, and reliability/longevity over a relatively modest performance loss.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
JimmiG said:
So I assume overclocking is a big NO-NO if you're going to use Prime95 for the intended purpose. I mean, if an overclocked system is stable for 100 days, all you really know is that it's stable for 100 days. Maybe it would have produced errors two minutes into day 101...and all your work up until that point would have been wasted.

24 hours seems to be about the longest most overclockers can be bothered to run it as a stress test. Tying up your main computer for longer than that is really not practical.

Actually the real situation is far worse and far scarier than that.

Statistically your semi-stable rig is just as likely to have an error 5 minutes into a Prime95 test as it is to have an error 5 days into the test.

What changes over time is the likelihood that your computer will have had an error by that point in time in the test.

It is like rolling a die. The first time you roll it you have a 1 in 6 chance of rolling a 3. The second time you roll it you have a 1 in 6 chance of rolling a 3. The third time you roll it you have a 1 in 6 chance of rolling a 3...

But the odds of having rolled a 3 at some point increases the more times you roll the dice...such that by the time you roll the dice 6 times you will have a 100% likelihood of having rolled a 3 at some point in those 6 rolls.

What this means for our computers is a funny thing. You can try this test (I did, and it works) and see for yourself: dial in an OC on your rig that you know is semi-stable.

Meaning, set the Vcore or clockspeed such that you know Prime95 will have an error in a reasonable amount of time, say 30 minutes (you don't want the experiment to take weeks, after all), and run Prime95. Write down how long it took before the first error was detected. Now restart the test and do the same (write down the time to failure).

Do this a reasonable number of times, say 10 times, and note that your computer doesn't fail at a consistent point in time. Sometimes it will fail really quickly, other times it takes hours.
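That repeat-the-test experiment is easy to simulate without tying up real hardware. Assuming an exponential (memoryless) time-to-failure with a 30-minute mean, matching the hypothetical settings above:

```python
import random

random.seed(1)
MEAN_TTF_MIN = 30.0  # assumed mean time-to-failure (minutes) of the semi-stable OC

# Ten repeat runs at identical settings: a memoryless failure process
# gives wildly different times to the first error from run to run.
times = [random.expovariate(1 / MEAN_TTF_MIN) for _ in range(10)]
print([round(t, 1) for t in times])
```

Some simulated runs fail within minutes, others take well over an hour, exactly the spread the experiment above reveals on real hardware.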

And this is the reality of failure rates and statistics. There is no such thing as a rig that is "24hrs Prime stable!". All that means is that it passed 24 hours of Prime95 once (or maybe a couple of times), and the end-user decided that was a good enough sample population to conclude their rig will never crash while running Prime95 for less than 24 hours. That conclusion is false: the rig still has some chance of failing in under 24 hours, and how much chance is unknown, because the user never adequately characterized the true statistical failure rate of the parent distribution.

And that is where we come full circle to your example. If an OC'ed system fails on average once per 100 days, that corresponds to roughly a 1%/day failure rate: on any given day there is a 1% chance that your OC'ed computer is going to fail.

That means it could fail on day 1; the chance is small (only 1%) but still not zero. By day 100 your accumulated likelihood of having experienced a failure at some point is about 63% (1 - 0.99^100). And you may not experience a failure even then, as the chance of failure on day 101 is still only 1%, and so on.
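As a quick sanity check on the 1%-per-day arithmetic (a constant-rate assumption), the accumulated probability grows with time but never actually reaches 100%:

```python
# Constant 1%-per-day failure chance: cumulative probability of at least
# one failure by day n is 1 - 0.99**n.
p_day = 0.01
for days in (1, 10, 50, 100, 300):
    p_fail = 1 - (1 - p_day) ** days
    print(f"day {days:3d}: {p_fail:6.1%}")
# day 100 comes out near 63%, not 100%
```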
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
2,417
75
91
Idontcare said:
Actually the real situation is far worse and far scarier than that.

Statistically your semi-stable rig is just as likely to have an error 5 minutes into a Prime95 test as it is to have an error 5 days into the test.

What changes over time is the likelihood that your computer will have had an error by that point in time in the test.

You remind me of the older days, when I used to overclock a lot.

I can't remember the exact story, but it goes something like this ...

I managed to get a very good overclock running for the technology of that time, so I eagerly watched Prime95, ok, ok, a few hours, ok, ok...

Great, it has passed for 3 hours (guesstimate), so I get a celebratory coffee (or something).
My excitement fades when I check the screen again: Prime95 = errors 1.

Yes, the statistics are horrible.

I'm very worried if a very long stress test of Prime95 fails after, say, 36 hours.
Because even if I reduce the overclock and it runs perfectly fine for 48 hours, maybe it needs a lot longer to fail again.
I.e. the 36 hours could have been a lucky, early catch of a failure that normally takes 360 hours to find.
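That intuition checks out under a memoryless (exponential) failure model. If the true mean time-to-failure were 360 hours, catching the first error within 36 hours would not even be unusual:

```python
import math

# Exponential failure model: chance of seeing the first error within
# 36 hours when the true mean time-to-failure is 360 hours.
mean_ttf_hours = 360.0
p_by_36h = 1 - math.exp(-36 / mean_ttf_hours)
print(f"{p_by_36h:.1%}")  # → 9.5%
```

Roughly a 1-in-10 chance, so a 36-hour failure really can be an early glimpse of a much slower-burning instability.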

These days I'm so fed up with the hassle, I tend not to overclock.
 

Ferzerp

Diamond Member
Oct 12, 1999
6,438
107
106
Idontcare said:
But the odds of having rolled a 3 at some point increases the more times you roll the dice...such that by the time you roll the dice 6 times you will have a 100% likelihood of having rolled a 3 at some point in those 6 rolls.

Actually, it's only 66.5% :p (5/6)^6 gives a 33.5% chance of no 3. It will approach, but never equal 100% :p
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Ferzerp said:
Actually, it's only 66.5% :p (5/6)^6 gives a 33.5% chance of no 3. It will approach, but never equal 100% :p

Doh! You are of course correct :oops:

For anyone else who is curious to know what Ferzerp already knows, the following wall of text will help you understand :)

Doctor TWE said:
In one roll, the probability of rolling a 6 is 1/6.

For two rolls, there is a 1/6 probability of rolling a six on the first roll.
If this occurs, we've satisfied our condition. There is a 5/6 probability
that the first roll is not a 6. In that case, we need to see if the second
roll is a 6. The probability of the second roll being a 6 is 1/6, so our
overall probability is 1/6 + (5/6)*(1/6) = 11/36. Why did I multiply the
second 1/6 by 5/6? Because I only need to consider the 5/6 of the time that
the first roll wasn't a 6. As you can see the probability is slightly less
than 2/6.

For three rolls, there is a 1/6 probability of rolling a six on the first
roll. There is a 5/6 probability that the first roll is not a 6. In that
case, we need to see if the second roll is a 6. The probability of the second
roll being a 6 is 1/6, giving us a probability of 11/36. There is a 25/36
probability that neither of the first two rolls was a 6. In that case, we
need to see if the third roll is a 6. The probability of the third roll being
a 6 is 1/6, giving us a probability of 1/6 + (5/6)*(1/6) + (25/36)*(1/6) =
91/216. Again, this is less than 3/6.

The general formula for rolling at least one 6 in n rolls is 1 - (5/6)^n.

source
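Doctor TWE's closing formula is easy to verify numerically, and reproduces both his intermediate fractions and Ferzerp's 66.5% figure:

```python
# P(at least one 6 in n rolls) = 1 - (5/6)**n
p_at_least_one = {n: 1 - (5/6) ** n for n in (1, 2, 3, 6, 12, 24)}
for n, p in p_at_least_one.items():
    print(f"{n:2d} rolls: {p:.4f}")
# 1 roll → 0.1667 (1/6), 2 → 0.3056 (11/36), 3 → 0.4213 (91/216), 6 → 0.6651
```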
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
SOFTengCOMPelec said:
I'm very worried if a very long stress test of Prime95 fails after, say, 36 hours.
Because even if I reduce the overclock and it runs perfectly fine for 48 hours, maybe it needs a lot longer to fail again.
I.e. the 36 hours could have been a lucky, early catch of a failure that normally takes 360 hours to find.

These days I'm so fed up with the hassle, I tend not to overclock.

Try two weeks worth of stability testing, before it errors.

See my old thread about it, "Stable, but not stable enough"

http://forums.anandtech.com/showthread.php?t=266521
 

JimmiG

Platinum Member
Feb 24, 2005
2,024
112
106
Idontcare said:
Meaning, set the Vcore or clockspeed such that you know Prime95 will have an error in a reasonable amount of time, say 30 minutes (you don't want the experiment to take weeks, after all), and run Prime95. Write down how long it took before the first error was detected. Now restart the test and do the same (write down the time to failure).

Do this a reasonable number of times, say 10 times, and note that your computer doesn't fail at a consistent point in time. Sometimes it will fail really quickly, other times it takes hours.

That makes sense, but strangely I've found my computer didn't quite behave like that when I was testing my overclock.

At 1.20V, I ran Prime95 and it failed after about 3 hours. I slowly increased the voltage a few millivolts at a time and re-ran the test each time, and there was a definite pattern: increasing Vcore increased the time between failures. At 1.204V it would still fail after about 3 hours. At 1.208V it managed about 6-8 hours; at 1.214V I was up to around 12 hours. Statistically, shouldn't some runs between 1.20V and 1.214V have resulted in a failure before ~3 hours, since they were all relatively unstable?

At 1.218V I can start Prime95 at night before I go to bed and it will still be running when I come back from work the next evening.
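This observation can be reproduced with a toy simulation. Assuming an exponential (memoryless) time-to-failure whose mean rises with Vcore (the Vcore-to-mean mapping below is invented for illustration, loosely following the numbers in the post), individual runs still overlap heavily between voltages:

```python
import random

random.seed(0)
# Hypothetical mean times-to-failure (hours) at each Vcore setting;
# the exact mapping here is made up for illustration.
mean_ttf = {1.200: 3.0, 1.208: 7.0, 1.214: 12.0}

for vcore, mean in mean_ttf.items():
    # Five simulated Prime95 runs at this voltage
    runs = sorted(random.expovariate(1 / mean) for _ in range(5))
    print(f"{vcore:.3f} V:", ", ".join(f"{t:5.1f} h" for t in runs))
```

Higher Vcore shifts the whole distribution upward, but short runs still occur at every voltage, so a handful of tests per setting can easily look like a clean, deterministic pattern.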
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
2,417
75
91
VirtualLarry said:
Try two weeks worth of stability testing, before it errors.

See my old thread about it, "Stable, but not stable enough"

http://forums.anandtech.com/showthread.php?t=266521

Thanks, that is an interesting thread.

It kind of proves Idontcare's (and the OP's) theory: even if you run it for many days, you still DON'T necessarily know how stable it is in the long term, as it may need the full 28/30 days before failing, or even longer.

I'm not sure ECC memory would help that much if the computer is being overclocked, because the ECC might simply mask the RAM errors, letting the overclock creep up a bit further until even the ECC is overwhelmed and the computer crashes again.
(Without trying it, which I can't as I have no ECC systems, I'm not sure whether what I just said would apply or not.)

Interestingly, the upcoming DDR4 actually has a bit of ECC-like functionality built in as standard. It is NOT full ECC data correction; it adds a CRC check on write data and parity checking on the command/address bus. But it still sounds good.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
JimmiG said:
That makes sense, but strangely I've found my computer didn't quite behave like that when I was testing my overclock.

At 1.20V, I ran Prime95 and it failed after about 3 hours. I slowly increased the voltage a few millivolts at a time and re-ran the test each time, and there was a definite pattern: increasing Vcore increased the time between failures. At 1.204V it would still fail after about 3 hours. At 1.208V it managed about 6-8 hours; at 1.214V I was up to around 12 hours. Statistically, shouldn't some runs between 1.20V and 1.214V have resulted in a failure before ~3 hours, since they were all relatively unstable?

At 1.218V I can start Prime95 at night before I go to bed and it will still be running when I come back from work the next evening.

It is all a matter of probability and how many times you run the test in a series.

Sometimes people who buy a single lottery ticket win the lottery while people who buy 5 and 10 tickets for the same lottery win nothing.

The point with doing the tests is to realize that a machine that passes an arbitrarily long (be it 1hr, 6hrs, 24hrs, 2wks) test once is not proof that it will do so a second time.

To have any kind of confidence (the statistical kind, not the ego-driven kind) that your machine truly has a near-zero expectation (as in mean, average value) of failing within the previously defined arbitrary test time (be it 1hr, 6hrs, 24hrs, 2wks), you must generate lots and lots of test runs and then statistically analyze the data with a time-to-failure model.
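A minimal sketch of that kind of characterization, assuming an exponential time-to-failure model (the sample times below are invented):

```python
# Under an exponential time-to-failure model, the maximum-likelihood estimate
# of a constant failure rate is (number of failures) / (total time observed).
ttf_hours = [0.4, 2.1, 3.0, 5.7, 0.9, 4.2, 1.5, 2.8]  # invented sample data

rate = len(ttf_hours) / sum(ttf_hours)  # failures per hour
mean_ttf = 1 / rate
print(f"estimated mean time-to-failure: {mean_ttf:.1f} h")  # → 2.6 h
```

With only a handful of runs the estimate itself has wide error bars, which is exactly why a single "24hrs stable" pass tells you so little.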

From that you can then speak to the reliability (again a statistically defined characterization) of your OC'ed machine.

What you are observing is that there is a great deal of overlap in the parent distributions of the failure rates for the given points on the shmoo plot that you were sampling. That is expected when you are dealing with a pseudo-controlled experimental setting (not in a lab with super-accurate power supplies, voltmeters, temperature controls, etc).