"Nine nines" stability and overclocking

Syzygies

Senior member
Mar 7, 2008
229
0
0
I'm overclocking a Q6600 to 3.2 GHz on air, for scientific computation. This is an arbitrary stopping point, with max core temps of 60 to 62 °C depending on ambient.

The title of this thread is taken from the "Erlang" parallel programming language, developed by Ericsson, a phone company. Like a pair of LL Bean boots where one replaces the uppers m times and the lowers n times, still calling them the same boots, one can continually hot-swap code and hardware into an Erlang program and call it the same program. Their primary concurrency goal is not to harness extra speed from multiple processors, but to have a self-healing system, e.g. one that survives an avalanche taking out a village where half their machines are running. They tout a 99.9999999% reliability rate ("nine nines"), which is unheard of in telecom. None of our overclocked boxes come anywhere close to this reliability, not that we keep any of them long enough to find out!
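
For a sense of scale, nine nines works out to roughly 30 milliseconds of downtime per year:

    # How much downtime per year "nine nines" of availability allows.
    seconds_per_year = 365.25 * 24 * 3600      # about 31.6 million seconds
    availability = 0.999999999                 # "nine nines"
    downtime_ms = (1 - availability) * seconds_per_year * 1000
    print(f"{downtime_ms:.1f} ms of downtime per year")   # about 31.6 ms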

My hardest stress test is not Prime95 (mprime on Linux) but rather daisy-chaining builds of the GHC Haskell compiler from source, several per core. A friend has a $10K, 8-core, 64 GB server that has been living half the time in the shop because of this stress test, which once caused his power supply to smoke and threaten fire in front of various amused witnesses. The shop has now taken this stress test in-house, to save delivery cycles. It appears, with a new Tyan motherboard, that all is nearly OK: 99% of the builds ("two nines") succeed, and no other test is capable of revealing any hardware issues.
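
Just to give the flavor of "daisy-chaining builds on each core", here is a minimal sketch in Python; it is an illustration, not my actual harness, and the source trees, build command, and run count are placeholders:

    # Sketch of a build-loop stress test: one worker per core, each worker
    # repeatedly rebuilds GHC in its own source tree and counts failures.
    # Paths and the build command below are placeholders, not a real recipe.
    import multiprocessing
    import subprocess

    RUNS_PER_CORE = 10
    BUILD_CMD = ["make", "clean", "all"]      # placeholder build command

    def worker(core):
        source_dir = f"/home/me/ghc-{core}"   # one tree per core, no collisions
        failures = 0
        for run in range(RUNS_PER_CORE):
            result = subprocess.run(BUILD_CMD, cwd=source_dir, capture_output=True)
            if result.returncode != 0:
                failures += 1
                print(f"core {core}, run {run}: build FAILED")
        return failures

    if __name__ == "__main__":
        cores = multiprocessing.cpu_count()
        with multiprocessing.Pool(cores) as pool:
            failed = sum(pool.map(worker, range(cores)))
        total = cores * RUNS_PER_CORE
        print(f"{total - failed}/{total} builds succeeded")

The point is just the shape of the harness: many simultaneous full rebuilds, repeated until something breaks.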

In contrast, we're amused to read on overclocking forums that it is someone's "policy" to accept less than 24 hours of Prime95 as proof of stability. We'd actually like as many "nines" as we can get for stability. This is not an argument against overclocking; one just has to understand overclocking differently. As I'm new to overclocking, others may have much better insights into how to balance these goals. I'm contributing what I know.

I dislike "panel discussions" where the moderator talks half the time. I'm paying rapt attention to any responses here, I'm just not planning to goal-tend unless I'm asked direct questions.

I'm not being judgemental here; were I a gamer, I would certainly have moved to phase change cooling by now, and I'd accept a system freeze every day or two if it bought me more thrills in between. Without gamers there simply wouldn't be a market for the motherboards I'm buying, or the overclocking expertise I'm relying on. For this I'm grateful.

What got me over the hump in learning to overclock were the articles here by Kris Boughton, which I thought were brilliant, hard to read, but ultimately a complete education in overclocking issues. Alas, I've come full circle, and for my purposes I'm in disagreement with some of their conclusions.

Load Line Calibration: I'm in complete agreement; monkeying with this is an unwarranted risk, costing various "nines" in stability.

tRD: I see at most a 0.5% effect on execution times for practical scientific computations when adjusting tRD between, say, 6 and 7. At 7 I get more "nines" of reliability. This choice seems a no-brainer.

CPU voltage: A few weeks ago, setting 1.28125 V in my BIOS was enough to keep my Q6600 stable at 3.2 GHz. This was, however, the minimum value I could use. After a few weeks away, with the box powered off, I could not boot into Linux at this voltage. The minimum voltage crept up over the next few days, now settling at 1.30625 V. I've heard of aging but didn't expect to see it so soon, as part of "breaking in". In any case, more "nines" of stability requires a margin of error here. Unless there's a "glasses will only make your eyes worse" argument here, that is, that all voltage increases feed into an unstable equilibrium? I don't know of such an argument.

In short, I'm arbitrarily settling on 3.20 GHz for 24/7 use, often at full load, but many reasonable people would accept the temps and voltages required for 3.30 GHz. In realistic benchmarks, relaxing my memory timings can be covered by a mere increase to 3.24 GHz, an arbitrary compensation I don't actually need to make.

The ideal balance between overclocking and stability would appear to be to relax memory timings, be a bit generous on voltages, and pick an arbitrary overclocking target that itself offers a decent margin of error.

I can see the intoxicating appeal of playing with memory timings for speed: basic overclocking goes from hard to utterly trivial once one learns what to do. Like windsurfers picking a trickier board, or BASE jumpers wearing wingsuits to soar parallel to the cliff, there has to be something harder to learn next, right? I'd say, use this knowledge in reverse to gain greater stability, and move on. For me, that's parallel software that can actually use my cores.
 

trexpesto

Golden Member
Jun 3, 2004
1,237
0
0
Burn-in, burn on, burn out! :evil:

Original OC on this Barton at 1.65 Volts was good for about a year, later had to increase to 1.7 Volts.

My meaningful statistic: It lasted to obsolescence, running peacefully at over 90% of maximum overclock.

You could say that a 3X00 would have been worth buying at a premium, with the extra cash amortized over all these years, but that cash DID NOT EXIST in 1923! I looked!
 

Drsignguy

Platinum Member
Mar 24, 2002
2,264
0
76
Originally posted by: trexpesto
Burn-in, burn on, burn out! :evil:

Original OC on this Barton at 1.65 Volts was good for about a year, later had to increase to 1.7 Volts.

My meaningful statistic: It lasted to obsolescence, running peacefully at over 90% of maximum overclock.

You could say that a 3X00 would have been worth buying at a premium, with the extra cash amortized over all these years, but that cash DID NOT EXIST in 1923! I looked!

Ironically, that was your 999th post!

 

Tweakin

Platinum Member
Feb 7, 2000
2,532
0
71
I like to take my chip up as far as it will go on stock volts, run at that value to establish 24-hour Prime/Orthos stability on small FFTs, then do another 24-hour run on Blend. Then I drop the FSB down 5% to allow for voltage creep, temperature changes, etc. Just my method.
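
For example, with made-up numbers and a 9x multiplier assumed:

    # Illustration of the 5% back-off (numbers invented, 9x multiplier assumed).
    max_stable_fsb = 378                  # MHz FSB that passed both 24-hour runs
    multiplier = 9
    daily_fsb = max_stable_fsb * 0.95     # back off 5% for creep and temps
    print(daily_fsb, daily_fsb * multiplier / 1000)   # ~359 MHz FSB, ~3.23 GHz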
 

lopri

Elite Member
Jul 27, 2002
13,329
709
126
Why stop at tRD=6 or 7? You can tighten it down to 5 or even 4! :D
 

Rubycon

Madame President
Aug 10, 2005
17,768
485
126
Bottom line is if you want maximum stability AND computational accuracy 'round the clock leave everything at default settings. Leave the margins up to the CPU manufacturer, not the end user; the former has the final word on the reliability of the silicon at its promised speed. :)
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Originally posted by: Rubycon
Bottom line is if you want maximum stability AND computational accuracy 'round the clock leave everything at default settings.

Yikes, can one say that in an overclocking forum?

If I don't actually trust others to make these decisions for me, how do I decide that the defaults are optimum, rather than some underclocked point?

Chip manufacture is a perfect example where one can detect errors, and it is optimal to accept an error rate. If each of my computations takes four weeks at stock speeds, or three weeks overclocked, and I can detect wrong answers, then I'm better off overclocking if "most" of my jobs will finish. However, if I can almost always get through a day but rarely get through three weeks, then I should overclock less.
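
To put numbers on that, a toy model: wrong answers are only caught at the end of a run, and a failed run is simply restarted, so the expected cost of one good result is the run time divided by the success rate.

    # Toy model: 3 weeks per attempt overclocked vs 4 weeks at stock,
    # failed attempts detected at the end and restarted from scratch.
    def expected_weeks(weeks_per_attempt, success_rate):
        return weeks_per_attempt / success_rate

    for p in (0.99, 0.90, 0.75, 0.50):
        print(f"overclocked, {p:.0%} success: {expected_weeks(3, p):.2f} weeks")
    print(f"stock: {expected_weeks(4, 1.0):.2f} weeks")
    # Overclocking wins whenever 3/p < 4, i.e. more than 75% of jobs finish.

At 99% success the overclocked box wins easily; below 75% it loses.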

The whole point of Erlang's "nine nines", and of computing in general, is to build reliable systems from unreliable parts. It can be optimal to overclock, if the system as a whole can detect and handle errors gracefully. I wouldn't say I should run at stock; I'd say I'm being way too conservative running at 3.2 GHz.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Syzygies
Originally posted by: Rubycon
Bottom line is if you want maximum stability AND computational accuracy 'round the clock leave everything at default settings.

Yikes, can one say that in an overclocking forum?

If I don't actually trust others to make these decisions for me, how do I decide that the defaults are optimum, rather than some underclocked point?

Chip manufacture is a perfect example where one can detect errors, and it is optimal to accept an error rate. If each of my computations takes four weeks at stock speeds, or three weeks overclocked, and I can detect wrong answers, then I'm better off overclocking if "most" of my jobs will finish. However, if I can almost always get through a day but rarely get through three weeks, then I should overclock less.

The whole point of Erlang's "nine nines", and of computing in general, is to build reliable systems from unreliable parts. It can be optimal to overclock, if the system as a whole can detect and handle errors gracefully. I wouldn't say I should run at stock; I'd say I'm being way too conservative running at 3.2 GHz.

The catch, of course, is detecting errors. On a single machine, you can't really do that, since the processor could lock up completely. It sounds like you're saying "do every task as fast as it can be done reliably"... if that's the case, I think you'd find the Razor architecture interesting. It overclocks in a way that allows errors to be detected, and adjusts the frequency and voltage so that a low but non-zero error rate is maintained (if the error rate were 0, you'd be leaving potential performance on the table; if it's too high, recovery becomes too expensive).
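
As a caricature of that feedback loop (in Python, with thresholds and step size invented for illustration; the real mechanism lives in hardware and also plays with frequency):

    # Caricature of a Razor-style controller: keep the observed error rate
    # inside a low but non-zero band by nudging the supply voltage.
    # Thresholds and step size are invented for illustration.
    TARGET_LOW, TARGET_HIGH = 1e-6, 1e-4    # acceptable error-rate band
    VSTEP = 0.0125                          # volts per adjustment

    def adjust_voltage(voltage, errors_seen, cycles_run):
        rate = errors_seen / cycles_run
        if rate > TARGET_HIGH:
            return voltage + VSTEP   # too many recoveries: add margin back
        if rate < TARGET_LOW:
            return voltage - VSTEP   # no errors at all: margin going to waste
        return voltage               # inside the band: leave it alone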

Edit: By the way, CPUs slow down as they age. Just because a CPU works at some frequency today doesn't mean it'll work at the same frequency in a few months or a couple of years. This throws a pretty big wrench into any plan that assumes a CPU that passes tests at 3.2 GHz today will continue to pass them in the future, no matter how thorough your testing today is.
 

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
I'm getting too old and my attention span is getting short; I only read the first paragraph. I think if you want server-grade stability, you have to test it continuously for weeks before putting it to work in production.