I've been a distributed computing volunteer for a few years now. After SETI@Home, a wave of new DC projects was developed, and thankfully there are now more practical projects than the "pie-in-the-sky" motive behind SETI. Don't get me wrong, I've put many hours into SETI and I like the project, but I think something like ClimatePrediction is much more practical. I recently started ClimatePrediction and was quite surprised to find out that one work unit would take me 2,500 hours to complete, versus 2 hours for a SETI work unit.
SETI has a redundant-result setup: it sends the same work unit out to two different computers, and if there is a discrepancy between the returned results it sends the unit out to a third. This is a smart implementation, because there is no guarantee of the end users' system stability, or even that they aren't faking results to get a higher rating (a problem in the early days). ClimatePrediction, on the other hand, doesn't do a double run, and on top of that it depends on very long runs on consumer-level systems that have no ECC RAM. As such, the data integrity of CP is rather dubious. I brought this up on their own message board and dutifully got a response from the site administrator, but no real answer. You can read my thread here: http://www.climateprediction.net/board/viewtopic.php?t=4742
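Just to make the redundancy idea concrete, here is a minimal sketch of a quorum-of-two validator with a third-host tie-breaker. This is purely my own illustration, not SETI's or BOINC's actual code (the real validator also handles fuzzy floating-point comparison, credit granting, host reliability tracking, and so on), and the function names are hypothetical:

# Minimal sketch of a redundant-result scheme like the one described above.
# Purely illustrative; the real SETI/BOINC validator is considerably more involved.

def results_match(a, b, tolerance=1e-6):
    """Treat two result vectors as equal if they agree element-wise within a tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))

def validate_work_unit(get_result_from_next_host):
    """Send the same work unit to two hosts; on disagreement, ask a third
    host and accept whichever result two of the three agree on."""
    first = get_result_from_next_host()
    second = get_result_from_next_host()
    if results_match(first, second):
        return first                        # quorum of two reached
    third = get_result_from_next_host()     # tie-breaker
    if results_match(third, first):
        return first
    if results_match(third, second):
        return second
    return None                             # no quorum: re-issue the work unit

if __name__ == "__main__":
    # Simulated hosts: the second host returns a corrupted result.
    canned = iter([[1.0, 2.0, 3.0], [1.0, 2.5, 3.0], [1.0, 2.0, 3.0]])
    print(validate_work_unit(lambda: next(canned)))   # -> [1.0, 2.0, 3.0]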
Most people are unaware of the SER (soft error rate) problem.
For your own edification on SER and its impact on distributed computing:
http://perc.nersc.gov/papers/assessfault-sc04.pdf
Note that it's a link to a .pdf. (I, myself, hate opening a link only to have the Adobe bloatware laboriously load up.)
It is a must-read! Here are some highlights:
From the abstract:
"Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands and with proposed petaflop system likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications.
Using software fault injection, we simulated single bit memory errors, register file upsets and MPI message payload corruption and measured the behavioral responses for a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no user indicators. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected."
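To make "single bit memory errors" concrete, here is a toy example of my own (not the paper's fault injector): a soft error is literally one flipped bit, and depending on which bit of a 64-bit float gets hit, the damage ranges from invisible noise to an answer that is wrong by hundreds of orders of magnitude, with no crash and no warning.

# Toy illustration of a single-bit upset in a 64-bit float (my own example,
# not the paper's fault injector). A flipped mantissa bit is noise; a flipped
# exponent bit silently changes the value enormously.
import struct

def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE-754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return corrupted

if __name__ == "__main__":
    temperature = 288.15   # some intermediate value in a long climate run (kelvin)
    for bit in (0, 30, 52, 62):
        print(f"bit {bit:2d} flipped: {flip_bit(temperature, bit)!r}")
    # Low mantissa bits barely move the value; an exponent bit puts it off by
    # many orders of magnitude, with no crash, no exception, and no warning.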
Other highlights of interest:
"...even using a conservative soft error rate, a system with 1 GB of RAM can expect a soft error every 10 days."
"...ECC does not eliminate all soft errors. Compaq reported that roughly 10 percent of errors are not caught by the on-chip ECC."
"Intel reported that the soft error rate for SRAMs increased thirty fold when the process technology shifted from 0.25 to 0.18 micron features and the supply voltage dropped from 2V to 1.6V."
"IBM showed that the soft error rate in Denver was ten times higher than that at sea level." [Denver is at 5,280 feet or 1,609 meters]
"Without hardware checksums, ECC memory and application specific error checking, soft errors, particularly on large systems, will trigger application crashes, hangs or incorrect results."
"Use of internal checks [software] is an important aspect of robust application implementation, but must be used wisely because excessive checks can still harm performance."
"...ABFT [Algorithm-Based Fault Tolerance] can detect almost all injected faults with only a ten percent performance penalty."