SER impact on distributed computing

Kochab

Member
Apr 28, 2006
25
0
0
I've been a distributed computing volunteer for a few years now. After SETI@Home, new DC projects were developed, and thankfully there are now projects more practical than the "pie-in-the-sky" motive behind SETI. Don't get me wrong, I've put many hours into SETI and I like the project, but I think something like ClimatePrediction is much more practical. I recently started ClimatePrediction and was quite surprised to find out one work unit would take me 2,500 hours to complete vs. 2 hours for a SETI work unit.

SETI has a redundant result setup: they send the same work unit out to two different computers, and if there is a discrepancy they send out a third. This is a smart implementation, because there is no guarantee as to the end users' system stability, or even that they aren't faking results to get a higher rating (a problem in the early days). But CP doesn't do a double run, and on top of that, they are depending on long runs on consumer-level systems that have no ECC RAM. As such, the data integrity of CP is rather dubious. I brought this up on their own message board and dutifully got a response from the site administrator, but no real answer. You can read my thread here: http://www.climateprediction.net/board/viewtopic.php?t=4742
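For the curious, here is a minimal sketch of that "send two, compare, send a third on disagreement" quorum scheme. This is my own illustration with made-up names, not SETI's actual validator code:

# Hypothetical sketch of quorum validation, not SETI's real code.
from collections import Counter

def validate_results(results, quorum=2):
    """Return the canonical result once `quorum` identical copies agree,
    or None to signal that another copy of the WU must be sent out."""
    winner, count = Counter(results).most_common(1)[0]
    return winner if count >= quorum else None

print(validate_results(["0xA1B2", "0xA1B2"]))            # agree -> '0xA1B2'
print(validate_results(["0xA1B2", "0xFFFF"]))            # None -> send a third
print(validate_results(["0xA1B2", "0xFFFF", "0xA1B2"]))  # tie broken -> '0xA1B2'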

Most people are unaware of the SER (soft error rate) problem.

For your own edification on SER and its impact on distributed computing:
http://perc.nersc.gov/papers/assessfault-sc04.pdf

If you will note, it's a link to a .pdf. (I, myself, hate opening a link only to have Adobe's bloatware laboriously load up.)

It is a must-read! Here are some highlights:

Abstract:
"Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands and with proposed petaflop system likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications.

Using software fault injection, we simulated single bit memory errors, register file upsets and MPI message payload corruption and measured the behavioral responses for a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no user indicators. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected."
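That last point about assertions is easy to picture. Here's a toy example of my own (not from the paper) of the kind of cheap internal check they mean: verify the output against an invariant the program already knows, so a silent flip becomes a loud failure:

# Toy example of a "minimal internal check" (mine, not the paper's):
# after solving A*x = b, verify the residual. A soft error during the
# solve shows up as a large residual instead of silently wrong output.
import numpy as np

def solve_with_check(a, b):
    x = np.linalg.solve(a, b)
    residual = np.linalg.norm(a @ x - b)
    assert residual <= 1e-8 * np.linalg.norm(b), "residual check failed -- recompute"
    return x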


Other highlights of interest:
"...even using a conservative soft error rate, a system with 1 GB of RAM can expect a soft error every 10 days."

"...ECC does not eliminate all soft errors. Compaq reported that roughly 10 percent of errors are not caught by the on-chip ECC."

"Intel reported that the soft error rate for SRAMs increased thirty fold when the process technology shifted from 0.25 to 0.18 micron features and the supply voltage dropped from 2V to 1.6V."

"IBM showed that the soft error rate in Denver was ten times higher than that at sea level." [Denver is at 5,280 feet or 1,609 meters]

"Without hardware checksums, ECC memory and application specific error checking, soft errors, particularly on large systems, will trigger application crashes, hangs or incorrect results."

"Use of internal checks [software] is an important aspect of robust application implementation, but must be used wisely because excessive checks can still harm performance."

"...ABFT [Algorithm-Based Fault Tolerance] can detect almost all injected faults with only a ten percent performance penalty."
 

petrusbroder

Elite Member
Nov 28, 2004
13,343
1,138
126
Very interesting reading. Thanks, Kochab!
That is why redundancy is so important in DC: three or more computers checking the same WU ...
OTOH, there are other sources of error too: using different compilers for different applications (one for the Mac application, another for Linux, a third for Windows) introduces errors, and problems with the chips themselves (they are seldom produced without small flaws) add to it. So what do we have? A need to run the same WU on enough computers that the discrepancies can be seen and compensated for, either statistically (i.e. calculate the error rate and compensate for it) or by "majority decision". Projects which do not use redundancy are less reliable in their scientific results and thus need to crunch many more WUs to obtain the same validity as those which do ... so let's crunch on, knowing that the same results need to be crunched several times ...

I very much hope that the devs are thinking about this and developing appropriate error-detecting and error-correcting software ...
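The ABFT trick Kochab quoted is a concrete example of exactly that. Here is a rough sketch (mine, after the classic Huang/Abraham checksum idea, not code from the paper) of how a checksum-augmented matrix multiply detects a flip:

# Sketch of the ABFT idea (checksum-augmented matrix multiply);
# illustrative only, not the paper's implementation.
import numpy as np

def abft_matmul(a, b, tol=1e-8):
    a_aug = np.vstack([a, a.sum(axis=0)])                 # column-checksum row
    b_aug = np.hstack([b, b.sum(axis=1, keepdims=True)])  # row-checksum column
    c = a_aug @ b_aug
    body = c[:-1, :-1]                                    # this is a @ b
    # Fault-free, the last row/column of c are the checksums of `body`;
    # a bit flip during the multiply breaks that relationship.
    ok = (np.allclose(body.sum(axis=0), c[-1, :-1], atol=tol) and
          np.allclose(body.sum(axis=1), c[:-1, -1], atol=tol))
    if not ok:
        raise RuntimeError("ABFT checksum mismatch: recompute this block")
    return body

Detection costs only the extra checksum row and column, which would explain the small performance penalty the paper reports.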
 

Silverthorne

Golden Member
Jan 23, 2004
1,006
0
0
Originally posted by: Kochab

If you will note, it's a link to a .pdf. (I, myself, hate opening a link only to have Adobe's bloatware laboriously load up.)

Check out the Adobe Reader speedup on this page; it works great.
 

Insidious

Diamond Member
Oct 25, 2001
7,649
0
0
Great insight in that article.

It is a subject I have been griping about a little lately. Ironically, it is the BBC-CCE project that led to those gripes.

I don't believe the aspect of 'non-perfect computing' (if you will) is addressed well at all in some (if not all) of these DC projects.

I see applications with no ability to recover when an error occurs (just wait until you have 2 or 3 months under your belt with a BBC WU and it decides to go stupid... frustrating!)

I see evidence all over the place that these applications do not 'police themselves' even to the degree that the application knows it made a mistake (instead, today's strategy seems to be to simply send out 1000's of copies of the same work unit to compensate, which is extremely wasteful).

IMO, at least some of the problem is scientists who may be brilliant in their respective fields assuming that brilliance extends to their programming skills as well... (sorry, Doc... it doesn't look like it does)

Distributed Computing is a bit lacking on the computer end of the science, in my opinion.

-Sid
 

Kochab

Member
Apr 28, 2006
25
0
0
Originally posted by: Insidious

IMO, at least some of the problem is scientists who may be brilliant in their respective fields assuming that brilliance extends to their programming skills as well... (sorry, Doc... it doesn't look like it does)

I, too, have gotten the distinct odor of incompetence from CPDN: a noble project, but poorly implemented. I've put 120 hours into it and will stop until I build my next system with ECC (I'm sure I can just copy/paste the folder over; I'll double-check, though).

That doesn't change the situation for the majority of volunteers, though. CPDN ported this over from a mainframe program that's ~20 years old. Back then memory was more robust and less susceptible to soft errors, and mainframes use ECC anyway. This is a glaring weakness and they've got their heads in the sand: "it's just way too difficult to change the million+ lines of code" is their thinking, and meanwhile people are abandoning the project.

On a side note, two interesting memory technologies exist that are inherently rad-hard (that would make a great name for a band :p ): MRAM, and regular DRAM fabricated with SOI. MRAM is great because it's faster than current DRAM and it's non-volatile, so if you power off your comp you can turn it back on without the boot-up process! And SOI is good because you don't have to change the memory technology, just the fabrication process; it's also energy-efficient (though not as much as MRAM). Honeywell is currently producing both for the aerospace industry, and I want some!
 

Kochab

Member
Apr 28, 2006
25
0
0
MRAM sounds good!

Yeah! Once you know about MRAM, DRAM looks like nothing. According to its proponents it's nearly as fast as SRAM, low-power, non-volatile, rad-hard, and cheaper to manufacture. If you ask me, the "M" stands for "miracle."

Reality check: no one can produce it in any functional sizes yet; I think the biggest commercial product now is 256 KB. Also, I've read of slower speeds of ~200 MB/s, but that may be a different flavor of MRAM.

Here is an April 2000 article on an MRAM researcher (it's the same guy at IBM who brought us GMR read/write heads for hard drives): http://www.wired.com/wired/archive/8.04/mram_pr.html

and some general info: http://www.mram-info.com/