Here is my standard blurb on why ECC RAM can be important and how the
lack of it can shred the contents of your hard drive:
It is true that the average error rate of RAM has dropped
dramatically, and it has dropped faster than the number of RAM cells
in a system has increased. So even though you may have 2 gigabytes of
RAM now, you see fewer errors per year than you did with 64 megabytes
ten years ago.
But the rate is not zero. And the bigger problem is RAM physically
going bad. You can easily have one of the chips malfunction, or get
some dirt in the sockets, or run into other problems that make your
RAM return something you didn't write into it.
Now, why is that so bad? Obviously, random values and strings end up
floating around in your programs, so you can get wrong results from
calculations or segfaults from "bent" pointers.
But that's not the real issue. The real issue is that it can shred
the contents of your hard drive.
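To make the "bent" pointer case concrete, here is a minimal C sketch
(the flipped bit position is arbitrary; a real RAM error can hit any
bit):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(void)
    {
        int *p = malloc(sizeof *p);
        if (!p)
            return 1;
        *p = 42;

        /* Simulate a single-bit RAM error in the stored pointer value. */
        uintptr_t bent = (uintptr_t)p ^ ((uintptr_t)1 << 24);

        printf("good pointer: %p -> %d\n", (void *)p, *p);
        printf("bent pointer: %p\n", (void *)bent);

        /* Dereferencing the bent pointer very likely hits an unmapped
           address and segfaults:
               printf("%d\n", *(int *)bent);
           If the bent address happens to be mapped, you silently read
           or corrupt unrelated data instead, which is worse. */

        free(p);
        return 0;
    }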
Why can it shred the contents of your hard drive?
Because the filesystem buffer cache can, and will, be affected. The
operating system uses a certain amount of RAM as a cache for the
filesystem, and that can be quite a lot. A home PC with 1024 MB of
RAM running Windows and Office has plenty of RAM to spare, and
operating systems (Windows and Linux alike) use all the memory not
needed by applications as a cache for disk contents. Besides
"harmless" caches of read-only pages, such as the application's code
itself, there is also a write-back cache of filesystem contents.
The cache works by keeping a copy of a disk block together with a
value that indicates where on disk that block belongs.
Now, if you have bad memory, one of these blocks might get modified,
and the wrong data is eventually written to the disk.
That is the harmless case.
The harmful case is that the index saying where the block belongs is
damaged. If that happens, the right block gets written to the wrong
place on the disk.
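Here is a minimal C sketch of that failure mode. This is a toy model,
not how any real kernel structures its buffer cache; the struct and
names are made up for illustration:

    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    /* One buffer cache entry: a copy of a disk block plus the index
       that says where on disk it belongs. */
    struct buf {
        unsigned long blkno;            /* where this block belongs */
        unsigned char data[BLOCK_SIZE]; /* cached copy of the block */
    };

    /* Write-back: the destination is computed from blkno alone. */
    static void writeback(struct buf *b, FILE *disk)
    {
        fseek(disk, (long)(b->blkno * BLOCK_SIZE), SEEK_SET);
        fwrite(b->data, 1, BLOCK_SIZE, disk);
    }

    int main(void)
    {
        FILE *disk = tmpfile();         /* stand-in for the real disk */
        if (!disk)
            return 1;

        struct buf b = { .blkno = 7 };
        memset(b.data, 0xAB, BLOCK_SIZE);
        unsigned long intended = b.blkno;

        /* A single flipped bit in the index, not in the data... */
        b.blkno ^= 1UL << 4;

        /* ...and perfectly good data lands on the wrong block,
           clobbering whatever happened to live there. */
        writeback(&b, disk);
        printf("data for block %lu written to block %lu\n",
               intended, b.blkno);

        fclose(disk);
        return 0;
    }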
So you can instantly ruin (corrupt with random data) any file on the
filesystem, even files you haven't touched since the PC came up.
Any file.
And it gets worse. If the new faulty location overwrites a directory
entry, then you can lose any number of files in one snap. Or if it
hits an allocation table, you end up with totally wrong blocks in
some files. Or you can overwrite an allocation table holding the
blocks for a directory. Or you can kill the superblock or other
critical information, leaving you unable to mount the filesystem at
all.
Note that this is true even for filesystems that don't cache
directories and allocation tables, caching only file contents. It is
a file-content cache block that gets written over a directory or an
allocation table. The victim of the mishap, the target block,
doesn't have to be in the cache to become a victim.
So this is why I have ECC RAM.
For me as a software developer, ECC RAM is also a question of
necessary paranoia. If I get a segfault in my programs, I need to be
absolutely sure that it is caused by my application or by my kernel
changes. I need to be absolutely sure it is not hardware. It would
truly suck to spend weeks tracking down a memory corruption that
turns out to be a hardware problem.