Why isn't ECC memory used more?

Phynaz · Mar 29, 2012

tynopik said:
a weak cell that isn't always read reliably

Such a device would fail validation and would not be installed / offered for sale, as presented in this presentation from Samsung.
http://books.google.com/books?id=9G...ce=gbs_ge_summary_r&cad=0#v=onepage&q&f=false

Or in this paper on enterprise class memory testing:
http://www.smartm.com/files/salesLiterature/dram/smart_whitepaper_sbe.pdf

If one of these "weak cells" does get by testing, correcting it doesn't do you any good because the entire module is at high risk of latent failure and should be replaced.

I'd be interested in any references you can provide that states ECC memory is designed to correct hardware failure.

NP Complete · Mar 29, 2012

Cerb said:
I'm not sure about Ye Old Days, but with Chipkill and equivalents, that should not be the case.

Quite true - depending on the error correction (including Chipkill) errors in the ECC codes themselves aren't fatal.

However, not all ECC memories implement Chipkill, a metric I believe the google DRAM study addresses - http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

From memory (my own), the chipkill chips had a lower incidence of uncorrectable errors, while those with "other" ECC mechanisms had a higher rate of uncorrectable errors.

Basic point - ECC isn't always a guarentee of single bit error immunity (but it can be).

VirtualLarry · Mar 29, 2012

NP Complete said:
Just for fun - what happens if your single bit error is in your ECC code? ECC codes are stored in memory like the rest of the data and aren't immune to corruption.

If you have a single bit error in your ECC code, it'll likely show up as an uncorrectable error to the OS.

This is not true. There is a mathematical relationship between all of the bits in an ECC word.

tynopik · Mar 29, 2012

Phynaz said:
Such a device would fail validation and would not be installed / offered for sale, as presented in this presentation from Samsung.

memory degrades over time, so what passes validation initially might present problems later

Phynaz said:
I'd be interested in any references you can provide that states ECC memory is designed to correct hardware failure.

please be precise. what 'hardware failure' are you talking about exactly?

tynopik · Mar 29, 2012

NP Complete said:
How is ECC "not optional" like hyperthreading, or even more cores on a CUDA chip?

To use the car analogy, hyperthreading or more CUDA cores give you more power, so are like a bigger engine. There is no requirement that your car comes with a 300hp engine. Manufacturers can offer models with 300hp, 200hp, 150hp and even 80hp at different price points. Nothing wrong with that.

However even the 80hp model is required to pass safety tests.

NP Complete said:
Your statement hyperbolizes the issue - you make it seem like without ECC computers will be crashing every day, and documents will "rot" into an unusable state quickly.

I never said they would crash every day.

NP Complete said:
Companies actually do perform quality analysis on the products they ship - if fatal errors were as high as you seem to indicate

you mean as high as real world testing shows they are?

NP Complete said:
then most companies wouldn't be stupid enough to ship the product.

exactly!

NP Complete said:
So what if I get a single pixel corrupted in my picture album? So what if my OS becomes unstable every 2-3 years? I'm willing to re-install my OS, or re-touch up my pictures occasionally if it means I can upgrade my computer more often.

So what if your resume has an embarrassing typo, so what if your tax return has a wrong number?

NP Complete said:
Would I store nuclear launch codes on a consumer-grade desktop computer? Hell no,

People still store vital and critical information on their computers and they should be assured that it won't be corrupted.

NP Complete said:
If you want ECC, and want the extra assurance ECC provides, have at it!

It should be mandatory. Consumers should never have to know or care what ECC is, they should just know that their computers work and are reliable.

Just like consumers should never have to understand what makes for good gas tank design, they should just know that it won't explode in an accident.

Safety is not an option.

NP Complete said:
I also hope you're follow good backup procedures since even ECC has uncorrectable error rates of a few hundred per year, which can result in data loss.

In case of an uncorrectable error, ECC prevents bad data from being written to disk in the first place

tynopik · Mar 29, 2012

NP Complete said:
Just for fun - what happens if your single bit error is in your ECC code? ECC codes are stored in memory like the rest of the data and aren't immune to corruption.

http://en.wikipedia.org/wiki/SECDED

If only one parity bit indicates an error, the parity bit itself is in error.

Thus errors in the ECC code can be detected and discarded.

Phynaz · Mar 29, 2012

tynopik said:
memory degrades over time, so what passes validation initially might present problems later

please be precise. what 'hardware failure' are you talking about exactly?

You know exactly what I mean.

Memory hardware failure.

tynopik · Mar 29, 2012

Phynaz said:
You know exactly what I mean.

Memory hardware failure.

no i have no idea what you mean

that's like saying 'car failure', it could mean anything

cells can become weak, meaning that sometimes they're read incorrectly

bits can become stuck so they're always the same value

the soldering on a chip might go bad causing intermittent problems

you might take a chisel and knock a chip off a dimm

all those are covered under 'memory hardware failure', it's so generic as to be meaningless

Phynaz · Mar 29, 2012

Others here have shown you definitive evidence of what ECC is designed to do.

You refuse to post evidence of what you say. I wonder why?

tynopik · Mar 29, 2012

Phynaz said:
Others here have shown you definitive evidence of what ECC is designed to do.

You refuse to post evidence of what you say. I wonder why?

stop speaking in riddles. what exactly am i saying that you disagree with? that ECC will correct single-bit errors? that ECC will detect double-bit errors? That ECC memory doesn't silently fail like non-ECC memory?

tynopik · Mar 29, 2012

Phynaz said:
If one of these "weak cells" does get by testing, correcting it doesn't do you any good because the entire module is at high risk of latent failure and should be replaced.

ECC isn't JUST about correcting errors, it's also about DETECTING errors in the first place.

You're exactly right that a module with one failure is likely to have more. The Google paper agrees, errors are highly correlated.

ECC provides you with 'early warning' that these modules need to be replaced.

Without ECC the errors will get worse and worse and your system will get more and more unstable and more and more data will be corrupted and you'll have no clue as to why.

With ECC you'll know there's a problem with the memory and can get that module replaced before it gets too bad.

fuzzymath10 · Mar 29, 2012

Considering the number of memory modules out there, I think in aggregate the rate of occurrence of defective modules is low enough that nobody cares. ECC is not a magical solution; if too many errors occur, ECC might mistakenly think it's legitimate; this is extremely improbable. Hard drive specs usually state an error rate corresponding to the probability that enough bits are flipped that the ECC cannot even detect the error and it's something like 1 in 10^18 (can't remember for sure).

I studied basic coding theory as an undergrad and I'm pretty sure that error correcting codes all have a cost; you need extra bits to be able to detect and/or correct errors (and how they're utilized characterizes each different coding method; some codes are much more efficient than others) and this should increase the raw materials required to produce an ECC module, even though the market price could fluctuate.

The vast majority of errors are caused by crappy software, running hardware out of spec (e.g. messing with voltages, overclocking, etc.), or some random hardware incompatibility that ECC would not fix. The feature IS available if it is worth the money to you, and if not, there's always AMD.

tynopik · Mar 30, 2012

fuzzymath10 said:
The feature IS available if it is worth the money to you, and if not, there's always AMD.

Consumers shouldn't have to know anything about coding theory to be confident that their computer is not corrupting data.

It's the same reason there are government standards for food safety that restaurants are required to adhere to and that there are government standards for car safety that auto manufacturers have to adhere to.

We don't say, 'Well, if not getting salmonella is important to you, you can go to the restaurant with independent certification that it's clean.' No! All consumers at all restaurants deserve to be safe. There is no option.

Without oversight, companies have an economic incentive to cut corners and if a company can save 2 cents by allowing your tax return to get corrupted, they'll do it 10 times out of 10.

Here, Intel needs to take the position of 'the government' by REQUIRING ECC memory for ALL their chips. They shouldn't even allow companies the option to cut corners in this regard. This creates a level playing field for companies to compete on grounds other than safety.

paperwastage · Mar 30, 2012

^ nope...

its about tradeoffs

do you want to pay $100 for 99% uptime... or $100,000 for 99.999% uptime?

http://www.itsmsolutions.com/newsletters/DITYvol2iss47.htm

http://vijaygill.wordpress.com/2010/11/10/nines/

Ben90 · Mar 30, 2012

tynopik said:
huiopwetopndknjwerg;njwegl;jknwegmn3;oin3nmdt

WEhk
yk;lj4iop5jyi9ohj6h
w4hihjjjhj

Hmmm weird. I think your UDP packet got corrupted on the way to the server. We should use TCP links all the time because its more reliable.

tynopik · Mar 30, 2012

paperwastage said:
^ nope...

its about tradeoffs

do you want to pay $100 for 99% uptime... or $100,000 for 99.999% uptime?

1. ECC doesn't cost $100 extra
2. You're confusing uptime and correctness

Consumers don't need five 9 uptime so paying for dual powersupplies and similar redundancy doesn't make sense.

But they still need data integrity. If the computer is running, it should not be corrupting their data.

tynopik · Mar 30, 2012

Ben90 said:
Hmmm weird. I think your UDP packet got corrupted on the way to the server. We should use TCP links all the time because its more reliable.

UDP is used where data loss is acceptable

For many applications on a computer, data loss is not acceptable.

Saying that because there are SOME network applications where data loss is acceptable, therefore ALL computer applications are ok with data corruption is just nonsense.

Ben90 · Mar 30, 2012

Yes its cherry picked.

anikhtos · Mar 30, 2012

i agree with tynopic on these
okey ecc is not magic
ecc can not correct everything
but frankly if ecc was that crap as you apear to be why servers bother to use it??
so ecc is good for something
and for crying out loud noone said that a consumer pc must be up 100% as a server.
but if a small cost of 20-30 euro or ven less in cost of memory
can ensure a more stable pc
and prevent data corruption then yes it must be enforced as minimal.
after all if consumer really know that there is a chance of having their data corrupted by ram i think they would be more sceptical about the computers and storing valuable data.
well even a single bit eror can make a doc not to open.

we always talk about cost and cost and cost
it reminds me when new mobo get off the floppy
and before you start if you want to build a system with winXP
you need to feed with drivers so you need a floppy to go along
so i buy a new 200 euro mobo with no floppy they saved what 1 euro???
and i have to pay 40 euro for a usb floppy
so the company made one more euro and i lost 40
wow
and if ecc ram will be the norm then there will not be a non ecc ram to compare and say ohhhhhh that is 10 euro cheaper
so there will not be a price penalty to consider
and such a small price to change attitude can go unnoticed (well in these forum everyone is having so much expensive hardaware and we are debating for 10 euro?)
my thoughts

CLite · Mar 30, 2012

Ben90 said:
Yes its cherry picked.

http://www.microway.com/pdfs/TeslaC2050-Fermi-Performance.pdf

No shit it's cherry picked and the ECC-on/ECC-off are not the same cards, the ECC-on card has less memory (about 12% less). The 4th slide shows cuFFT 3.1 which has ECC-on/ECC-off almost the same, you linked an outdated driver (cuFFT 3.0) with overall worse performance. So either you are oblivious or you are disingenuous.

tynopik · Mar 30, 2012

CLite said:
No shit it's cherry picked and the ECC-on/ECC-off are not the same cards, the ECC-on card has less memory (about 12% less).

Agree with the point that with new 3.1 library, there's only marginal difference, but they are actually the same card

It's just that with ECC-on it devotes a certain percentage of memory to storing checksumming data. This is different than the computer arrangement that uses an extra chip to hold the checksumming data so you don't lose capacity (although you do pay slightly more)

NP Complete · Mar 30, 2012

I think I'll have to agree to disagree with you, tynopik. I agree with most of your points (memory errors are bad, ECC either detects for fixes most of these errors, etc), however I still believe you're over emphasize both the chance of corruption (you've provided no hard numbers) and that the chance of corruption vs extra cost is worth it.

There are many devices in the a computer that can cause corruption: the SRAM and registers in the CPU are by no means immune to bit flips. Hard drives are not immune to bit rot. Network cards, cables and routers can corrupt information. In all these cases steps are taken to minimize the chance of corruption, but there is no 100% error free part. I'd assert that non-ECC memory in current usage models has an acceptable error rate when compared to the error rates of other consumer grade parts. A transient error in the CPU, network card, or hard drive can be just as devestating for data integrity as problems with memory.

Also, if you read the google paper, you'll see that not all ECC dram chips use have the same rate for uncorrectable errors. The paper specifically mentions chipkill vs non-chipkill memory. Once again, ECC implementations are all about cost vs benefit. Being able to correct more errors requires more bits - and more cost.

If ECC were "free", then by all means I'd advocate putting it in all devices. But it's not - there's cost both with manufacturing and with development. You make some good points, but your position of "ECC for everyone" without providing a cost/benefit analysis at even a basic level (e.g. "I'd pay $100 to guarentee no corruption") makes your points come across as zealotry and make it difficult to have a rational discussion.

tynopik · Mar 30, 2012

NP Complete said:
however I still believe you're over emphasize both the chance of corruption (you've provided no hard numbers)

? the google study is full of hard numbers

over 8% of all DIMMs suffer at least one error in a year

and those that have at least one error usually have more than one

and that's server-grade memory. I dare say consumer memory is even worse.

NP Complete said:
and that the chance of corruption vs extra cost is worth it.

you sound like one of those morbid risk assessment people for car manufacturers who compare the cost X of fixing a defect vs cost Y of settling the lawsuits from people they kill and maim.

NP Complete said:
There are many devices in the a computer that can cause corruption: the SRAM and registers in the CPU are by no means immune to bit flips. Hard drives are not immune to bit rot. Network cards, cables and routers can corrupt information. In all these cases steps are taken to minimize the chance of corruption, but there is no 100% error free part.

Obviously it can never be perfect, but that doesn't mean it can't be SUBSTANTIALLY better.

Saying "well we can't make it 100% better, so we shouldn't try to make it 50% better" is a copout

ECC memory is the most cost-effective way to further increase the data integrity of modern systems.

NP Complete said:
Also, if you read the google paper, you'll see that not all ECC dram chips use have the same rate for uncorrectable errors. The paper specifically mentions chipkill vs non-chipkill memory.

Again, it's not just about correcting errors, it's also about detecting errors. Even if ECC can't correct a given error, it can still detect it and prevent it from being written to disk. Silent data corruption is a huge issue and ECC resolves it.

NP Complete said:
Once again, ECC implementations are all about cost vs benefit. Being able to correct more errors requires more bits - and more cost.

The cost of ECC is so minimal that if it was made mandatory, most consumers would never even be able to tell a difference.

NP Complete said:
If ECC were "free", then by all means I'd advocate putting it in all devices.

it's so cheap, it essentially is

NP Complete said:
You make some good points, but your position of "ECC for everyone" without providing a cost/benefit analysis at even a basic level (e.g. "I'd pay $100 to guarentee no corruption") makes your points come across as zealotry and make it difficult to have a rational discussion.

Cost/benefit for who?

For the manufacturers? Of course it's not worth it for the manufacturers (at least in the short term). If you sent in the wrong taxes and now have to pay a huge penalty, what do they care? They saved $5 and that's all that matters.

On the other hand if you don't get your dream job because your resume suffered an embarrassing glitch, what is the cost to you?

NP Complete · Mar 30, 2012

Tynopik, your passion has inspired me to settle this issue with science! I'm going to write up a program and run it over the course of a couple days on whatever system I can access (if I'm lucky, I may be able to sneak it into my works test lab and snag a few hundred machines for a few hours - I don't think they have ECC on all of our test machines

Wikipedia has some breakouts of DRAM error rates that seem to be extrapolated from the google paper - the number I've seen is 3-10x10^-9 bit errors/hour: http://en.wikipedia.org/wiki/ECC_memory - this roughly translates into 1 error per GB per hour at the exterme end! If there were a 1 to 1 correlation between memory errors and corrupted documents, this would be horrible indeed (something I would gladly pay to avoid). However, I bet that the observed error rate of propogated errors is actually far lower.

Not all data corruption is equal: if my screen buffer has a bad pixel, the effect is transient and I don't care (especially because the "problem" will likely be fixed in the next screen refresh). If, say, my resume gets corrupted after I edit it, that's much worse. Operating systems, programs, codecs, often employ checksums and the like to determine data integrity as well, which would reduce the chance for catastrophic errors.

I'd purpose the following program:
1) Read data from specified source (disk or network)
2) Check the data is correct
3) Compute next value (e.g. increment value)
3) Write data to source (disk or network)
Essentially a variation on memcheck, but adding all the other subsystems in as well. I'm interested in the question: how often does data get corrupted on a consumer's system in an actual real life scenario? Do you have any suggestions to improve the experimental method?

I'll do a little research first, and see if there is anything published about overall corruption rates.

tynopik · Mar 30, 2012

NP Complete said:
Not all data corruption is equal: if my screen buffer has a bad pixel, the effect is transient and I don't care (especially because the "problem" will likely be fixed in the next screen refresh). If, say, my resume gets corrupted after I edit it, that's much worse.

The likelihood of an error in memory resulting in data corruption is dependent on several factors.

Obviously the size of the document is the most important. The more data being written to disk, the more likely it is to get hit by an error.

But errors can also hit the program and corrupt the saving routine or they can impact the OS itself and the various IO routines there.

What would be a representative test? I have no idea.

Something that might stress the systems would be to compress a large amount of files to an archive then extract the files to a different directory and run a file compare. Delete the archive and the extracted folders and repeat.

Of course this doesn't distinguish between memory, cpu or disk errors, but it gives some idea of the total error rates systems might experience.

Why isn't ECC memory used more?

Lifer

Member

No Lifer

Diamond Member

Diamond Member

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Senior member

Diamond Member

Golden Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Senior member

Golden Member

Diamond Member

Member

Diamond Member

Member

Diamond Member