Why isn't ECC memory used more?


Phynaz

Lifer
Mar 13, 2006
10,140
819
126
a weak cell that isn't always read reliably

Such a device would fail validation and would not be installed or offered for sale, as described in this presentation from Samsung:
http://books.google.com/books?id=9G...ce=gbs_ge_summary_r&cad=0#v=onepage&q&f=false

Or in this paper on enterprise class memory testing:
http://www.smartm.com/files/salesLiterature/dram/smart_whitepaper_sbe.pdf

If one of these "weak cells" does get by testing, correcting it doesn't do you any good because the entire module is at high risk of latent failure and should be replaced.

I'd be interested in any references you can provide that state ECC memory is designed to correct hardware failure.
 

NP Complete

Member
Jul 16, 2010
57
0
0
I'm not sure about Ye Old Days, but with Chipkill and equivalents, that should not be the case.

Quite true - depending on the error correction (including Chipkill) errors in the ECC codes themselves aren't fatal.

However, not all ECC memory implements Chipkill, a distinction I believe the Google DRAM study addresses - http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

From memory (my own), the chipkill chips had a lower incidence of uncorrectable errors, while those with "other" ECC mechanisms had a higher rate of uncorrectable errors.

Basic point - ECC isn't always a guarantee of single-bit error immunity (but it can be).
 

VirtualLarry

No Lifer
Aug 25, 2001
56,570
10,204
126
Just for fun - what happens if your single bit error is in your ECC code? ECC codes are stored in memory like the rest of the data and aren't immune to corruption.

If you have a single bit error in your ECC code, it'll likely show up as an uncorrectable error to the OS.

This is not true. There is a mathematical relationship between all of the bits in an ECC word.
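To illustrate that relationship with a toy example (a Hamming(7,4) code in Python, not the actual (72,64) SECDED arrangement real ECC DIMMs use): the check bits live inside the same codeword as the data bits, so a flipped check bit produces a nonzero syndrome and gets corrected just like a flipped data bit.

```python
# Minimal Hamming(7,4) sketch: parity bits sit at positions 1, 2 and 4 of the
# codeword, so an error in a parity bit is located and fixed the same way an
# error in a data bit is. Illustrative only, not how DDR SECDED hardware is wired.

def encode(data_bits):
    """data_bits: list of 4 ints (0/1) -> 7-bit codeword (positions 1..7)."""
    code = [0] * 8                      # index 0 unused, positions 1..7
    for pos, bit in zip((3, 5, 6, 7), data_bits):
        code[pos] = bit
    for p in (1, 2, 4):                 # parity bit p covers positions whose index has bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    return code[1:]

def correct(codeword):
    """codeword: list of 7 bits -> (corrected codeword, flipped position or 0)."""
    code = [0] + list(codeword)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome += p
    if syndrome:                        # syndrome points at the bad position,
        code[syndrome] ^= 1             # whether it's a data bit or a parity bit
    return code[1:], syndrome

word = encode([1, 0, 1, 1])
bad = list(word)
bad[0] ^= 1                             # flip position 1: a *parity* bit
fixed, where = correct(bad)
print(word, fixed, where)               # the corrupted parity bit is located and fixed
```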
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
Such a device would fail validation and would not be installed / offered for sale, as presented in this presentation from Samsung.

memory degrades over time, so what passes validation initially might present problems later

I'd be interested in any references you can provide that states ECC memory is designed to correct hardware failure.

please be precise. what 'hardware failure' are you talking about exactly?
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
How is ECC "not optional" like hyperthreading, or even more cores on a CUDA chip?

To use the car analogy, hyperthreading or more CUDA cores give you more power, so they're like a bigger engine. There is no requirement that your car comes with a 300hp engine. Manufacturers can offer models with 300hp, 200hp, 150hp and even 80hp at different price points. Nothing wrong with that.

However even the 80hp model is required to pass safety tests.

Your statement hyperbolizes the issue - you make it seem like without ECC computers will be crashing every day, and documents will "rot" into an unusable state quickly.

I never said they would crash every day.

Companies actually do perform quality analysis on the products they ship - if fatal errors were as high as you seem to indicate

you mean as high as real world testing shows they are?

then most companies wouldn't be stupid enough to ship the product.

exactly!

So what if I get a single pixel corrupted in my picture album? So what if my OS becomes unstable every 2-3 years? I'm willing to re-install my OS, or re-touch up my pictures occasionally if it means I can upgrade my computer more often.

So what if your resume has an embarrassing typo, so what if your tax return has a wrong number?

Would I store nuclear launch codes on a consumer-grade desktop computer? Hell no,

People still store vital and critical information on their computers and they should be assured that it won't be corrupted.

If you want ECC, and want the extra assurance ECC provides, have at it!

It should be mandatory. Consumers should never have to know or care what ECC is, they should just know that their computers work and are reliable.

Just like consumers should never have to understand what makes for good gas tank design, they should just know that it won't explode in an accident.

Safety is not an option.

I also hope you're following good backup procedures since even ECC has uncorrectable error rates of a few hundred per year, which can result in data loss.

In the case of an uncorrectable error, ECC prevents the bad data from being written to disk in the first place.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
memory degrades over time, so what passes validation initially might present problems later



please be precise. what 'hardware failure' are you talking about exactly?

You know exactly what I mean.

Memory hardware failure.
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
You know exactly what I mean.

Memory hardware failure.

no i have no idea what you mean

that's like saying 'car failure', it could mean anything

cells can become weak, meaning that sometimes they're read incorrectly

bits can become stuck so they're always the same value

the soldering on a chip might go bad causing intermittent problems

you might take a chisel and knock a chip off a dimm

all those are covered under 'memory hardware failure', it's so generic as to be meaningless
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Others here have shown you definitive evidence of what ECC is designed to do.

You refuse to post evidence of what you say. I wonder why?
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
Others here have shown you definitive evidence of what ECC is designed to do.

You refuse to post evidence of what you say. I wonder why?

stop speaking in riddles. what exactly am i saying that you disagree with? that ECC will correct single-bit errors? that ECC will detect double-bit errors? That ECC memory doesn't silently fail like non-ECC memory?
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
If one of these "weak cells" does get by testing, correcting it doesn't do you any good because the entire module is at high risk of latent failure and should be replaced.

ECC isn't JUST about correcting errors, it's also about DETECTING errors in the first place.

You're exactly right that a module with one failure is likely to have more. The Google paper agrees, errors are highly correlated.

ECC provides you with 'early warning' that these modules need to be replaced.

Without ECC the errors will get worse and worse and your system will get more and more unstable and more and more data will be corrupted and you'll have no clue as to why.

With ECC you'll know there's a problem with the memory and can get that module replaced before it gets too bad.
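As a rough sketch of what that "early warning" can look like in practice: on a Linux box with the kernel's EDAC drivers loaded, the per-memory-controller corrected and uncorrectable error counters can be polled from sysfs. The threshold below is arbitrary and purely illustrative.

```python
# Poll EDAC error counters (ce_count = corrected errors, ue_count = uncorrectable)
# exposed under /sys/devices/system/edac/mc/ and flag modules that look suspect.
import glob, os

CE_WARN_THRESHOLD = 10   # arbitrary example threshold for "schedule a replacement"

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
    try:
        ce = int(open(os.path.join(mc, "ce_count")).read())
        ue = int(open(os.path.join(mc, "ue_count")).read())
    except OSError:
        continue
    status = "OK"
    if ue > 0:
        status = "uncorrectable errors seen - replace now"
    elif ce > CE_WARN_THRESHOLD:
        status = "corrected errors accumulating - schedule replacement"
    print(f"{os.path.basename(mc)}: ce={ce} ue={ue} -> {status}")
```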
 

fuzzymath10

Senior member
Feb 17, 2010
520
2
81
Considering the number of memory modules out there, I think in aggregate the rate of occurrence of defective modules is low enough that nobody cares. ECC is not a magical solution; if too many errors occur, ECC might mistakenly accept the corrupted data as legitimate, though this is extremely improbable. Hard drive specs usually state an error rate corresponding to the probability that enough bits are flipped that the ECC cannot even detect the error, and it's something like 1 in 10^18 (can't remember for sure).

I studied basic coding theory as an undergrad, and I'm pretty sure that error correcting codes all have a cost: you need extra bits to be able to detect and/or correct errors (how those bits are used characterizes each coding method, and some codes are much more efficient than others). Those extra bits increase the raw materials required to produce an ECC module, even though the market price can fluctuate.
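A quick sketch of that overhead, using the textbook Hamming/SECDED bound (this is the general relationship, not a claim about any particular module):

```python
# For a single-error-correcting Hamming code over m data bits you need the
# smallest r with 2**r >= m + r + 1, plus one more overall parity bit for
# double-error *detection* (SECDED).

def secded_check_bits(m: int) -> int:
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r + 1

for m in (8, 16, 32, 64, 128):
    r = secded_check_bits(m)
    print(f"{m:3d} data bits -> {r} check bits ({r / m:.1%} overhead)")

# 64 data bits need 8 check bits (12.5% overhead), which is why a typical ECC
# DIMM is 72 bits wide and (with x8 chips) carries one extra DRAM chip per eight.
```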

The vast majority of errors are caused by crappy software, running hardware out of spec (e.g. messing with voltages, overclocking, etc.), or some random hardware incompatibility that ECC would not fix. The feature IS available if it is worth the money to you, and if not, there's always AMD.
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
The feature IS available if it is worth the money to you, and if not, there's always AMD.

Consumers shouldn't have to know anything about coding theory to be confident that their computer is not corrupting data.

It's the same reason there are government standards for food safety that restaurants are required to adhere to and that there are government standards for car safety that auto manufacturers have to adhere to.

We don't say, 'Well, if not getting salmonella is important to you, you can go to the restaurant with independent certification that it's clean.' No! All consumers at all restaurants deserve to be safe. There is no option.

Without oversight, companies have an economic incentive to cut corners and if a company can save 2 cents by allowing your tax return to get corrupted, they'll do it 10 times out of 10.

Here, Intel needs to take the position of 'the government' by REQUIRING ECC memory for ALL their chips. They shouldn't even allow companies the option to cut corners in this regard. This creates a level playing field for companies to compete on grounds other than safety.
 

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
huiopwetopndknjwerg;njwegl;jknwegmn3;oin3nmdt

WEhk
yk;lj4iop5jyi9ohj6h
w4hihjjjhj
Hmmm weird. I think your UDP packet got corrupted on the way to the server. We should use TCP links all the time because it's more reliable.
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
^ nope...

it's about tradeoffs

do you want to pay $100 for 99% uptime... or $100,000 for 99.999% uptime?

1. ECC doesn't cost $100 extra
2. You're confusing uptime and correctness

Consumers don't need five-nines uptime, so paying for dual power supplies and similar redundancy doesn't make sense.

But they still need data integrity. If the computer is running, it should not be corrupting their data.
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
Hmmm weird. I think your UDP packet got corrupted on the way to the server. We should use TCP links all the time because it's more reliable.

UDP is used where data loss is acceptable

For many applications on a computer, data loss is not acceptable.

Saying that because there are SOME network applications where data loss is acceptable, therefore ALL computer applications are ok with data corruption is just nonsense.
 

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
Yes, it's cherry picked.
[attached image]
 

anikhtos

Senior member
May 1, 2011
289
1
0
i agree with tynopik on this
okay, ecc is not magic
ecc can not correct everything
but frankly, if ecc were as useless as you make it out to be, why do servers bother to use it?
so ecc is good for something
and for crying out loud, no one said that a consumer pc must have 100% uptime like a server.
but if a small cost of 20-30 euro or even less in memory
can ensure a more stable pc
and prevent data corruption, then yes, it must be enforced as the minimum.
after all, if consumers really knew there is a chance of their data being corrupted by ram, i think they would be more sceptical about computers and about storing valuable data on them.
and even a single bit error can make a doc fail to open.

we always talk about cost and cost and cost
it reminds me of when new mobos dropped the floppy
and before you start: if you wanted to build a system with winXP
you needed to feed it drivers, so you needed a floppy to go along with it
so i buy a new 200 euro mobo with no floppy, and they saved what, 1 euro?
and i have to pay 40 euro for a usb floppy
so the company made one more euro and i lost 40
wow
and if ecc ram becomes the norm, then there will be no non-ecc ram to compare against and say "ohhh, that one is 10 euro cheaper"
so there will be no price penalty to consider
and such a small price to change attitudes can go unnoticed (well, in this forum everyone has such expensive hardware and we are debating over 10 euro?)
my thoughts
 

CLite

Golden Member
Dec 6, 2005
1,726
7
76

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
No shit it's cherry picked and the ECC-on/ECC-off are not the same cards, the ECC-on card has less memory (about 12% less).

Agreed with the point that with the new 3.1 library there's only a marginal difference, but they actually are the same card.

It's just that with ECC on, the card devotes a certain percentage of its memory to storing the check data. This is different from the arrangement on ECC DIMMs, which use an extra chip to hold the check data so you don't lose capacity (although you do pay slightly more).
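Illustrative arithmetic only (the exact fraction a given card reserves is implementation-specific): if check bits are carved out of the existing memory at roughly the DIMM-style 8-bits-per-64 ratio, the visible capacity drops by about a ninth, which lines up with "about 12% less".

```python
# Hypothetical 6 GB card reserving check bits at the 64-data / 8-check ratio.
total_gb = 6.0
usable_gb = total_gb * 64 / 72      # 8 of every 72 bits hold check data
print(f"{usable_gb:.2f} GB usable, {1 - 64/72:.1%} lost to ECC")
```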
 

NP Complete

Member
Jul 16, 2010
57
0
0
I think I'll have to agree to disagree with you, tynopik. I agree with most of your points (memory errors are bad, ECC either detects or fixes most of these errors, etc.), however I still believe you're overemphasizing the chance of corruption (you've provided no hard numbers) and overstating how much avoiding it is worth the extra cost.

There are many devices in a computer that can cause corruption: the SRAM and registers in the CPU are by no means immune to bit flips. Hard drives are not immune to bit rot. Network cards, cables and routers can corrupt information. In all these cases steps are taken to minimize the chance of corruption, but there is no 100% error-free part. I'd assert that non-ECC memory in current usage models has an acceptable error rate when compared to the error rates of other consumer-grade parts. A transient error in the CPU, network card, or hard drive can be just as devastating for data integrity as problems with memory.

Also, if you read the Google paper, you'll see that not all ECC DRAM implementations have the same rate of uncorrectable errors. The paper specifically mentions chipkill vs non-chipkill memory. Once again, ECC implementations are all about cost vs benefit. Being able to correct more errors requires more bits - and more cost.

If ECC were "free", then by all means I'd advocate putting it in all devices. But it's not - there's cost both with manufacturing and with development. You make some good points, but your position of "ECC for everyone" without providing a cost/benefit analysis at even a basic level (e.g. "I'd pay $100 to guarantee no corruption") makes your points come across as zealotry and makes it difficult to have a rational discussion.
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
however I still believe you're overemphasizing the chance of corruption (you've provided no hard numbers)

? the google study is full of hard numbers

over 8% of all DIMMs suffer at least one error in a year

and those that have at least one error usually have more than one

and that's server-grade memory. I dare say consumer memory is even worse.
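To put that figure in perspective, a rough back-of-the-envelope sketch (the 8% per DIMM-year figure is the one quoted above; the 2-DIMM, 3-year machine is just an example, and it naively treats DIMM-years as independent even though the study says errors are correlated):

```python
# Chance that an ordinary machine sees at least one memory error over its life,
# under a crude independence assumption.
p_per_dimm_year = 0.08
dimms, years = 2, 3
p_at_least_one = 1 - (1 - p_per_dimm_year) ** (dimms * years)
print(f"{p_at_least_one:.0%} chance of at least one memory error")   # ~39%
```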

and overstating how much avoiding it is worth the extra cost.

you sound like one of those morbid risk assessment people for car manufacturers who compare the cost X of fixing a defect vs cost Y of settling the lawsuits from people they kill and maim.

There are many devices in a computer that can cause corruption: the SRAM and registers in the CPU are by no means immune to bit flips. Hard drives are not immune to bit rot. Network cards, cables and routers can corrupt information. In all these cases steps are taken to minimize the chance of corruption, but there is no 100% error-free part.

Obviously it can never be perfect, but that doesn't mean it can't be SUBSTANTIALLY better.

Saying "well we can't make it 100% better, so we shouldn't try to make it 50% better" is a copout

ECC memory is the most cost-effective way to further increase the data integrity of modern systems.

Also, if you read the Google paper, you'll see that not all ECC DRAM implementations have the same rate of uncorrectable errors. The paper specifically mentions chipkill vs non-chipkill memory.

Again, it's not just about correcting errors, it's also about detecting errors. Even if ECC can't correct a given error, it can still detect it and prevent it from being written to disk. Silent data corruption is a huge issue and ECC resolves it.

Once again, ECC implementations are all about cost vs benefit. Being able to correct more errors requires more bits - and more cost.

The cost of ECC is so minimal that if it was made mandatory, most consumers would never even be able to tell a difference.


If ECC were "free", then by all means I'd advocate putting it in all devices.

it's so cheap, it essentially is

You make some good points, but your position of "ECC for everyone" without providing a cost/benefit analysis at even a basic level (e.g. "I'd pay $100 to guarantee no corruption") makes your points come across as zealotry and makes it difficult to have a rational discussion.

Cost/benefit for who?

For the manufacturers? Of course it's not worth it for the manufacturers (at least in the short term). If you sent in the wrong taxes and now have to pay a huge penalty, what do they care? They saved $5 and that's all that matters.

On the other hand if you don't get your dream job because your resume suffered an embarrassing glitch, what is the cost to you?
 

NP Complete

Member
Jul 16, 2010
57
0
0
Tynopik, your passion has inspired me to settle this issue with science! I'm going to write up a program and run it over the course of a couple of days on whatever system I can access (if I'm lucky, I may be able to sneak it into my work's test lab and snag a few hundred machines for a few hours - I don't think they have ECC on all of our test machines ;)

Wikipedia has some breakouts of DRAM error rates that seem to be extrapolated from the Google paper - the number I've seen is 3-10x10^-9 bit errors/hour: http://en.wikipedia.org/wiki/ECC_memory - this roughly translates into 1 error per GB per hour at the extreme end! If there were a 1-to-1 correlation between memory errors and corrupted documents, this would be horrible indeed (something I would gladly pay to avoid). However, I bet that the observed rate of propagated errors is actually far lower.
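For scale, a rough conversion assuming the often-quoted 25,000-75,000 FIT per Mbit range from the Google paper (FIT = failures per billion device-hours); the numbers are illustrative only:

```python
# Convert a FIT-per-Mbit correctable error rate into errors per GB per hour.
for fit_per_mbit in (25_000, 75_000):
    errors_per_mbit_hour = fit_per_mbit / 1e9
    mbits_per_gb = 8 * 1024                      # 8192 Mbit in 1 GB
    per_gb_hour = errors_per_mbit_hour * mbits_per_gb
    print(f"{fit_per_mbit} FIT/Mbit -> {per_gb_hour:.2f} errors per GB per hour")

# A few tenths of an error per GB per hour at the high end - the same ballpark
# as the "roughly one per GB per hour" figure mentioned above.
```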

Not all data corruption is equal: if my screen buffer has a bad pixel, the effect is transient and I don't care (especially because the "problem" will likely be fixed in the next screen refresh). If, say, my resume gets corrupted after I edit it, that's much worse. Operating systems, programs, and codecs often employ checksums and the like to verify data integrity as well, which reduces the chance of catastrophic errors.

I'd propose the following program (a rough sketch follows below):
1) Read data from a specified source (disk or network)
2) Check that the data is correct
3) Compute the next value (e.g. increment it)
4) Write the data back to the source (disk or network)
Essentially a variation on memcheck, but adding all the other subsystems in as well. I'm interested in the question: how often does data get corrupted on a consumer's system in an actual real-life scenario? Do you have any suggestions to improve the experimental method?
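Something along these lines is what I have in mind; every concrete detail (file name, block size, pass count, the hash-based pattern) is chosen arbitrarily for illustration, not a validated methodology.

```python
# Round-trip a known pattern through RAM and disk, verify it, derive the next
# pattern, and repeat, counting any mismatches.
import hashlib, os

PATH = "scratch.bin"            # hypothetical scratch file on the disk under test
BLOCK = 64 * 1024 * 1024        # 64 MiB written and re-read each pass
PASSES = 1000

def pattern(seed: int) -> bytes:
    """Deterministic, cheap-to-regenerate data for a given pass number."""
    digest = hashlib.sha256(seed.to_bytes(8, "little")).digest()
    return (digest * (BLOCK // len(digest) + 1))[:BLOCK]

def write_block(data: bytes) -> None:
    with open(PATH, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

errors = 0
write_block(pattern(0))                      # seed the file for the first pass
for i in range(PASSES):
    with open(PATH, "rb") as f:              # 1) read data from the source
        actual = f.read()
    if actual != pattern(i):                 # 2) check the data is correct
        errors += 1
        print(f"pass {i}: corruption detected")
    write_block(pattern(i + 1))              # 3) compute next value, 4) write it back

print(f"{errors} corrupted passes out of {PASSES}")
os.remove(PATH)
```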

I'll do a little research first, and see if there is anything published about overall corruption rates.
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
Not all data corruption is equal: if my screen buffer has a bad pixel, the effect is transient and I don't care (especially because the "problem" will likely be fixed in the next screen refresh). If, say, my resume gets corrupted after I edit it, that's much worse.

The likelihood of an error in memory resulting in data corruption depends on several factors.

Obviously the size of the document is the most important. The more data being written to disk, the more likely it is to get hit by an error.

But errors can also hit the program and corrupt the saving routine or they can impact the OS itself and the various IO routines there.

What would be a representative test? I have no idea.

Something that might stress the system would be to compress a large number of files into an archive, then extract the files to a different directory and run a file compare (a rough sketch follows below). Delete the archive and the extracted folders and repeat.

Of course this doesn't distinguish between memory, cpu or disk errors, but it gives some idea of the total error rates systems might experience.
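Something like this, with all names and counts made up for the example; the compare is done byte-for-byte, so any mismatch means something in the RAM/CPU/disk path corrupted the data in flight.

```python
# Zip a source tree, extract it elsewhere, compare every file's contents,
# clean up and repeat.
import filecmp, os, shutil, zipfile

SRC = "testdata"              # hypothetical directory of files to round-trip
ROUNDS = 100

for i in range(ROUNDS):
    archive, out_dir = "roundtrip.zip", "extracted"
    # compress
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(SRC):
            for name in files:
                zf.write(os.path.join(root, name))
    # extract to a different directory
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    # byte-for-byte compare (shallow=False forces a full content read)
    for root, _, files in os.walk(SRC):
        for name in files:
            orig = os.path.join(root, name)
            copy = os.path.join(out_dir, orig)
            if not filecmp.cmp(orig, copy, shallow=False):
                print(f"round {i}: {orig} differs after the round trip")
    # delete and repeat
    os.remove(archive)
    shutil.rmtree(out_dir)
```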
 