Why isn't ECC memory used more?

Discussion in 'Memory and Storage' started by thewhat, Mar 26, 2012.

  1. Phynaz

    Phynaz Diamond Member

    Joined:
    Mar 13, 2006
    Messages:
    8,831
    Likes Received:
    33
    Such a device would fail validation and would not be installed / offered for sale, as described in this presentation from Samsung.
    http://books.google.com/books?id=9G...ce=gbs_ge_summary_r&cad=0#v=onepage&q&f=false

    Or in this paper on enterprise class memory testing:
    http://www.smartm.com/files/salesLiterature/dram/smart_whitepaper_sbe.pdf

    If one of these "weak cells" does get by testing, correcting it doesn't do you any good because the entire module is at high risk of latent failure and should be replaced.

    I'd be interested in any references you can provide stating that ECC memory is designed to correct hardware failure.
     
    #51 Phynaz, Mar 29, 2012
    Last edited: Mar 29, 2012
  2. NP Complete

    NP Complete Member

    Joined:
    Jul 16, 2010
    Messages:
    57
    Likes Received:
    0
    Quite true - depending on the error-correction scheme (Chipkill included), errors in the ECC check bits themselves aren't fatal.

    However, not all ECC memory implements Chipkill, a distinction I believe the Google DRAM study addresses - http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

    From memory (my own), the Chipkill systems had a lower incidence of uncorrectable errors, while those with "other" ECC mechanisms had a higher rate.

    Basic point - ECC isn't always a guarantee of single-bit error immunity (but it can be).
     
  3. VirtualLarry

    VirtualLarry Lifer

    Joined:
    Aug 25, 2001
    Messages:
    33,315
    Likes Received:
    24
    This is not true. There is a mathematical relationship between all of the bits in an ECC word.
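    A toy sketch of that relationship (using a small Hamming(7,4) code purely for illustration; real DIMMs use a wider SECDED code, but the principle is the same):

    # Hamming(7,4): check bits sit at positions 1, 2 and 4, and each one is the
    # parity of every position whose index contains that power of two. That is
    # the "mathematical relationship" tying all the bits of the word together:
    # flip ANY one bit, data or check, and a unique non-zero syndrome results.

    def encode(data_bits):                  # data_bits: four 0/1 values
        code = [0] * 8                      # index 0 unused; positions 1..7
        for bit, pos in zip(data_bits, (3, 5, 6, 7)):
            code[pos] = bit
        for p in (1, 2, 4):                 # fill in each check bit
            code[p] = 0
            for i in range(1, 8):
                if i != p and (i & p):
                    code[p] ^= code[i]
        return code[1:]                     # the 7-bit codeword

    def syndrome(word):                     # 0 means clean, else the flipped position
        code = [0] + list(word)
        s = 0
        for p in (1, 2, 4):
            parity = 0
            for i in range(1, 8):
                if i & p:
                    parity ^= code[i]
            if parity:
                s |= p
        return s

    cw = encode([1, 0, 1, 1])
    assert syndrome(cw) == 0                # intact word: syndrome is zero
    cw[1] ^= 1                              # flip a CHECK bit (position 2)
    assert syndrome(cw) == 2                # the syndrome still points right at it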
     
  4. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    memory degrades over time, so what passes validation initially might present problems later

    please be precise. what 'hardware failure' are you talking about exactly?
     
  5. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    To use the car analogy, hyperthreading or more CUDA cores give you more power, so are like a bigger engine. There is no requirement that your car comes with a 300hp engine. Manufacturers can offer models with 300hp, 200hp, 150hp and even 80hp at different price points. Nothing wrong with that.

    However even the 80hp model is required to pass safety tests.

    I never said they would crash every day.

    you mean as high as real world testing shows they are?

    exactly!

    So what if your resume has an embarrassing typo, so what if your tax return has a wrong number?

    People still store vital and critical information on their computers and they should be assured that it won't be corrupted.

    It should be mandatory. Consumers should never have to know or care what ECC is, they should just know that their computers work and are reliable.

    Just like consumers should never have to understand what makes for good gas tank design, they should just know that it won't explode in an accident.

    Safety is not an option.

    In case of an uncorrectable error, ECC prevents bad data from being written to disk in the first place
     
  6. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    http://en.wikipedia.org/wiki/SECDED

    Thus errors in the ECC code can be detected and discarded.
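    Roughly, the decision rule described there boils down to this (a sketch with names of my own choosing, assuming the usual Hamming syndrome plus one overall parity bit):

    # SECDED = Hamming check bits + one extra overall-parity bit.
    # "hamming_syndrome" is the usual syndrome; "overall_parity_ok" says whether
    # the parity of the whole received word (data + check bits) still holds.

    def classify(hamming_syndrome, overall_parity_ok):
        if hamming_syndrome == 0 and overall_parity_ok:
            return "no error"
        if not overall_parity_ok:
            # an odd number of flips, treated as a single-bit error; it is
            # correctable whether it landed in a data bit or in a check bit
            return "single-bit error: correct it (position given by the syndrome)"
        # non-zero syndrome but overall parity still holds -> an even number of
        # flips: detected and reported/discarded, never silently "corrected"
        return "double-bit error: detect and discard, do not correct"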
     
  7. Phynaz

    Phynaz Diamond Member

    Joined:
    Mar 13, 2006
    Messages:
    8,831
    Likes Received:
    33
    You know exactly what I mean.

    Memory hardware failure.
     
  8. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    no i have no idea what you mean

    that's like saying 'car failure', it could mean anything

    cells can become weak, meaning that sometimes they're read incorrectly

    bits can become stuck so they're always the same value

    the soldering on a chip might go bad causing intermittent problems

    you might take a chisel and knock a chip off a dimm

    all those are covered under 'memory hardware failure', it's so generic as to be meaningless
     
  9. Phynaz

    Phynaz Diamond Member

    Joined:
    Mar 13, 2006
    Messages:
    8,831
    Likes Received:
    33
    Others here have shown you definitive evidence of what ECC is designed to do.

    You refuse to post evidence of what you say. I wonder why?
     
  10. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    stop speaking in riddles. what exactly am i saying that you disagree with? that ECC will correct single-bit errors? that ECC will detect double-bit errors? That ECC memory doesn't silently fail like non-ECC memory?
     
  11. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    ECC isn't JUST about correcting errors, it's also about DETECTING errors in the first place.

    You're exactly right that a module with one failure is likely to have more. The Google paper agrees, errors are highly correlated.

    ECC provides you with 'early warning' that these modules need to be replaced.

    Without ECC the errors will get worse and worse and your system will get more and more unstable and more and more data will be corrupted and you'll have no clue as to why.

    With ECC you'll know there's a problem with the memory and can get that module replaced before it gets too bad.
     
  12. fuzzymath10

    fuzzymath10 Senior member

    Joined:
    Feb 17, 2010
    Messages:
    516
    Likes Received:
    0
    Considering the number of memory modules out there, I think in aggregate the rate of defective modules is low enough that nobody cares. ECC is not a magical solution; if enough bits flip at once, ECC can mistakenly accept corrupted data as legitimate, though this is extremely improbable. Hard drive specs usually state an error rate corresponding to the probability that enough bits are flipped that the ECC cannot even detect the error, and it's something like 1 in 10^18 (can't remember for sure).

    I studied basic coding theory as an undergrad, and I'm pretty sure that all error-correcting codes have a cost: you need extra bits to be able to detect and/or correct errors (how those bits are used characterizes each coding method, and some codes are much more efficient than others). This should increase the raw materials required to produce an ECC module, even though the market price could fluctuate.
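    As a rough illustration of that extra-bit cost, assuming the (72,64) SECDED word that standard ECC DIMMs use:

    # How many check bits does SECDED need to protect a 64-bit data word?
    m = 64
    r = 1
    while 2 ** r < m + r + 1:      # Hamming bound: r bits must name every position plus "no error"
        r += 1
    print(r)                       # 7 check bits for single-error correction
    print(r + 1)                   # +1 overall parity bit for double-error detection -> 8
    print((r + 1) / m)             # 0.125 -> 12.5% extra storage, i.e. the ninth chip on a DIMM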

    The vast majority of errors are caused by crappy software, running hardware out of spec (e.g. messing with voltages, overclocking, etc.), or some random hardware incompatibility that ECC would not fix. The feature IS available if it is worth the money to you, and if not, there's always AMD.
     
  13. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    Consumers shouldn't have to know anything about coding theory to be confident that their computer is not corrupting data.

    It's the same reason there are government standards for food safety that restaurants are required to adhere to and that there are government standards for car safety that auto manufacturers have to adhere to.

    We don't say, 'Well, if not getting salmonella is important to you, you can go to the restaurant with independent certification that it's clean.' No! All consumers at all restaurants deserve to be safe. There is no option.

    Without oversight, companies have an economic incentive to cut corners and if a company can save 2 cents by allowing your tax return to get corrupted, they'll do it 10 times out of 10.

    Here, Intel needs to take the position of 'the government' by REQUIRING ECC memory for ALL their chips. They shouldn't even allow companies the option to cut corners in this regard. This creates a level playing field for companies to compete on grounds other than safety.
     
  14. paperwastage

    paperwastage Golden Member

    Joined:
    May 25, 2010
    Messages:
    1,847
    Likes Received:
    1
  15. Ben90

    Ben90 Platinum Member

    Joined:
    Jun 14, 2009
    Messages:
    2,866
    Likes Received:
    0
    Hmmm, weird. I think your UDP packet got corrupted on the way to the server. We should use TCP links all the time because it's more reliable.
     
  16. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    1. ECC doesn't cost $100 extra
    2. You're confusing uptime and correctness

    Consumers don't need five-nines uptime, so paying for dual power supplies and similar redundancy doesn't make sense.

    But they still need data integrity. If the computer is running, it should not be corrupting their data.
     
    #66 tynopik, Mar 30, 2012
    Last edited: Mar 30, 2012
  17. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    UDP is used where data loss is acceptable

    For many applications on a computer, data loss is not acceptable.

    Saying that because there are SOME network applications where data loss is acceptable, therefore ALL computer applications are ok with data corruption is just nonsense.
     
    #67 tynopik, Mar 30, 2012
    Last edited: Mar 30, 2012
  18. Ben90

    Ben90 Platinum Member

    Joined:
    Jun 14, 2009
    Messages:
    2,866
    Likes Received:
    0
    Yes, it's cherry-picked.
    [image]
     
  19. anikhtos

    anikhtos Senior member

    Joined:
    May 1, 2011
    Messages:
    289
    Likes Received:
    0
    i agree with tynopik on these
    okay, ecc is not magic
    ecc can not correct everything
    but frankly, if ecc were as useless as you make it out to be, why would servers bother to use it??
    so ecc is good for something
    and for crying out loud, no one said that a consumer pc must have 100% uptime like a server.
    but if a small cost of 20-30 euro, or even less, in memory
    can ensure a more stable pc
    and prevent data corruption, then yes, it must be enforced as a minimum.
    after all, if consumers really knew that there is a chance of their data being corrupted by ram, i think they would be more sceptical about computers and about storing valuable data on them.
    well, even a single bit error can make a doc fail to open.

    we always talk about cost and cost and cost
    it reminds me of when new mobos dropped the floppy
    and before you start: if you want to build a system with winXP
    you need to feed it drivers, so you need a floppy to go along
    so i buy a new 200 euro mobo with no floppy, and they saved what, 1 euro???
    and i have to pay 40 euro for a usb floppy
    so the company made one more euro and i lost 40
    wow
    and if ecc ram becomes the norm, there will not be a non-ecc ram to compare against and say ohhhhhh, that is 10 euro cheaper
    so there will not be a price penalty to consider
    and such a small price to change attitudes can go unnoticed (well, on this forum everyone has so much expensive hardware and we are debating over 10 euro?)
    my thoughts
     
  20. CLite

    CLite Golden Member

    Joined:
    Dec 6, 2005
    Messages:
    1,719
    Likes Received:
    0
    http://www.microway.com/pdfs/TeslaC2050-Fermi-Performance.pdf

    No shit it's cherry-picked, and the ECC-on/ECC-off numbers aren't from equivalent cards; the ECC-on card has less memory (about 12% less). The 4th slide shows cuFFT 3.1, which has ECC-on/ECC-off almost the same; you linked an outdated library (cuFFT 3.0) with overall worse performance. So either you are oblivious or you are disingenuous.
     
  21. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    Agreed that with the new 3.1 library there's only a marginal difference, but they are actually the same card.

    It's just that with ECC on, the card devotes a certain percentage of its memory to storing the check data. This is different from the arrangement on ECC DIMMs, which use an extra chip to hold the check data so you don't lose capacity (although you do pay slightly more).
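    A rough sketch of why that shows up as lost capacity on the GPU, assuming the usual 8 check bits per 64 data bits (the exact fraction a given card reserves may differ slightly):

    # On an ECC DIMM the check bits live on a ninth chip, so the advertised
    # capacity is all usable. If they instead live in the same memory pool:
    total_gb = 3.0                          # e.g. a 3 GB card
    usable_gb = total_gb * 64 / 72          # 64 data bits out of every 72 stored
    print(f"{usable_gb:.2f} GB usable")     # ~2.67 GB
    print(f"{1 - 64/72:.1%} lost")          # ~11.1%, roughly the ~12% quoted above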
     
  22. NP Complete

    NP Complete Member

    Joined:
    Jul 16, 2010
    Messages:
    57
    Likes Received:
    0
    I think I'll have to agree to disagree with you, tynopik. I agree with most of your points (memory errors are bad, ECC either detects or fixes most of these errors, etc), however I still believe you're overemphasizing both the likelihood of corruption (you've provided no hard numbers) and the value of preventing it relative to the extra cost.

    There are many devices in a computer that can cause corruption: the SRAM and registers in the CPU are by no means immune to bit flips. Hard drives are not immune to bit rot. Network cards, cables and routers can corrupt information. In all these cases steps are taken to minimize the chance of corruption, but there is no 100% error-free part. I'd assert that non-ECC memory in current usage models has an acceptable error rate when compared to the error rates of other consumer-grade parts. A transient error in the CPU, network card, or hard drive can be just as devastating for data integrity as a problem with memory.

    Also, if you read the Google paper, you'll see that not all ECC DRAM configurations have the same rate of uncorrectable errors. The paper specifically compares Chipkill vs non-Chipkill memory. Once again, ECC implementations are all about cost vs benefit: being able to correct more errors requires more bits - and more cost.

    If ECC were "free", then by all means I'd advocate putting it in all devices. But it's not - there's a cost both in manufacturing and in development. You make some good points, but your position of "ECC for everyone" without providing a cost/benefit analysis at even a basic level (e.g. "I'd pay $100 to guarantee no corruption") makes your points come across as zealotry and makes it difficult to have a rational discussion.
     
  23. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    ? the google study is full of hard numbers

    over 8% of all DIMMs suffer at least one error in a year

    and those that have at least one error usually have more than one

    and that's server-grade memory. I dare say consumer memory is even worse.
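    To put that figure in per-machine terms (a back-of-the-envelope sketch using the ~8%-per-DIMM-per-year number, and assuming, generously, that DIMMs fail independently):

    p_dimm = 0.08                           # chance a single DIMM sees >= 1 error in a year
    for n_dimms in (2, 4, 8):
        p_machine = 1 - (1 - p_dimm) ** n_dimms
        print(f"{n_dimms} DIMMs: ~{p_machine:.0%} chance of at least one error per year")
    # 2 DIMMs: ~15%   4 DIMMs: ~28%   8 DIMMs: ~49%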

    you sound like one of those morbid risk assessment people for car manufacturers who compare the cost X of fixing a defect vs cost Y of settling the lawsuits from people they kill and maim.

    Obviously it can never be perfect, but that doesn't mean it can't be SUBSTANTIALLY better.

    Saying "well we can't make it 100% better, so we shouldn't try to make it 50% better" is a copout

    ECC memory is the most cost-effective way to further increase the data integrity of modern systems.

    Again, it's not just about correcting errors, it's also about detecting errors. Even if ECC can't correct a given error, it can still detect it and prevent it from being written to disk. Silent data corruption is a huge issue and ECC resolves it.

    The cost of ECC is so minimal that if it was made mandatory, most consumers would never even be able to tell a difference.


    it's so cheap, it essentially is

    Cost/benefit for who?

    For the manufacturers? Of course it's not worth it for the manufacturers (at least in the short term). If you sent in the wrong taxes and now have to pay a huge penalty, what do they care? They saved $5 and that's all that matters.

    On the other hand if you don't get your dream job because your resume suffered an embarrassing glitch, what is the cost to you?
     
    #73 tynopik, Mar 30, 2012
    Last edited: Mar 30, 2012
  24. NP Complete

    NP Complete Member

    Joined:
    Jul 16, 2010
    Messages:
    57
    Likes Received:
    0
    Tynopik, your passion has inspired me to settle this issue with science! I'm going to write up a program and run it over the course of a couple of days on whatever system I can access (if I'm lucky, I may be able to sneak it into my work's test lab and snag a few hundred machines for a few hours - I don't think they have ECC on all of our test machines ;)

    Wikipedia has some breakouts of DRAM error rates that seem to be extrapolated from the Google paper - the number I've seen is 3-10x10^-9 bit errors/hour: http://en.wikipedia.org/wiki/ECC_memory - this roughly translates into 1 error per GB per hour at the extreme end! If there were a 1-to-1 correlation between memory errors and corrupted documents, this would be horrible indeed (something I would gladly pay to avoid). However, I bet that the observed rate of propagated errors is actually far lower.
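    For what it's worth, here's the unit conversion between the per-bit and per-gigabyte ways of quoting that rate (just arithmetic, no claim about which published figure is the right one):

    bits_per_gb = 8 * 2 ** 30                       # ~8.6e9 bits in a gigabyte
    # per-bit rate that corresponds to one error per GB per hour:
    print(f"{1.0 / bits_per_gb:.1e} errors per bit per hour")       # ~1.2e-10
    # and going the other way, the quoted per-bit figures scaled to a gigabyte:
    for rate in (3e-9, 10e-9):
        print(f"{rate:.0e}/bit/h -> {rate * bits_per_gb:.0f} errors per GB per hour")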

    Not all data corruption is equal: if my screen buffer has a bad pixel, the effect is transient and I don't care (especially because the "problem" will likely be fixed by the next screen refresh). If, say, my resume gets corrupted after I edit it, that's much worse. Operating systems, programs, and codecs often employ checksums and the like to verify data integrity as well, which reduces the chance of catastrophic errors.

    I'd propose the following program:
    1) Read data from the specified source (disk or network)
    2) Check that the data is correct
    3) Compute the next value (e.g. increment it)
    4) Write the data back to the source (disk or network)
    Essentially a variation on memcheck, but exercising all the other subsystems as well. I'm interested in the question: how often does data get corrupted on a consumer's system in an actual real-life scenario? Do you have any suggestions to improve the experimental method?
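    Something like this minimal sketch, maybe (a file-backed variant; all the names here are just illustrative, not a finished harness):

    import hashlib
    import struct
    import time

    PATH = "integrity_test.bin"
    PAYLOAD_BYTES = 1024 * 1024                 # 1 MiB of data per iteration

    def make_block(counter):
        # 3) compute the next value: a block full of the current counter
        payload = struct.pack("<Q", counter) * (PAYLOAD_BYTES // 8)
        return payload + hashlib.sha256(payload).digest()

    def block_ok(blob):
        # 2) check the data is correct via its embedded checksum
        payload, digest = blob[:-32], blob[-32:]
        return hashlib.sha256(payload).digest() == digest

    counter, errors = 0, 0
    with open(PATH, "wb") as f:
        f.write(make_block(counter))

    for _ in range(100_000):                    # run for as long as you like
        with open(PATH, "rb") as f:             # 1) read the data back from disk
            blob = f.read()
        if not block_ok(blob):
            errors += 1
            print(f"corruption detected at iteration {counter} ({errors} so far)")
        counter += 1
        with open(PATH, "wb") as f:             # 4) write the next value back out
            f.write(make_block(counter))
        time.sleep(0.01)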

    I'll do a little research first, and see if there is anything published about overall corruption rates.
     
  25. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    3,806
    Likes Received:
    2
    The likelihood of an error in memory resulting in data corruption is dependent on several factors.

    Obviously the size of the document is the most important factor. The more data being written to disk, the more likely it is to be hit by an error.

    But errors can also hit the program and corrupt the saving routine or they can impact the OS itself and the various IO routines there.

    What would be a representative test? I have no idea.

    Something that might stress the system would be to compress a large number of files into an archive, extract the files to a different directory, and run a file compare. Delete the archive and the extracted folders and repeat.

    Of course this doesn't distinguish between memory, CPU, or disk errors, but it gives some idea of the total error rates systems might experience.
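    A quick sketch of that loop (directory names are just placeholders; point SOURCE_DIR at whatever large set of files you want to churn through):

    import filecmp
    import shutil
    import zipfile
    from pathlib import Path

    SOURCE_DIR = Path("test_data")              # the files to round-trip
    ARCHIVE = Path("roundtrip.zip")
    EXTRACT_DIR = Path("roundtrip_out")

    def one_pass():
        # compress everything under SOURCE_DIR into one archive
        with zipfile.ZipFile(ARCHIVE, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in SOURCE_DIR.rglob("*"):
                if path.is_file():
                    zf.write(path, path.relative_to(SOURCE_DIR))
        # extract it into a different directory
        with zipfile.ZipFile(ARCHIVE) as zf:
            zf.extractall(EXTRACT_DIR)
        # byte-for-byte compare every extracted copy against its original
        mismatches = []
        for path in SOURCE_DIR.rglob("*"):
            if path.is_file():
                copy = EXTRACT_DIR / path.relative_to(SOURCE_DIR)
                if not filecmp.cmp(path, copy, shallow=False):
                    mismatches.append(path)
        # delete the archive and the extracted tree, then go around again
        shutil.rmtree(EXTRACT_DIR)
        ARCHIVE.unlink()
        return mismatches

    for i in range(100):
        bad = one_pass()
        if bad:
            print(f"pass {i}: {len(bad)} mismatched file(s)")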
     
    #75 tynopik, Mar 30, 2012
    Last edited: Mar 30, 2012