Why isn't ECC memory used more?

Discussion in 'Memory and Storage' started by thewhat, Mar 26, 2012.

  1. NP Complete

    NP Complete Member

    Joined:
    Jul 16, 2010
    Messages:
    57
    Likes Received:
    0
Personal attack much? People asking questions, like you, help others develop an understanding. Before this thread, I dismissed ECC as an unnecessary expense for most people. After looking at the numbers I'm a bit more concerned about my own data reliability. But I still don't believe I have the full picture. How often do failures matter? If I start getting failures, are they self-reporting - does a high number of failures cause overall system instability, cluing me in to a problem on my system?

Intel undoubtedly segments the market for money reasons - Intel is a business. As a business, its primary purpose is to make money. Making money isn't bad in and of itself, especially if it's used to drive research and development. I don't personally know what Intel uses the money for, but you probably don't either.

Government regulation carries a hefty price - it is usually reserved for areas where there is a blatant mismatch in power (e.g. corporations vs. people) and non-compliance represents a serious threat to the public interest. While there definitely is a discrepancy in power (Intel vs. you), data integrity is clearly a private interest. For private interests, the government's involvement should be more limited - it should ensure that any contract or warranty, explicit or implicit, is honored. However, I don't want this to digress into a civics discussion.

Additionally, assumed liability is no straw man: companies across all industries actively avoid liability like the plague (because it can seriously hurt bottom lines). No company is going to assume liability for the use of its device in an environment it doesn't control.

Memory corruption isn't a "fireball" - and on top of that, even an engine exploding isn't always the manufacturer's fault. If it explodes because someone tried to attach an after-market turbocharger, clearly the one who attached the after-market device is at fault. If your RAM has a high error rate because you undervolt it, it isn't the RAM manufacturer's fault, nor is it their fault if your RAM fails due to inadequate cooling.

Intel is under no obligation to provide ECC - you have in no way demonstrated a clear obligation. The only obligation I can see you reasonably arguing is that an implicit warranty exists that data should last for the life of the device (which I still think is stretching it). Most warranties are for 90 days - some are for 1 year (and in rare cases, 5 years). There is no hard evidence that a user has a large chance of corruption in that time frame. Equally, there's no evidence that ECC provides additional security given the error rates of other consumer-grade devices (motherboards, hard disks, power supplies).

The answer to why ECC isn't used more is: a) Intel wants more profit, and b) the cost of implementing ECC is greater than the loss of business from customers switching brands over data loss.

Intel isn't alone in not using ECC - most consumer-grade devices don't use it (smartphones, tablets, video game systems). I doubt that using non-ECC memory is some sort of conspiracy to screw over consumers by every device manufacturer out there. Manufacturers will likely resort to ECC once error rates become high enough to cause product failures that cost them money through RMAs and brand degradation. For instance, Intel uses Hamming codes on its SSDs to protect against a high rate of bit errors in NAND.
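To make the Hamming-code reference concrete, here is a minimal Python sketch of the classic Hamming(7,4) scheme. This is an illustration only - real SSD controllers and ECC DIMMs implement longer codes in hardware - but the single-error-correction principle is the same:

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def hamming74_correct(c):
    """Recompute parities; a nonzero syndrome names the flipped position."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based bad-bit position, 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the bad bit back
    return c
```

A single flipped bit anywhere in the 7-bit word - data or parity - produces a syndrome pointing straight at the bad position, which is the same idea ECC memory uses (with wider SECDED codes that also detect double errors).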
     
  2. moriz

    moriz Member

    Joined:
    Mar 11, 2009
    Messages:
    196
    Likes Received:
    0
Before attacking me, answer my question: what is the chance of a randomly flipped bit in RAM causing a noticeable error in storage? Give me an actual percentage. "15%" being affected doesn't tell me how the effect will manifest. Until you can tell me what the meaningful error rate is, you are just crying wolf. Making mostly unrelated comparisons to airplanes and car engines doesn't help your case.

Also, the point of backups is to let me go back in time on a file. If a file gets corrupted, I can retrieve a version of that file from a week ago - sometimes even from yesterday, depending on what it is. If your backup solution can't do that, then it is not a backup.
     
  3. Ben90

    Ben90 Platinum Member

    Joined:
    Jun 14, 2009
    Messages:
    2,866
    Likes Received:
    1
Omfg you mean you don't have a processor with thousands of dollars worth of RAS features? Just think of the safety of your data?! I mean it's really not optional, you need to have these.

Consumers shouldn't be given the option to buy such renegade systems as Sandy Bridge Xeons or i7s. Phenoms and Opterons are just too reckless with our data. We NEED Itaniums and POWER7s. Just imagine if one of your background pictures switched a 173 205 25 pixel to a 174 205 25 pixel. THE CONSEQUENCES WILL NEVER BE THE SAME!!
     
  4. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
    Not at all. It's just reality. Market pressures will always lead to sacrificing safety for profits because safety is a hard sell.

    Of course not, but when you put profits ahead of safety you open yourself up to lawsuits

    Notice I'm not suggesting government regulation, I'm suggesting Intel impose this on system builders

    again I'm not arguing for government intervention, but couldn't you just as easily say your health and safety is clearly a private interest?


Of course they aren't going to try to assume liability. But that doesn't mean they can just discard it with some words in their warranty. If you make a defective product that causes losses, you can and will be held liable no matter what your warranty says.

    it depends on what it corrupts, the results can be just as devastating for some people

    non sequitur, i was clearly talking about instances when a KNOWN DEFECT caused the problem

    of course, but irrelevant to the 99.9% of instances where it has errors at stock

    On the contrary, it's quite easy.

    http://en.wikipedia.org/wiki/Implied_warranty

All sorts of juicy angles for lawyers to cover. It comes down to this: Intel knows there is a problem, knows the solution, and yet continues to callously ship defective products that are LIKELY to encounter the problem.

    Have you ever seen some of these successful lawsuits against companies for 'defective' products? Those companies have been held to far higher standards for PREVENTING PROBLEMS than what Intel needs to do here.

    Again you're only looking at the wrong thing. This isn't about CPUs that outright fail. That's the warranty terms you're looking at.

Implied warranties of fitness for purpose, merchantability, and safety, however, last for the life of the product. There is no expiration on them.

    Going back to our favorite car example, if a car comes with a 100k mile powertrain warranty and the engine explodes in a fireball due to a KNOWN DEFECT at 120k miles, do you think the 100k mile limitation on the warranty will do ANYTHING AT ALL to shield the company from liability?

    Of course not.

    When there are hundreds of millions of PCs in use, even small chances become large absolute numbers.
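That scaling argument is simple arithmetic. A quick sketch, where both numbers are assumptions for illustration rather than measurements:

```python
# Back-of-envelope: a small per-machine chance times a huge install base.
machines = 300_000_000   # rough worldwide PC install base (assumed)
p_error = 0.01           # assumed 1% yearly chance of a meaningful memory error

affected = int(machines * p_error)
print(f"{affected:,} machines affected per year")  # 3,000,000
```

Even a 1-in-100 yearly chance, trivial for any individual, turns into millions of affected machines at population scale.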

    That makes no sense. ECC provably reduces errors yet you're arguing that even though it reduces errors, it doesn't actually reduce errors.

    And when the jury hears that excuse for why it's ok that Joe lost his job and his future, they're going to nail Intel so hard they won't know what hit them.

    Corporate greed plays very poorly to juries. Something for Intel to think about.

    Of course it's not a conspiracy, it's just market forces.

    Or a holy fear of lawsuits is put into them.

    All drives include similar systems, even hard drives and CDs.
     
  5. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
    That's unknowable without a lot more information because it depends on how you use the system.

    Absolutely false. There are hundreds of millions of computers out there, and if you cause random errors on 15% of them, it is guaranteed to cause critical problems for lots of people. Just because we can't put an exact figure on the problem doesn't mean it isn't devastatingly real.

    And what if it gets corrupted on the initial save?

    What if you never realize it was corrupted until the IRS comes after you for tax evasion?

What if it corrupts a database, and you've been adding records that don't hit the problem for days/weeks/months? Going back to the uncorrupted copy isn't an option because of all the data you've entered since then.
     
    #105 tynopik, Apr 7, 2012
    Last edited: Apr 7, 2012
  6. moriz

    moriz Member

    Joined:
    Mar 11, 2009
    Messages:
    196
    Likes Received:
    0
    and yet, we're not seeing huge numbers of reported corruption issues due to memory errors on consumer computers. maybe your 15% isn't correct, or you are interpreting it wrong.

cry wolf more. or better yet, get your lawyers ready and take intel to court. if any of this is relevant, you'll do yourself and all of us much more good by forcing intel to adopt ECC on consumer chips than by waging quote wars on a forum.
     
  7. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
    because the typical consumer isn't going to be able to pin it on the memory

    from the manufacturer's perspective, that's the beauty of memory errors. Their very transience makes it a bitch to provably tie a specific problem to a memory error.

    Go tell Google they don't know how to count


    You need a plaintiff that has provably suffered substantial damages due to memory bit errors. That isn't me and that's going to be difficult to find (because of the difficulty of proving anything given the transient nature of memory errors), but difficult doesn't mean impossible.

    And the more time that goes by, the more likely it is that the perfect plaintiff is going to come along and Intel is going to dearly regret ever thinking they could get away with this stunt.
     
  8. VirtualLarry

    VirtualLarry Lifer

    Joined:
    Aug 25, 2001
    Messages:
    35,756
    Likes Received:
    577
    Well, FWIW in this discussion, my NAS holding my photos seems to have developed RAM problems. When viewing pictures, some of them appear corrupted. Rebooting the NAS, and accessing those same pictures from a different, known-stable PC, shows different pictures in that series are corrupted. I have since shut down the NAS, to prevent further corruption.

But RAM corruption is real, and it's serious. Granted, this seems to be some kind of hardware failure, but it would have been nice if the NAS used ECC and could detect the corruption/failure and automagically shut down, or at least go into some sort of failsafe mode where it would not operate as a NAS but would blink a light on the front indicating hardware failure.
     
  9. moriz

    moriz Member

    Joined:
    Mar 11, 2009
    Messages:
    196
    Likes Received:
    0
    since i wasn't clear the first time: go do something about it, not make quote wars on a forum. you are doing nothing constructive here.

    i'd rather go with "you are interpreting it wrong".
     
  10. taltamir

    taltamir Lifer

    Joined:
    Mar 21, 2004
    Messages:
    13,575
    Likes Received:
    0
    Making money is good, greed is good when it results in willing equitable trade.
    Artificial market segmentation is nothing short of exploitation and is unjustifiable.

Personally I view it as a lesser evil compared to having the government step in with regulation. But that is no justification for the practice, nor does it mean consumers should not apply pressure to the companies involved (a perfectly legitimate free-market tactic that does not involve government regulation).

There is also the possible justification of "government broke it, government fix it" - in the sense that this type of artificial market segmentation is only possible in a monopoly, and said monopoly only exists because it is government mandated.
A patent is a government-mandated monopoly. Originally conceived with the best of intentions, it spiraled out of control, and now companies are patent firms first who, on the side, do some technology work.
Transmeta was forced to go with an inefficient design to avoid violating Intel patents; Nvidia was forced to go with an ARM-based APU to avoid violating Intel patents...
And it's not like they were going to steal Intel's design - merely make their products compatible.

    wasn't that a "per year" figure?
     
    #110 taltamir, Apr 7, 2012
    Last edited: Apr 7, 2012
  11. NP Complete

    NP Complete Member

    Joined:
    Jul 16, 2010
    Messages:
    57
    Likes Received:
    0
That really sucks - I'm sorry to hear about your problems. By any chance, are you sure that the issue is the memory and not something else? A bad firmware/OS update? A faulty drive? A bad motherboard or a bad NIC? I hope you're still under the manufacturer's warranty so you can get it fixed. Also, what brand of NAS were you using?

I'm not arguing that data corruption doesn't happen - it obviously does, as your misfortune demonstrates. And no doubt, if your issue is due to memory problems, ECC would have been worth the cost (if it was available). The point in contention is whether RAM errors are the prime factor in data corruption, and if so, whether they cause corruption at rates that break any sort of implied warranty.

Like I posted earlier, I'm going to recuse myself from the conversation because I don't have enough data to substantiate my claims. The primary data used is from Google. Their machine usage obviously doesn't match most consumers' use, making it hard to draw direct conclusions about RAM failures for most consumers. Additionally, the data they provide indicates errors show a strong temporal and spatial proximity, meaning that if RAM does fail, it'll likely do so in a fashion that causes the whole system to become unstable and gives users a clue that something is wrong. Finally, the paper doesn't address the underlying cause of data corruption - it only reports RAM error rates, not their end result.
     
  12. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
then by all means please enlighten us as to how we're supposed to interpret it
     
  13. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
    yup
     
  14. taltamir

    taltamir Lifer

    Joined:
    Mar 21, 2004
    Messages:
    13,575
    Likes Received:
    0
He is interpreting it wrong. He is grossly underestimating the incidence rate of errors.

    http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

How many sticks error? 8% per year. Inevitably every person is affected because:
1. They use computers for more than one year.
2. They use multiple computers.
3. They use multiple DIMMs per computer.

His 15% is:
1. Taking a "per year" figure and treating it as "per forever". This is incorrect and grossly underestimates the incidence rate.
2. Assuming a ratio of DIMM quantities that places the average at just under 2 RAM sticks (i.e., most have 2, some have more, some have 1).
3. Not considering people who use more than one computer (e.g., PC + laptop).

All of this ends up with him underestimating the damage... which could be read as him being "generous" with you by saying "it's at least 15%" (didn't he explicitly say at least?)
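The compounding effect described above can be sketched numerically. The independence assumption below is a simplification - the cited paper shows errors cluster on bad DIMMs - so treat this as a rough upper-bound illustration, not a prediction:

```python
def p_at_least_one_error(p_per_dimm_year=0.08, dimms=2, years=5):
    """Chance that at least one of `dimms` DIMMs errors over `years` years,
    assuming independent per-DIMM-per-year odds (a simplification)."""
    return 1 - (1 - p_per_dimm_year) ** (dimms * years)

print(round(p_at_least_one_error(), 3))  # ~0.566 for 2 DIMMs over 5 years
```

Under these assumed numbers, a typical 2-DIMM machine kept for 5 years has better-than-even odds of seeing at least one memory error, which is why a "per year" figure cannot be read as "per forever".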
     
    #114 taltamir, Apr 7, 2012
    Last edited: Apr 7, 2012
  15. kevinsbane

    kevinsbane Senior member

    Joined:
    Jun 16, 2010
    Messages:
    694
    Likes Received:
    0
How much memory is even taken up by user data anyway? I mean, I would think the vast majority of the data in memory is program data, which is much less susceptible to corrupting data being saved to disk. If a critical bit is flipped, the program crashes and nothing is saved. Application data that gets corrupted silently is also almost impossible to propagate, since it is reloaded from main storage (the HDD) on reboot.

So even if 8% of RAM sticks will suffer an error per year (under 24/7 conditions at that), the 1-2 MB of critical user data stored in a consumer's RAM has roughly a 1/1000th chance of that error affecting it. 8% sounds high, but 0.008% per year is pretty darned low for a single-bit error in (arguably) non-critical data (i.e., business data).
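A quick sketch of that arithmetic. The module size and "critical data" footprint are assumptions chosen to match the post's 1/1000th ratio:

```python
# Only the fraction of RAM holding critical user data is exposed to a
# given error. All figures below are assumed for illustration.
ram_bytes = 2 * 1024**3        # a 2 GB module (typical consumer size in 2012)
critical_bytes = 2 * 1024**2   # ~2 MB of critical user data (assumed)
p_dimm_error = 0.08            # 8% per DIMM per year, from the cited paper

p_critical_hit = p_dimm_error * critical_bytes / ram_bytes
print(f"{p_critical_hit:.4%} per year")  # 0.0078% per year
```

The caveat, raised in the replies that follow, is that this treats only direct hits on user data as harmful; a flip in program or OS data can still corrupt user data indirectly as it is processed.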
     
  16. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
Not necessarily - instead of crashing, a corrupted program could corrupt user data as the program processes it.

    Even if the program is reloaded clean later, the damage might already be done.

    Same thing with corruption of the OS image. Messing with that could result in all sorts of permanent wonkiness even if the OS itself is restored next reboot.
     
  17. moriz

    moriz Member

    Joined:
    Mar 11, 2009
    Messages:
    196
    Likes Received:
    0
    some interesting things in here:
    http://www.zdnet.com/blog/storage/dram-error-rates-nightmare-on-dimm-street/638

    specifically, this:

seems like this article is based on the same data. the above suggests that most of the errors are produced by a small subset of machines on a given platform. unless i'm reading it wrong, it appears that many of the reported memory errors were isolated to select machines - or in other words, the 8% per year average does not cover the 80% of machines that are unaffected. it also nicely dovetails with previous points, where the article mentions that the error rate is more motherboard dependent. apparently, your choice of motherboard has a bigger impact on the likelihood of a memory bit flip than the modules themselves.

looks like there's still a lot that needs to be investigated before we can call this for certain. in the meantime, it is too early to accuse intel of creating the equivalent of exploding engines.
     
  18. tynopik

    tynopik Diamond Member

    Joined:
    Aug 10, 2004
    Messages:
    4,002
    Likes Received:
    35
yes, by definition unaffected machines aren't affected by errors. If they were affected, they wouldn't be unaffected . . .

1. even the 'best' platforms still had substantial error rates
2. 'Sub-optimal' motherboards are the reality Intel has to deal with. There is a solution that addresses the problem, and they intentionally refuse to include it. The responsibility still lies with them. (Not to mention the 'deepest pockets' legal doctrine . . .)

    What's to investigate? The memory system as a whole isn't reliable. Whether it's the DIMMs or the motherboard or both, it doesn't matter. It's the real world and Intel needs to deal with it.

    Their repeated failure to do so is a classic case of negligence.

    I would not be surprised at all to see a major class-action lawsuit over this in the next 10 years.
     
    #118 tynopik, Apr 7, 2012
    Last edited: Apr 7, 2012
  19. moriz

    moriz Member

    Joined:
    Mar 11, 2009
    Messages:
    196
    Likes Received:
    0
lay off the snark. what i meant was that machines already exhibiting errors are dramatically more likely to have more (hard) errors, whereas machines that don't are less likely to have hard errors in comparison. what this means is that most of the 8% consists of machines that have already had hard errors in the past; very little of it consists of machines that never had them before. what this implies is that if your machine does not exhibit a high number of hard errors now, it will continue that trend in the future.
     
  20. tuprox

    tuprox Member

    Joined:
    Apr 3, 2012
    Messages:
    63
    Likes Received:
    0
    Wow, talk about a little debate here! Glad I posted and thanks for all the responses!
     
  21. taltamir

    taltamir Lifer

    Joined:
    Mar 21, 2004
    Messages:
    13,575
    Likes Received:
    0
1. It's 8% of DIMMs experiencing errors per year, not 8% of machines. Most machines have 2 DIMMs, some have more.
2. This 8% figure DOES take into account that errors tend to recur in the same DIMMs
     
  22. moriz

    moriz Member

    Joined:
    Mar 11, 2009
    Messages:
    196
    Likes Received:
    0
1. This changes nothing, since the proportion of affected machines is the same. More or less, anyway.
2. And most of those machines probably exhibit obvious symptoms due to bad memory and are subsequently replaced. The point here is that much of that 8% is caused by a small portion of modules, which means the error rate among the rest is much lower. It's a weighted average.
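The weighted-average point can be sketched with made-up numbers, showing how a small error-prone subset can dominate a headline figure:

```python
# Illustrative split (numbers invented, not from the paper): a few
# "bad" modules carry most of the errors behind an ~8% average.
bad_share, bad_rate = 0.02, 0.90     # 2% of modules error almost every year
good_share, good_rate = 0.98, 0.06   # the remaining 98% rarely do

average = bad_share * bad_rate + good_share * good_rate
print(round(average, 4))  # 0.0768 -- close to the headline 8%
```

If something like this split holds, the error rate for a machine that has never shown errors is far below the average, which is the crux of the disagreement in this thread.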
     
  23. Morg.

    Morg. Senior member

    Joined:
    Mar 18, 2011
    Messages:
    242
    Likes Received:
    0
Err, no. You wouldn't even remotely consider an x86 CPU, your RAM wouldn't simply be ECC, etc. - and besides, your server would be in an underground bunker with a much different radiation pattern. But whatever, keep making random arguments.

ECC is a feature. That's it, nothing more or less - no comparison at all with shaders or whatever.

IMHO, the deactivation of features for economic purposes is yet another aberration of the capitalist system, where making money > all. But then, why get pissed at that when there are so many other aberrations with major consequences for important stuff... you know, like life, for example.
     
  24. taltamir

    taltamir Lifer

    Joined:
    Mar 21, 2004
    Messages:
    13,575
    Likes Received:
    0
Actually, it isn't a capitalist feature. Capitalism is a free market, and patents make this very much not a free market.

    http://en.wikipedia.org/wiki/State_monopoly_capitalism
Nvidia, Transmeta, and others wanted to compete with Intel but are legally barred from doing so (properly) by a very much broken and unfair patent system (with IP law being pushed by massive donations from interest groups) that grants Intel a government-mandated monopoly.

In a truly capitalistic society, what Intel is doing wouldn't fly.

Ironically, Marx's solution is even worse than the problem it was meant to fix. [sarcasm]Who would have thought that the problem of monopolies fusing with government to get government-mandated monopolies would get worse if you decree that all businesses must be government owned?[/sarcasm]
     
  25. Morg.

    Morg. Senior member

    Joined:
    Mar 18, 2011
    Messages:
    242
    Likes Received:
    0
Capitalism is a concept that is not and cannot be implemented due to its simplistic model of human behavior - that much is clear.

However, what happens currently is a direct consequence of adopting the idea that everyone pursuing their own profit will result in a valid working system.

An idea that is obviously wrong and leads to such dumb things as breaking a chip you just manufactured, destroying stock to decrease supply, and killing people for dark goo.

My point being: who the f* cares about disabled ECC on Intel chips when the reason behind it has already killed millions and slowed the evolution of the human race to a crawl.