(Although in practice, the ECC is still needed (probably), because dram chips will continue to go bad in time, and without ECC memory, it is difficult to catch the dodgy dram's as they break).
Everything you say in your post is true, but I quoted the above because it touches the central question (whether ECC is needed or not), and I'd like to clarify this part so there can be no doubt about why ECC has become an essential piece of "Enterprise-class" hardware.
And we don't actually have to delve into the specifics of the semiconductors involved to settle this. Instead, we can frame it in terms of information theory, which isn't a stretch, since ECC is itself a product of information theory. Claude Shannon (who founded information theory) put it this way:
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point".
Almost everything that happens in a computer falls under this "problem". Components of the machine pass messages to each other, and depending on which component is talking to which, the channel / medium involved can be more or less reliable / noisy.
For our purposes, DRAM cells can be thought of as one of the less reliable channels, in that there is a distinct possibility that a bit may flip due to uncontrollable external factors. (Old floppy disk drives are far less reliable still, but nobody uses them anymore, so they make a less useful example.) Suppose we have a group of DRAM cells that stores/reads/writes each bit correctly with probability (1 - p), and incorrectly with probability p. What we're doing is framing this as a binary symmetric channel, the simplest channel model there is, and all we really need for this conversation.
It's pretty straightforward: a bit is flipped from 0 to 1 or from 1 to 0 with probability p, and comes through untouched with probability (1 - p).
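To make the binary symmetric channel concrete, here is a minimal Python sketch of it. The value of p is purely illustrative (real DRAM soft-error rates are far lower), and the channel is applied to a random message just to show the effect:

```python
import random

def bsc(bits, p):
    """Pass a list of bits through a binary symmetric channel:
    each bit is flipped independently with probability p."""
    return [b ^ 1 if random.random() < p else b for b in bits]

# Illustrative only: p here is made up, not a real DRAM error rate.
p = 0.01
message = [random.randint(0, 1) for _ in range(10_000)]
received = bsc(message, p)
errors = sum(m != r for m, r in zip(message, received))
print(f"{errors} of {len(message)} bits flipped (~{errors / len(message):.2%})")
```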
Now, this is a problem for everything, not just DRAM, or floppy disks, or hard drives.
An engineering solution to decreasing p, as you have pointed out, will encompass:
1. The use of better / more reliable components, in our specific case, higher quality DRAM cells
2. Stricter QA and/or binning: test each component to a tighter margin, running it above or below the expected industry values, to really weed out the non-herculean components among the bunch.
3. Use of better process and manufacturing technology
4. Other enormous feats of engineering.
All of these end up increasing the cost of the product. (Or, as we say in info-theory, it "raises the cost of communication", which does not necessarily mean just money.) The reality is that an engineering approach, or a "physical solution", will eventually run into diminishing returns (just like everything), so it's not feasible to rely solely on the physical solutions available.
Instead, after sufficient marvelous engineering, we forget about the "physical solutions" and focus on what we can call a "system solution". This is one of the most basic lessons in information theory: how to guarantee reliable communication despite the variability / unreliability / noise of the channel involved. And the system solution information theory offers for the binary symmetric channel is error-correcting codes. I'm not yet talking about ECC specifically as implemented in server-grade buffered DIMMs; I'm just talking about error-correcting codes as a concept.
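As a toy illustration of that concept (not the actual code used in ECC DIMMs, which is more sophisticated), here is a 3x repetition code over the same binary symmetric channel: each bit is sent three times and decoded by majority vote, so any single flip within a triple is corrected. Again, p is just an illustrative number:

```python
import random

def bsc(bits, p):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ 1 if random.random() < p else b for b in bits]

def encode_rep3(bits):
    """Repeat every bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode_rep3(bits):
    """Majority vote over each group of three received bits."""
    return [1 if sum(bits[i:i + 3]) >= 2 else 0 for i in range(0, len(bits), 3)]

p = 0.01  # illustrative channel error probability
message = [random.randint(0, 1) for _ in range(10_000)]
decoded = decode_rep3(bsc(encode_rep3(message), p))
residual = sum(m != d for m, d in zip(message, decoded))
print(f"raw channel error rate ~ {p:.4f}")
print(f"residual error rate    ~ {residual / len(message):.4f}")
```

Even this crude code drops the per-bit error rate from roughly p to roughly 3p^2, because two of the three copies now have to be corrupted for the decoder to get it wrong.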
Think about it: at this point in DRAM production, the tools and technologies involved are not exactly cheap or low-tech. Most (I would say all, but I'm not in a position to say so) engineering solutions to eradicate DRAM-caused soft errors would yield incremental benefits at noticeable and ever-increasing cost to the final product. They aren't worth the trade-off. In contrast, a system solution to the problem, as offered by information theory (specifically, the use of error-correcting codes), costs practically nothing except a computational requirement (some work during encoding, some more during decoding after transmittal) and a modest amount of extra storage for the check bits. Think of the cost-benefit ratio of that one. Depending on the size of p in the first place, an info-theory solution, versus an engineering solution, could single-handedly turn an unreliable channel into a reliable one.
This is why ECC will not go away, and why DRAM manufacturers will not make "more perfect" DRAM. At this point in the game, a purely physical solution is just insanity. Instead, a system solution such as error-correcting codes is better. It is, by leaps and bounds, more practicable to adopt ECC to protect against soft errors than to bet solely on ever more expensive components, or to demand 100% defect-free components that will run defect-free for 5 years no matter the quality of the environment (temps, electricity, load, etc.).
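For a rough sense of the cost-benefit ratio at the DIMM level: typical ECC memory stores 8 check bits alongside each 64 data bits (a 72-bit word), about 12.5% extra capacity, and the usual SECDED code corrects any single-bit error in that word. Assuming (as an idealization) that bits flip independently with some probability p, a quick back-of-the-envelope comparison of word error probabilities looks like this; the value of p below is made up purely for illustration:

```python
from math import comb

def word_error_uncoded(p, data_bits=64):
    """Probability that an unprotected 64-bit word has at least one flipped bit."""
    return 1 - (1 - p) ** data_bits

def word_error_secded(p, word_bits=72):
    """Probability that a SECDED-protected 72-bit word has two or more flips,
    i.e. more errors than a single-error-correcting code can fix."""
    no_error = (1 - p) ** word_bits
    one_error = comb(word_bits, 1) * p * (1 - p) ** (word_bits - 1)
    return 1 - no_error - one_error

p = 1e-6  # illustrative per-bit flip probability, not a measured figure
print(f"unprotected word error : {word_error_uncoded(p):.3e}")
print(f"SECDED word error      : {word_error_secded(p):.3e}")
```

Under those assumptions the uncorrectable word error probability drops by several orders of magnitude, for the price of one extra memory chip per rank and a little logic in the memory controller.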
So it's not just because DRAM will go bad eventually that ECC is needed. It is simply impractical, from an engineering perspective, to guarantee 100% defect-free components and 100% defect-free operation for X number of years. It is far more practical to produce components to a certain (very high) level of quality, and then, once the benefits of even more expensive engineering start tapering off, implement system solutions that complement the already expensive engineering, resulting in a product that achieves practically the same reliability at a far better cost.