(Although in practice, the ECC is still needed (probably), because dram chips will continue to go bad in time, and without ECC memory, it is difficult to catch the dodgy dram's as they break).
Everything you say in your post is true, but I quoted the above because it touches the central question (whether ECC is needed or not), and I'd like to clarify this part so there can be no doubt about why ECC has become an essential piece of "Enterprise-class" hardware.
And we don't actually have to delve into the specifics of the semiconductors involved to settle this. Instead, we can frame it in terms of information theory, which isn't a stretch, since ECC is itself a product of information theory. Claude Shannon (who founded information theory) put it this way:
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point".
Almost everything that happens in a computer falls under this "problem". Components of the machine pass messages to each other, and depending on which component is talking to which, the channel / medium involved can be more or less reliable / noisy.
For our purposes, DRAM cells can be thought of as one of the less reliable channels, in that there is a distinct possibility that a bit may flip due to uncontrollable external factors. (Old floppy disk drives are far less reliable still, but nobody uses them anymore, so they make a less useful example.) Suppose we have a group of DRAM cells that stores/reads/writes each bit correctly with probability (1 - p), and incorrectly with probability p. What we're doing is framing this as a binary symmetric channel, the simplest channel model there is, and all we really need for this conversation.
It's pretty straightforward: a bit is flipped from 0 to 1 or from 1 to 0 with probability p, and comes through untouched with probability (1 - p).
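To make the binary symmetric channel concrete, here is a minimal Python sketch of it. The value of p is purely illustrative (real DRAM soft-error rates are far lower), and the channel is applied to a random message just to show the effect:

```python
import random

def bsc(bits, p):
    """Pass a list of bits through a binary symmetric channel:
    each bit is flipped independently with probability p."""
    return [b ^ 1 if random.random() < p else b for b in bits]

# Illustrative only: p here is made up, not a real DRAM error rate.
p = 0.01
message = [random.randint(0, 1) for _ in range(10_000)]
received = bsc(message, p)
errors = sum(m != r for m, r in zip(message, received))
print(f"{errors} of {len(message)} bits flipped (~{errors / len(message):.2%})")
```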
Now, this is a problem for everything, not just DRAM, or floppy disks, or hard drives.
An engineering solution to decreasing p, as you have pointed out, will encompass:
1. The use of better / more reliable components, in our specific case, higher quality DRAM cells
2. Stricter QA and/or binning: test each component to a tighter margin, running it above or below the expected industry values, to really weed out the non-herculean components among the bunch.
3. Use of better process and manufacturing technology
4. Other enormous feats of engineering.
All of these end up increasing the cost of the product. (Or, as we say in info-theory, it "raises the cost of communication", which does not necessarily mean just money.) The reality is that an engineering approach, or a "physical solution", will eventually run into diminishing returns (just like everything), so it's not feasible to rely solely on the physical solutions available.
Instead, after sufficient marvelous engineering, we forget about the "physical solutions" and focus on what we can call a "system solution". This is one of the most basic lessons in information theory: how to guarantee reliable communication despite the variability / unreliability / noise of the channel involved. And the system solution information theory offers for the binary symmetric channel is error-correcting codes. I'm not yet talking about ECC specifically as implemented in server-grade buffered DIMMs; I'm just talking about error-correcting codes as a concept.
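As a toy illustration of that concept (not the actual code used in ECC DIMMs, which is more sophisticated), here is a 3x repetition code over the same binary symmetric channel: each bit is sent three times and decoded by majority vote, so any single flip within a triple is corrected. Again, p is just an illustrative number:

```python
import random

def bsc(bits, p):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ 1 if random.random() < p else b for b in bits]

def encode_rep3(bits):
    """Repeat every bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode_rep3(bits):
    """Majority vote over each group of three received bits."""
    return [1 if sum(bits[i:i + 3]) >= 2 else 0 for i in range(0, len(bits), 3)]

p = 0.01  # illustrative channel error probability
message = [random.randint(0, 1) for _ in range(10_000)]
decoded = decode_rep3(bsc(encode_rep3(message), p))
residual = sum(m != d for m, d in zip(message, decoded))
print(f"raw channel error rate ~ {p:.4f}")
print(f"residual error rate    ~ {residual / len(message):.4f}")
```

Even this crude code drops the per-bit error rate from roughly p to roughly 3p^2, because two of the three copies now have to be corrupted for the decoder to get it wrong.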
Think about it: at this point in DRAM production, the tools and technologies involved are not exactly cheap or low-tech. Most (I would say all, but I'm not in a position to say so) engineering solutions to eradicate DRAM-caused soft errors would yield incremental benefits at noticeable and ever-increasing cost to the final product. They aren't worth the trade-off. In contrast, a system solution to the problem, as offered by information theory (specifically, the use of error-correcting codes), costs practically nothing except a computational requirement (some work during encoding, some more during decoding after transmittal) and a modest amount of extra storage for the check bits. Think of the cost-benefit ratio of that one. Depending on the size of p in the first place, an info-theory solution, versus an engineering solution, could single-handedly turn an unreliable channel into a reliable one.
This is why ECC will not go away, and why DRAM manufacturers will not make "more perfect" DRAM. At this point in the game, a purely physical solution is just insanity. Instead, a system solution such as error-correcting codes is better. It is, by leaps and bounds, more practicable to adopt ECC to protect against soft errors than to bet solely on ever more expensive components, or to demand 100% defect-free components that will run defect-free for 5 years no matter the quality of the environment (temps, electricity, load, etc.).
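For a rough sense of the cost-benefit ratio at the DIMM level: typical ECC memory stores 8 check bits alongside each 64 data bits (a 72-bit word), about 12.5% extra capacity, and the usual SECDED code corrects any single-bit error in that word. Assuming (as an idealization) that bits flip independently with some probability p, a quick back-of-the-envelope comparison of word error probabilities looks like this; the value of p below is made up purely for illustration:

```python
from math import comb

def word_error_uncoded(p, data_bits=64):
    """Probability that an unprotected 64-bit word has at least one flipped bit."""
    return 1 - (1 - p) ** data_bits

def word_error_secded(p, word_bits=72):
    """Probability that a SECDED-protected 72-bit word has two or more flips,
    i.e. more errors than a single-error-correcting code can fix."""
    no_error = (1 - p) ** word_bits
    one_error = comb(word_bits, 1) * p * (1 - p) ** (word_bits - 1)
    return 1 - no_error - one_error

p = 1e-6  # illustrative per-bit flip probability, not a measured figure
print(f"unprotected word error : {word_error_uncoded(p):.3e}")
print(f"SECDED word error      : {word_error_secded(p):.3e}")
```

Under those assumptions the uncorrectable word error probability drops by several orders of magnitude, for the price of one extra memory chip per rank and a little logic in the memory controller.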
So it's not just because DRAM will go bad eventually that ECC is needed. It is simply impractical, from an engineering perspective, to guarantee 100% defect-free components and 100% defect-free operation for X number of years. It is far more practical to produce components to a certain (very high) level of quality, and then, once the benefits of even more expensive engineering start tapering off, implement system solutions that complement the already expensive engineering, resulting in a product that achieves practically the same reliability at a far better cost.