DRAM Disturbance Error

Alwyn Chan

Junior Member
Apr 6, 2015
5
0
66
NB: Due to formatting limitations of vBulletin, I had no choice but to exclude certain information from this forum thread. Please download the full article (PDF) here (hosted by Google Drive)


Objectives of this article

  1. To increase awareness of a prevalent and insidious but little-known RAM instability issue;
  2. to beseech manufacturers to tighten their quality control as well as their screening and validation process; and
  3. to encourage the use of robust ECC techniques in all memory and storage technologies.

Description of the RAM instability issue
  1. According to a research paper entitled "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors", the charge of a DRAM cell can be lost when a nearby address is repeatedly activated, thereby causing data corruption. To quote,

    "Activating the same row in DRAM corrupts data in nearby rows… We identify the root cause of disturbance errors as the repeated toggling of a DRAM row's wordline, which stresses inter-cell coupling effects that accelerate charge leakage from nearby rows… DRAM disturbance errors are caused by the repeated opening/closing of a row, not by column reads… Disturbance errors can be exploited by a malicious program to breach memory protection…We conclude that the coupling pathway responsible for disturbance errors may be independent of the process variation responsible for weak cells… Sever-grade systems employ ECC modules with extra DRAM chips, incurring a 12.5% capacity overhead. However, even such modules cannot correct multi-bit disturbance errors… Disturbance errors are a general class of reliability problem that afflicts not only DRAM, but also other memory and storage technologies: SRAM, flash, and hard-disk."
  2. This RAM instability can be exposed by running the "Hammer Tests" (test 13 in Memtest86). Read up this "Hammer Test" on Passmark's website, but here's the pertinent paragraph (amended punctuation slightly):

    "The Hammer Test is designed to detect RAM modules that are susceptible to disturbance errors caused by charge leakage. This phenomenon is characterized in the research paper 'Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors by Yoongu Kim et al'. According to the research, a significant number of RAM modules manufactured 2010 or newer are affected by this defect. In simple terms, susceptible RAM modules can be subjected to disturbance errors when repeatedly accessing addresses in the same memory bank but different rows in a short period of time. Errors occur when the repeated access causes charge loss in a memory cell, before the cell contents can be refreshed at the next DRAM refresh interval… This test 'hammers' rows by alternatively reading two addresses in a repeated fashion, then verifying the contents of other addresses for disturbance errors."
  3. After running a variety of stability tests and varying my system configuration, I conclude that my brand-new pair of Corsair Dominator Platinum RAM modules is not 100% stable. While it passes Prime95, IntelBurnTest, the Windows Memory Diagnostic Tool and the traditional Memtest86 tests, it consistently fails the Hammer Test. The details of my system as well as my testing procedure are provided in the link below.
  4. I'm not the only one who is experiencing this problem. See this thread entitled "How to relate to errors in Hammer Test 13?"

    I would like, in particular, to draw your attention to post #13 by the Administrator,

    "Many computers are fundamentally (slightly) unreliable in a random ways. Maybe this doesn't matter for home use, but for medical devices, banking systems, flight control systems, etc.. it is a big deal… Equally worrying is that our algorithm for provoking the problem is probably non optimal. Meaning that with prefect knowledge of the addressing scheme on each CPU, the channels in use and ram timings, etc.. we could probably force even more errors. The current algorithm is fairly general and not targeted at any particular RAM setup or CPU."

    as well as by the OP

    "I am guessing it'll blow up slowly. Like the Samsung 840 EVO, who just went into round two. Everyone can measure the impact on 840 EVO. Still, Samsung are dragging their feet, trying to create firmware/software solutions. With Hammer 13, I could guess a myriad of PR speak: Within 'normalized' specifications... Negligible impact with normal usage... and so on :/ But as you rightly pointed out. There are systems where 1 single unintended bit flip can have a major impact. And you can bet many of them are using normal RAM where ECC would be sensible (cost)."
  5. There's a Wikipedia page on this "Row Hammer" issue as well.
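To make the disturbance mechanism described in points 1 and 2 concrete, here is a toy simulation. It is not an actual hammering program, and every number in it is invented for illustration: each activation of an aggressor row adds a little disturbance to the adjacent victim row, a refresh restores the victim's charge, and a bit flips only if enough activations pile up between two refreshes.

```python
# Toy model of a DRAM disturbance error. FLIP_THRESHOLD is a made-up figure
# for how many aggressor-row activations a victim cell tolerates between
# refreshes; real cells vary widely.
FLIP_THRESHOLD = 139_000

def hammer(total_activations: int, refresh_every: int) -> bool:
    """Repeatedly activate an aggressor row; return True if the adjacent
    victim row accumulates enough disturbance to flip a bit."""
    disturbance = 0
    for i in range(total_activations):
        disturbance += 1                   # each activation leaks a bit of charge
        if disturbance >= FLIP_THRESHOLD:  # margin exhausted before a refresh
            return True
        if (i + 1) % refresh_every == 0:
            disturbance = 0                # refresh restores the victim's charge
    return False

# Frequent refreshes keep the victim safe; slow refreshes let the error occur.
print(hammer(1_000_000, refresh_every=64_000))   # False
print(hammer(1_000_000, refresh_every=200_000))  # True
```

This is of course only a caricature; on real hardware the Memtest86 hammer test alternately reads two addresses in the same bank fast enough that the disturbance outpaces the refresh.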

Please download the full article in the link provided below for the details of my system and test procedure as well as other information.

Download full article (PDF) here (hosted by Google Drive)




Screenshots and pictures

CPU-Z Screenshot:
qn1t11.jpg


Corsair DRAM modules:
s1om52.jpg


Memtest86 result:
2i9tfu8.jpg
 
Last edited:

videogames101

Diamond Member
Aug 24, 2005
6,783
27
91
I would hazard a guess that repeated row accesses on the same bank like that don't ever happen because we have this thing called a cache. Although it's clearly a problem for lower level/mission critical applications.
 
Last edited:

inachu

Platinum Member
Aug 22, 2014
2,387
2
41
If you had talked about flipping bits in the past 10 years, people would have thought you were insane.

It has been noticed happening on hard drives as well. Bit flipping is not unique to just ECC devices. Sometimes heat can increase the amount of bit flipping.
Not all computers use ECC. Gaming computers would fare better without it.
 

Mark R

Diamond Member
Oct 9, 1999
8,513
16
81
If you had talked about flipping bits in the past 10 years, people would have thought you were insane.

This is not just random bit flipping, though.

This problem can result in bits flipping just by reading a memory location enough times. This can result in security problems, by allowing an application running under the privileges of a restricted user account to corrupt memory to which it should not have write access.

This could potentially allow access to OS memory, and therefore permit privilege escalation or data exfiltration from another application.

For example, I could easily imagine an application repeatedly requesting the status of its entries in the page table, until a bit flip occurs in the page table and the application gains write access to system memory.

While this sort of issue isn't likely to occur under normal usage, it could potentially be triggered by malware, which would permit a whole new class of exploits that would be difficult to fix, because they are not software bugs.
 

Ketchup

Elite Member
Sep 1, 2002
14,558
248
106
Here is what caught my eye:

to encourage the use of robust ECC techniques in all memory and storage technologies.

Many computers are fundamentally (slightly) unreliable in a random ways. Maybe this doesn't matter for home use, but for medical devices, banking systems, flight control systems, etc.. it is a big deal…

In fact, these devices where it is a big deal would most definitely be using ECC memory now, and since there are applications where it doesn't matter, I don't think all memory and storage technologies are in need of ECC.

In other words, things are fine the way they are.

Also, this thread probably belongs in Memory and Storage.
 

Alwyn Chan

Junior Member
Apr 6, 2015
5
0
66
I would hazard a guess that repeated row accesses on the same bank like that don't ever happen because we have this thing called a cache. Although it's clearly a problem for lower level/mission critical applications.

Not sure I follow you completely.

Just to be sure we are on the same page, I would like to reiterate that the scholarly journal said, "Disturbance errors are a general class of reliability problem that afflicts not only DRAM, but also other memory and storage technologies: SRAM, flash, and hard-disk." While this post is about dynamic RAM, static RAM (which is used as CPU cache) is also affected by disturbance errors.

It has been noticed happening on hard drives as well. Bit flipping is not unique to just ECC devices. Sometimes heat can increase the amount of bit flipping.
Not all computers use ECC. Gaming computers would fare better without it.

  1. Exactly. Here is what Anandtech has to say (emphasis is mine):

    Anandtech's review of Crucial M550 (March 2014)
    RAIN is similar to SandForce's RAISE and the idea is that you take some NAND space and dedicate that to parity. Almost every manufacturer is doing this at some level nowadays since the NAND error and failure rates are constantly increasing as we move to smaller lithographies.

    Anandtech's review of SanDisk Ultra II (September 2014)

    Using parity as a form of error correction has become more and more popular in the industry lately. SandForce made the first move with RAISE several years ago and nearly every manufacturer has released their own implementation since then.

    ...

    Furthermore, all NAND die have what are called spare bytes, which are additional bytes meant for ECC. For instance Micron's 20nm MLC NAND has an actual page size of 17,600 bytes (16,384 user space + 1,216 spare bytes), so in reality a 128Gbit die is never truly 128Gbit – there is always a bit more for ECC and bad block management. The number of spare bytes has grown as the industry has moved to smaller process nodes because the need for ECC has increased and so has the number of bad blocks. TLC is just one level worse because it is less reliable by its design, hence more spare bytes are needed to make it usable in SSDs.
  2. Indeed, temperature does play a role in this, but not as significantly as one might think. In the full article (provided as a downloadable link in my first post), I said that "the academic journal concluded that disturbance errors are not strongly influenced by temperature."

This is not just random bit flipping, though.

This problem can result in bits flipping just by reading a memory location enough times. This can result in security problems, by allowing an application running under the privileges of a restricted user account to corrupt memory to which it should not have write access.

This could potentially allow access to OS memory, and therefore permit privilege escalation or data exfiltration from another application.

For example, I could easily imagine an application repeatedly requesting the status of its entries in the page table, until a bit flip occurs in the page table and the application gains write access to system memory.

While this sort of issue isn't likely to occur under normal usage, it could potentially be triggered by malware, which would permit a whole new class of exploits that would be difficult to fix, because they are not software bugs.

Thank you for highlighting this potential security vulnerability.

In fact, the Project Zero team at Google has published a blog post on how to exploit this vulnerability, entitled "Exploiting the DRAM rowhammer bug to gain kernel privileges." Thought you would be interested to read it.

...In fact, these devices where it is a big deal would most definitely be using ECC memory now, and since there are applications where it doesn't matter, I don't think all memory and storage technologies are in need of ECC.

In other words, things are fine the way there are.

Also, this thread probably belongs in Memory and Storage.

  1. As the journal mentioned, "Server-grade systems employ ECC modules with extra DRAM chips, incurring a 12.5% capacity overhead. However, even such modules cannot correct multi-bit disturbance errors." This is why I said we need more robust ECC techniques. ECC as it is currently implemented is not a panacea.
  2. With due respect, SSDs have already implemented some form of ECC as the aforementioned Anandtech articles have pointed out, so the issue is more severe than you might first think.
 
Last edited:

videogames101

Diamond Member
Aug 24, 2005
6,783
27
91
Not sure I follow you completely.
Just to be sure we are on the same page, I would like to reiterate that the scholarly journal said, "Disturbance errors are a general class of reliability problem that afflicts not only DRAM, but also other memory and storage technologies: SRAM, flash, and hard-disk." While this post is about dynamic RAM, static RAM (which is used as CPU cache) is also affected by disturbance errors.

My point is that in a system where memory accesses are performed by a CPU with a cache structure and a virtual memory system, this sort of "row hammer" access pattern occurring on the DRAM side of the cache is unlikely. (This is just my own intuition.)

As for the cache SRAM - I don't see any evidence of this "rowhammer" access pattern in particular causing problems with SRAM. In reading the article you linked, I looked at the cited papers regarding SRAM and they seemed to pertain to writability and read stability problems because of process variation on modern nodes. This is not the same as these disturbance errors you see across rows in DRAM. Nor would such an error make sense given the nature of SRAM cells having feedback paths which retain state. The writability vs. read stability balance problems occur on the row you are specifically writing to/reading from. That being said, "Disturbance errors" seems like a very general class of errors which I'm sure you could look at in the context of SRAM, it's just not going to be the same kind of thing as you're seeing with DRAM.

So to reiterate, my overall point is that the "row hammer" access pattern you are running with the Memtest86 test likely won't happen in normal system operation. Obviously that's not exactly a fix, but I'm not worried about these errors happening tomorrow in my home PC. Maliciously exploiting these kinds of bit flips seems challenging, but potentially doable. To me, that's a good enough reason for someone to fix this.

It seems like the industry is trying to deal with the problem, as noted in the new DDR spec and Micron docs (1) and in some Intel documents on DDR3 Xeons (2):

1. JEDEC's DDR4 puts the onus on DDR device and controller manufacturers to make sure victim rows are refreshed within shorter time spans using new targeted row refresh modes, which seems like a reasonable solution. Take a look at Micron's documentation. It seems like they do some sort of device binning and write the tMAC[1] into the device. Here is another interesting DDR4 datasheet mentioning this TRR (Target Row Refresh) mode: Micron DDR4. Maybe you can try this same sort of test on an X99 system and see if these DDR4 modules really do take care of these kinds of disturbance errors, as Micron seems to be claiming.

2. Intel's DDR3 Xeons since Ivy Bridge have included contingencies for avoiding these kinds of errors. Look at page 13. Their memory controller uses their own "pTRR" commands, which only work with specific "pTRR" DDR3 modules (as far as I can tell this isn't in any JEDEC standard). If you use non-pTRR modules, Intel's memory controller halves the normal refresh time to mitigate these row hammer problems. It'd be cool to test this row hammer pattern on Xeons as well to see how well Intel's DDR3 fix works.

[1]The maximum activate count (tMAC) is the maximum number of activates that a single row can sustain within a time interval of equal to or less than the maximum activate window (tMAW) before the adjacent rows need to be refreshed, regardless of how the activates are distributed over tMAW.
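As a rough sketch of what tMAC/tMAW enforcement plus a targeted row refresh might look like, here is a hypothetical model (my own invention, not Micron's or JEDEC's actual controller logic; the class name and the toy TMAC/TMAW values are made up): the controller tracks activate timestamps per row inside a sliding tMAW window, and once a row exceeds its tMAC budget, the two adjacent victim rows get refreshed.

```python
# Hypothetical sketch of tMAC/tMAW enforcement with a targeted row refresh.
# TMAC and TMAW are toy values; real devices specify far larger numbers.
from collections import defaultdict, deque

TMAC = 4    # toy maximum activate count per window
TMAW = 10   # toy maximum activate window, in arbitrary time units

class RowActivationTracker:
    def __init__(self):
        self.history = defaultdict(deque)  # row -> activate timestamps
        self.refreshed_victims = []        # rows given a targeted refresh

    def activate(self, row: int, now: int) -> None:
        h = self.history[row]
        h.append(now)
        while h and now - h[0] >= TMAW:    # forget activates outside the window
            h.popleft()
        if len(h) > TMAC:                  # row exceeded its activate budget
            self.refreshed_victims += [row - 1, row + 1]
            h.clear()                      # victims refreshed; start counting anew

tracker = RowActivationTracker()
for t in range(6):                         # hammer row 7 six times in a row
    tracker.activate(7, now=t)
print(tracker.refreshed_victims)           # [6, 8]
```

A row activated slowly (spread out over more than a tMAW window) never triggers the refresh, which is the whole point: only hammer-like access patterns pay any cost.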

Thanks for bringing this up OP, it's not something I've seen before and now I'm very interested in what the hell is going to actually cause accelerated leakage from adjacent rows. I'm going to go look at DRAM structures again and try to imagine.
 
Last edited:

inachu

Platinum Member
Aug 22, 2014
2,387
2
41
Bit flipping will be solved, or should be solved, once we move to new optical storage formats that use light instead of electricity and magnetism.

I'm sure that will be at least 30-50 years from now.
 
May 11, 2008
21,714
1,302
126
I was wondering:
Is there not also a small cache inside the memory controller, to prevent this continuous row access? How big is a row with current DRAM tech, say on 4096MB sticks?
I remember reading in the past that Intel used to have a small cache in the northbridge chips, back when the memory controller was still a separate chip and not part of the CPU as it is today.

OP, looking at the memtest results, I find it very interesting. When I have time I will do the same hammer testing myself.

EDIT:
Reading a row is a destructive process; the row data is written back. Each bit cell is a tiny capacitor (simplified view) in the picofarad-to-femtofarad range, and it discharges rapidly. I can imagine that continuously reading and writing a row causes the discharging to increase (perhaps by tunneling electrons? I do not know). Then the charge in the capacitor of the adjacent row is too low and is read wrongly by the row amplifier when the next refresh period starts. That makes sense to me, because reducing the time between refreshes seems to reduce the problem.
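That charge-loss picture can be put into rough numbers by treating the cell as a capacitor discharging through a leakage resistance, V(t) = V0 · exp(-t/(R·C)), with disturbance modelled as an effectively lower leakage resistance. Every value below is invented purely to illustrate the shape of the effect; none of them come from a datasheet.

```python
# Illustrative only: a DRAM cell as a capacitor discharging through a leakage
# resistance, V(t) = V0 * exp(-t / (R*C)). All values are made up to show the
# effect; they are not taken from any device documentation.
import math

C = 30e-15        # ~30 fF storage capacitance (order of magnitude)
V0 = 1.2          # freshly written "1" level, volts
V_SENSE = 0.6     # sense-amplifier decision threshold

def reads_as_one(t: float, r_leak: float) -> bool:
    return V0 * math.exp(-t / (r_leak * C)) > V_SENSE

T_REFRESH = 64e-3  # the standard 64 ms DRAM refresh period

print(reads_as_one(T_REFRESH, r_leak=1e15))  # True: healthy cell survives 64 ms
print(reads_as_one(T_REFRESH, r_leak=1e12))  # False: "hammered" cell leaks too fast
```

Shortening the refresh interval helps exactly because it shrinks t in the exponent, which matches the observation that reducing the time between refreshes reduces the errors.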
 
Last edited:
May 11, 2008
21,714
1,302
126
I have done some reading about it, and this is what I think is going on. The capacitor in a DRAM cell is really a pn junction connected in reverse, meaning the voltage is applied with polarity reversed. This causes the p and n junctions to increase in width (by a few atoms), and a charge can be held here just as with a capacitor with two conducting plates and an insulating dielectric in between. There is, however, an amount of current leakage: the reverse current.

What I think happens is that when the row voltage is applied and a row is read out, the electric field in the adjacent rows causes this reverse current to increase by means of tunneling electrons, causing the small amount of charge present in the "capacitor" to decrease. This happens all the time, but since DRAM rows are normally not accessed as in the row hammer test, there is nothing to worry about. From the provided links, the target row can be made more susceptible by using the adjacent rows on both sides of the target row: because a voltage (and thus an electric field) is applied to the neighbouring rows one after the other, it is as if that field were present all the time. An electric field does not simply stop at one row; it influences everything around it, it just loses strength with, if I remember correctly, the third power of distance.

How do we know which bit will be susceptible? We do not. But as it turns out, by chance, some DRAM cells have small defects in the crystal or contamination from one of the dopants. This causes the reverse current to increase a bit, meaning that when a charge is stored on all DRAM cells of a row, one of those cells will lose its charge more quickly. Normally, as long as this is within specification and the DRAM cells are refreshed within the specified timings, there is nothing to worry about with normal memory access patterns; you would never know. One could say, in a sense, that the flipping bits are weak DRAM cells.

This is just a simplified idea; there are also other forms of leakage in a DRAM cell that cause the charge to decrease.
 
Last edited:

Alwyn Chan

Junior Member
Apr 6, 2015
5
0
66
... Take a look at micron's documentation. Seems like they do some sort of device binning and write the tMAC[1] into the device. Here is another interesting DDR4 datasheet mentioning this TRR (Target Row Refresh) mode: Micron DDR4.

...

2. Intel's DDR3 Xeon's since Ivy Bridge have included contingencies for avoiding these kinds of errors. Look at page 13...

Thank you, videogames101, for the extremely detailed reply and for furthering this discussion with the above articles. I wanted to PM you earlier, but I was not allowed to...

It is very heartening to know that Micron and Intel have added safeguards to protect against these types of errors. These two companies have earned my respect for writing and publishing such extensive documentation. My next RAM purchase will most likely be Crucial.

Do you have any friends with Xeon chips to run Memtest86? If so, please publish the results here!

I actually wanted to build a Xeon machine but availability is sadly pretty much non-existent in my country (Singapore).

OP, looking at the memtest results, i find it very interesting. When i have time i will do same hammer testing myself.

One month has passed! :biggrin:

Anyway, here's a 3 May 15 Update:

Finally achieved stability

I finally managed to determine a RAM setting that can consistently pass the Memtest86 row hammer test. As you can see from this image, I ran just test 13 for 5 hours and 40 minutes. It passed all 14 tests this time.

[Memtest86 result screenshot]


I had to reduce the DRAM Refresh Interval drastically, from 5200 to 3000, to achieve 100% Memtest86 stability. This represents a roughly 42% decrease in the time elapsed before the next refresh command is given.

The duration of each refresh cycle (measured by the "Refresh Cycle Time"), on the other hand, seemed to have little impact on stability. I say "seemed to" because increasing this value didn't eradicate the errors, and decreasing it didn't harm stability either. But without conducting a rigorous and VERY time-consuming experiment, I cannot be sure.
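For what it's worth, here is the arithmetic of the refresh-interval change, using the two BIOS values from above (the units are whatever the BIOS uses for "DRAM Refresh Interval", typically memory-clock cycles):

```python
# Quick sanity check on the refresh-interval change described above.
old_interval = 5200   # BIOS "DRAM Refresh Interval" before
new_interval = 3000   # value needed to pass the hammer test

reduction = (old_interval - new_interval) / old_interval
frequency_factor = old_interval / new_interval  # refreshes issued this much more often

print(f"{reduction:.1%}")          # 42.3%
print(f"{frequency_factor:.2f}x")  # 1.73x
```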

Digging deeper into the performance impact

So the next question you might have is this (at least I did): what is the performance penalty incurred from having to almost double the refresh frequency of the RAM?

To investigate, I ran 2 synthetic benchmarks: IntelBurnTest v2.54 by AgentGod and Realbench v2.4 by ASUS. I chose these 2 tests because a) they both produce a result which can easily be compared (unlike the Prime95 benchmark), and b) IntelBurnTest should be able to tease out any performance differences given how compute-intensive it is, while Realbench more accurately demonstrates the performance impact on everyday tasks. I may consider running PCMark 8 if there's a request for it.

For IntelBurnTest, I ran the test 10 times and set the RAM utilisation to 4096MB. I then took the most often occurring GFlops figure (i.e., the mode) and rounded it to 1 decimal place.
For Realbench, I ran the benchmark 5 times. Again, I took the most often occurring System Score and rounded it to 3 significant figures.
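Taking the modal score can be sketched like this (the run values below are invented purely to demonstrate the method, not my actual results):

```python
# Pick the most frequently occurring benchmark result (the mode), as described
# above. These GFlops figures are made up for illustration only.
from statistics import mode

gflops_runs = [112.6, 112.6, 112.4, 112.6, 113.1,
               112.6, 112.6, 112.3, 112.6, 112.6]
print(round(mode(gflops_runs), 1))  # 112.6
```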

Here are the results of the test, after decreasing the refresh interval from 5200 to 3000.

IntelBurnTest: 112.6 to 113.1 (not sure what went wrong here)
Realbench: 74100 to 73500

Conclusion: No performance difference can be observed from these tests, which is good news because it means I achieved stability seemingly without having to sacrifice performance.
 

Alwyn Chan

Junior Member
Apr 6, 2015
5
0
66
I stand corrected regarding the number of FFT lengths tested in Prime95 when the lower-bound FFT size is set to 8K and the upper bound to 4096K, as in the case of blend mode.

After running Prime95 v28.5 for more than 24 hours, I conclude that there are most probably 83 different FFT lengths. I say "most probably" because the results.txt file seems to delete historical records, as if it adopts a "first in first out" housekeeping system. So there might be FFT lengths which have been excluded.

Note also that this number 83 is probably only true of certain versions of Prime95. I say this because the update history of Prime95 states that more FFT sizes/lengths were added in some versions, indicating that newer versions of Prime95 might contain even more FFT lengths.

Here are the 83 FFT lengths tested:

Code:
8
10
12
15
16
18
20
21
24
25
28
32
35
36
40
48
50
60
64
72
80
84
96
100
112
120
128
140
144
160
168
192
200
224
240
256
288
320
336
384
400
448
480
512
560
576
640
672
720
768
800
864
896
960
1024
1120
1152
1200
1280
1344
1440
1536
1600
1680
1728
1792
1920
2048
2240
2304
2400
2560
2688
2800
2880
3072
3200
3360
3456
3584
3840
4000
4096
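One sanity check I can offer on the list above: FFT implementations typically favor lengths built from small radices, so each length should factor into the primes 2, 3, 5 and 7. Assuming I have transcribed the output correctly, all 83 of them do:

```python
# Check that every FFT length in the list factors into the small primes
# 2, 3, 5, 7 - the radices efficient FFT kernels are built around.
FFT_LENGTHS = [
    8, 10, 12, 15, 16, 18, 20, 21, 24, 25, 28, 32, 35, 36, 40, 48, 50,
    60, 64, 72, 80, 84, 96, 100, 112, 120, 128, 140, 144, 160, 168, 192,
    200, 224, 240, 256, 288, 320, 336, 384, 400, 448, 480, 512, 560, 576,
    640, 672, 720, 768, 800, 864, 896, 960, 1024, 1120, 1152, 1200, 1280,
    1344, 1440, 1536, 1600, 1680, 1728, 1792, 1920, 2048, 2240, 2304,
    2400, 2560, 2688, 2800, 2880, 3072, 3200, 3360, 3456, 3584, 3840,
    4000, 4096,
]

def is_smooth(n: int, primes=(2, 3, 5, 7)) -> bool:
    """Return True if n is a product of only the given small primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

assert len(FFT_LENGTHS) == 83
assert all(is_smooth(n) for n in FFT_LENGTHS)
```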

Update: Good news for us, bad news for corsair.

I purchased a new set of G.Skill RipjawsX F3-2133C10D-16GXM modules for my older Ivy Bridge build (because the Corsair modules gave problems). I ran these G.Skill modules at XMP without altering the refresh frequency, and they did not give any problems with Memtest86.

Kudos to G.Skill. Shame on Corsair.

While my sample size is obviously too small to draw any conclusions, I'm going with G.Skill in future. :colbert:
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
This thread got me thinking, so I downloaded the latest Memtest86 and ran the hammer test.

First pass: no errors.
Second pass: 1 bit error halfway through.
Third pass: 1 bit error at the same address.
~3 hrs total runtime with 4x4GB of memory.

I exited and saved the report
http://sterlingdesktops.com/MemTest86-Report-20150713-184133.html

I'll experiment some more; this system has been very stable in every other test and use for a long time at my current settings.
Going to loosen up the secondaries and see if I can get it to pass 3 or more runs with zero errors.

Thanks for mentioning this hammer test :)
Also, the UEFI Memtest looks nice, although it only uses 1 CPU core for the tests on my system.
 
Last edited:

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
I tested further and figured it all out.

I ran the test with relaxed timings and still got the same result: a single bad bit at the same address and only on tests after the first run.

I then started swapping my memory modules with an identical backup set I had in another system and determined that one was bad.
A single bad bit on one module; the problem is solved now, and I passed 3 tests with no errors.

I'm going to further examine this "bad" stick in another system to see if I can trigger the error there and possibly get it to pass.
I do a lot of reading, and tbh I've noticed rare glitches, maybe once or twice a week, where a single character would be wrong on a page; if I put the mouse over it, it would change to what it was supposed to be. Maybe that issue will go away.

This hammer test is very useful. Perhaps the most useful.
 
Last edited:

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
Memtest only supports 1 active CPU on my system; I can't get it to run in multi-threaded (MT) mode.
Anyone else have this issue?
 

Alwyn Chan

Junior Member
Apr 6, 2015
5
0
66
Thanks for posting your memtest results, soulkeeper. Hopefully more people can run this "row hammer test" and post their results here. I wouldn't be surprised if this is a prevalent problem.

If you are only getting errors on test 13, then lowering the "DRAM Refresh Interval" instead of loosening the secondary/tertiary timings might yield better results.

Can't really help you with running mt mode though. Perhaps try a BIOS update?
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
Thanks for posting your memtest results, soulkeeper. Hopefully more people can run this "row hammer test" and post their results here. I wouldn't be surprised if this is a prevalent problem.

If you are only getting errors on test 13, then lowering the "DRAM Refresh Interval" instead of loosening the secondary/tertiary timings might yield better results.

Can't really help you with running mt mode though. Perhaps try a BIOS update?

I've got 6 identical memory modules; only this 1 stick popped up with issues on test 13.
"DRAM Refresh Interval" is not an option in the BIOS of any of my AMD systems, but tRFC (recovery delay) was one of the first secondaries I tried changing. Oh well.
It just takes forever waiting for the test to run after every setting change (over 20 hrs of runtime so far trying different settings).
Guess I'll throw it into the scrap bin soon.

Yeah, I think the BIOS is at fault. It was one of the first EFI BIOS boards out the door, so they likely didn't refine it much.
I've got some Mushkin modules I can try with test 13 as well if I get time.

Thanks for this thread; I'd have gone on using that faulty stick and blaming software.
 
Last edited: