Lead Image © Igor Stevanovic, 123RF.com

Error-correcting code memory keeps single-bit errors at bay

Errant Bits

Article from ADMIN 17/2013
System memory is extremely important to your applications, which is why many systems use error-correcting code (ECC) memory. ECC memory can typically detect and correct single-bit memory errors, and Linux has a reporting capability that collects this information.

Data protection and checking occurs in various places throughout a system. Some of it happens in hardware and some of it happens in software. The goal is to ensure that data is not corrupted (changed), either coming from or going to the hardware or in the software stack. One key technology is ECC memory (error-correcting code memory) [1].

The standard ECC memory used in systems today can detect and correct what are called single-bit errors, and although it can detect double-bit errors, it cannot correct them. A simple flip of one bit in a byte can make a drastic difference in the value of the byte. For example, a byte (8 bits) with a value of 156 (10011100) that is read from a file on disk suddenly acquires a value of 220 if the second bit from the left is flipped from a 0 to a 1 (11011100) for some reason.

ECC memory can detect the problem and correct it without the user ever noticing. Note, however, that only one bit in the byte was changed and then corrected. If two bits change – perhaps both the second and seventh from the left – the byte is now 11011110 (i.e., 222). Typical ECC memory can detect that this "double-bit" error occurred, but it cannot correct it. In fact, when a double-bit error happens, the memory should raise what is called a machine check exception (MCE), which normally causes the system to crash.
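
To see the effect of a flipped bit for yourself, you can XOR the byte with a mask that has a 1 in each flipped position. This shell sketch simply reproduces the arithmetic above; the masks are purely illustrative:

echo $(( 2#10011100 ))               # 156, the original byte
echo $(( 2#10011100 ^ 2#01000000 ))  # 220, second bit from the left flipped
echo $(( 2#10011100 ^ 2#01000010 ))  # 222, second and seventh bits flipped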

After all, you are using ECC memory, so ensuring that the data is correct is important; if an uncorrectable memory error occurs, you would probably want the system to stop. Bit flips usually originate from some sort of electrical or magnetic interference inside the system.

This interference can cause a bit to flip at seemingly random times, depending on the circumstances. According to a Wikipedia article [1] and a paper on single-event upsets in RAM [2], most single-bit flips are the result of background radiation – primarily neutrons from cosmic rays.

The same Wikipedia article said that the error rates reported from 2007 to 2009 varied all over the map, ranging from 10^-10 errors/bit-hr to 10^-17 errors/bit-hr (seven orders of magnitude difference). The first rate works out to roughly one error per gigabyte of memory per hour; the second to roughly one error per gigabyte of memory every 1,000 years.
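
As a rough sanity check of those figures, you can do the conversion on the command line with bc, assuming about 8x10^9 bits per gigabyte and 8,760 hours per year (the formulas are my own back-of-the-envelope arithmetic, not vendor data):

login2$ echo "10^-10 * 8 * 10^9" | bc -l               # ~0.8 errors per GB per hour at the high rate
login2$ echo "1 / (10^-17 * 8 * 10^9 * 8760)" | bc -l  # on the order of 1,000 years per error per GB at the low rate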

The observations in a study by Google (see the "Correctable Errors" box) indicate that in real-world production, you see much higher error rates than what manufacturers are reporting. Moreover, the rate of correctable errors can be an important factor in watching for memory failure. Consequently, I think monitoring and capturing the correctable error information is very important.

Correctable Errors

A study [3] of real memory errors took place at Google. During their investigations they found that one third of the machines and more than eight percent of the DIMMs saw correctable errors per year. This translates to Google's experiencing about 25,000-75,000 correctable errors (CE) per billion device hours per megabit, which translates to 2,000-6,000 CE/GB-yr (or about 250-750 CE/Gb-yr). This is much higher than the previously reported "high" correctable error rate of 1 CE/Gb-yr (250-750 times higher) and six orders of magnitude higher than the optimistic report.

The study went on to report some other interesting results that bear repeating here.

+ Memory errors are strongly correlated.

  • There is a strong correlation among correctable errors within the same DIMM. A DIMM that has a correctable error is 13-228 times more likely to see another in the same month.
  • An uncorrectable error is preceded by a correctable error 70-80 percent of the time. A correctable error increases the probability of an uncorrectable error by factors of 9-400.
  • The probability of an uncorrectable error following a correctable error is still small, at 0.1%-2.3% per year.

+ The incidence of correctable errors increases with age, but the incidence of uncorrectable errors decreases with age.

  • The increasing incidence of correctable errors sets in after about 10-18 months.
  • The most likely reason the incidence of uncorrectable errors decreases is that DIMMs with a large number of correctable errors are replaced, which removes the DIMMs most likely to produce uncorrectable errors.

+ There is no evidence that newer generation DIMMs have worse behavior (this study was published in 2009).

+ Temperature had a surprisingly low effect on memory errors (over the temperature range tested).

+ Error rates are strongly correlated with utilization.

+ Error rates are unlikely to be dominated by software errors.

Linux and Memory Errors

When I worked for Linux Networx years ago, they were helping with a project called bluesmoke [4]. The idea was to have a kernel module that could catch and report hardware-related errors within the system. This goes beyond just memory errors and includes hardware errors in the cache, DMA engines, fabric switching, thermal throttling, the HyperTransport bus, and so on. The formal name of the project was EDAC [5], Error Detection and Correction.

For many years, people wrote EDAC kernel modules for various chipsets so they could capture hardware-related error information and report it. Initially this work was done outside the mainline kernel, but starting with kernel 2.6.16 (released March 20, 2006), EDAC was included in the kernel. Starting with kernel 2.6.18, EDAC showed up in the /sys filesystem, typically under /sys/devices/system/edac.

One of the best sources of information about EDAC can be found at the EDAC wiki [6]. The page discusses how to get started and is also a good location for EDAC resources (bugs, FAQs, mailing list, etc.).

Rather than focusing on getting EDAC working, however, I want to focus on what information it can provide and why it is important. I'll be using a Dell PowerEdge R720 as an example system. It has two processors (Intel E5-2600 series) and 128GB of ECC memory. It was running CentOS 6.2 during the tests.

For the test system, I checked with lsmod to see whether any EDAC modules were loaded:

login2$ /sbin/lsmod
...
sb_edac                12898  0
edac_core              46773  3 sb_edac
...

EDAC was loaded as a module, so I examined the directory /sys/devices/system/edac:

login2$ ls -s /sys/devices/system/edac/
total 0
0 mc

Because I can only see the mc devices, EDAC is only monitoring the memory controller(s). If I probe a little further,

login2$ ls -s /sys/devices/system/edac/mc
total 0
0 mc0  0 mc1

I find two EDAC components for this system, mc0 and mc1. Listing 1 shows what I see when I peer into mc0. The two memory controllers for this system control a number of DIMMs, which are laid out in a "chip-select" row (csrow) and channel (chX) table (see the EDAC documentation for more details [7]). You can have multiple csrow values and multiple channels. For example, Listing 2 shows a simple ASCII sketch of four csrows and two channels.

Listing 1

Content of a Memory Controller

login2$ ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 csrow1  0 csrow4  0 csrow7   0 reset_counters       0 size_mb
0 ce_noinfo_count  0 csrow2  0 csrow5  0 device   0 sdram_scrub_rate     0 ue_count
0 csrow0           0 csrow3  0 csrow6  0 mc_name  0 seconds_since_reset  0 ue_noinfo_count

Listing 2

csrows and Channels

         Channel 0  Channel 1
==============================
csrow0  | DIMM_A0  | DIMM_B0 |
csrow1  | DIMM_A0  | DIMM_B0 |
==============================
==============================
csrow2  | DIMM_A1  | DIMM_B1 |
csrow3  | DIMM_A1  | DIMM_B1 |
==============================

The number of csrows depends on the electrical loading of a given motherboard and the memory controller and DIMM characteristics.

An Example

For this example node, each memory controller has eight csrows and one channel table. You can get an idea of the layout by looking at the entries for csrowX (X = 0 to 7) in Listing 3.

Listing 3

Memory Controller Layout

login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label
CPU_SrcID#0_Channel#0_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow1/ch0_dimm_label
CPU_SrcID#0_Channel#0_DIMM#1
login2$ more /sys/devices/system/edac/mc/mc0/csrow2/ch0_dimm_label
CPU_SrcID#0_Channel#1_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow3/ch0_dimm_label
CPU_SrcID#0_Channel#1_DIMM#1
login2$ more /sys/devices/system/edac/mc/mc0/csrow4/ch0_dimm_label
CPU_SrcID#0_Channel#2_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow5/ch0_dimm_label
CPU_SrcID#0_Channel#2_DIMM#1
login2$ more /sys/devices/system/edac/mc/mc0/csrow6/ch0_dimm_label
CPU_SrcID#0_Channel#3_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow7/ch0_dimm_label
CPU_SrcID#0_Channel#3_DIMM#1

This information shows four memory channels (0-3) and two DIMMs for each channel (0 and 1). Note that each csrow subdirectory for each memory controller has several EDAC control and attribute files for that csrow.
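
Instead of reading each label file individually, a short loop produces the same map in one pass (just a convenience sketch; the paths match the example system above):

login2$ for d in /sys/devices/system/edac/mc/mc0/csrow*; do echo "$(basename $d): $(cat $d/ch0_dimm_label)"; done
csrow0: CPU_SrcID#0_Channel#0_DIMM#0
csrow1: CPU_SrcID#0_Channel#0_DIMM#1
...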

For example, the output for mc0/csrow0 (Listing 4) shows that all are files (no further subdirectories). The definition of each file is:

Listing 4

mc0/csrow0

login2$ ls -s /sys/devices/system/edac/mc/mc0/csrow0
total 0
0 ce_count      0 ch0_dimm_label  0 edac_mode  0 size_mb
0 ch0_ce_count  0 dev_type        0 mem_type   0 ue_count
  • ce_count: The total count of correctable errors that have occurred on this csrow (attribute file).
  • ch0_ce_count: The total count of correctable errors on this DIMM in channel 0 (attribute file).
  • ch0_dimm_label: The control file that labels this DIMM. This can be very useful for panic events to isolate the cause of the uncorrectable error. Note that DIMM labels must be assigned after booting with information that correctly identifies the physical slot with the silk screen label on the board itself.
  • dev_type: An attribute file that will display the type of DRAM device being used on this DIMM. Typically this is x1, x2, x4, or x8.
  • edac_mode: An attribute file that displays the type of error detection and correction being utilized.
  • mem_type: An attribute file that displays the type of memory currently on a csrow.
  • size_mb: An attribute file that contains the size (MB) of memory a csrow contains.
  • ue_count: An attribute file that contains the total number of uncorrectable errors that have occurred on a csrow.

For the sample system, the values for the attribute and control files are shown in Listing 5. This particular csrow has an 8GB DDR3 unbuffered DIMM with no correctable or uncorrectable errors.

Listing 5

Sample System csrow0 Values

login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ce_count
0
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count
0
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label
CPU_SrcID#0_Channel#0_DIMM#0
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/dev_type
x8
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/edac_mode
S4ECD4ED
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/mem_type
Unbuffered-DDR3
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/size_mb
8192
login2$ more /sys/devices/system/edac/mc/mc0/csrow0/ue_count
0

Some attribute files in /sys/devices/system/edac/mc/mc0/ can be very useful (Listing 6). As with the csrows, note the control and attribute files; they are similar to those for the csrows but apply to the entire memory controller:

Listing 6

Attribute Files for mc0

login2$ ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 csrow1  0 csrow4  0 csrow7   0 reset_counters       0 size_mb
0 ce_noinfo_count  0 csrow2  0 csrow5  0 device   0 sdram_scrub_rate     0 ue_count
0 csrow0           0 csrow3  0 csrow6  0 mc_name  0 seconds_since_reset  0 ue_noinfo_count
  • ce_count: The total count of correctable errors that have occurred on this memory controller (attribute file).
  • ce_noinfo_count: The total count of correctable errors on this memory controller, but with no information as to which DIMM slot is experiencing errors (attribute file).
  • mc_name: The type of memory controller being utilized (attribute file).
  • reset_counters: A write-only control file that zeroes out all of the statistical counters for correctable and uncorrectable errors on this memory controller and resets the timer indicating how long it has been since the last reset. The basic command is
echo <anything> > /sys/devices/system/edac/mc/mc0/reset_counters

where <anything> is literally anything (just use a 0 to make things easy).

  • sdram_scrub_rate: An attribute file that controls memory scrubbing. The scrubbing rate is set by writing a minimum bandwidth in bytes per second to the attribute file; the driver translates this to an internal value that provides at least the requested rate. If the configuration fails or memory scrubbing is not implemented, the value of the attribute file will be -1.
  • seconds_since_reset: An attribute file that displays how many seconds have elapsed since the last counter reset. Combined with the error counters, it can be used to measure error rates (see the short sketch after this list).
  • size_mb: An attribute file that contains the size (MB) of memory that this memory controller manages.
  • ue_count: An attribute file that contains the total number of uncorrectable errors that have occurred on this memory controller.
  • ue_noinfo_count: The total count of uncorrectable errors on this memory controller, but with no information as to which DIMM slot is experiencing errors (attribute file).
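
For example, a rough correctable-error rate for a memory controller can be computed from ce_count and seconds_since_reset. This is only a quick sketch; it assumes the counters have not been reset since boot:

login2$ ce=$(cat /sys/devices/system/edac/mc/mc0/ce_count)
login2$ secs=$(cat /sys/devices/system/edac/mc/mc0/seconds_since_reset)
login2$ echo "$ce $secs" | awk '{ printf "%.6f CE per hour\n", $1 / ($2 / 3600) }'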

For the sample system, the values for the attribute and control files are shown in Listing 7. Notice that I can't read the reset_counters file because it is just a control file for resetting the memory error counters. However, it has been 27,759,752 seconds (7,711 hours or 321 days) since the counters were reset (basically, since the system was booted), and the memory controller is managing about 64GB of memory, with no correctable errors (CEs) or uncorrectable errors (UEs) on the system.

Listing 7

Sample System mc0 Values

login2$ more /sys/devices/system/edac/mc/mc0/ce_count
0
login2$ more /sys/devices/system/edac/mc/mc0/ce_noinfo_count
0
login2$ more /sys/devices/system/edac/mc/mc0/mc_name
Sandy Bridge Socket#0
login2$ more /sys/devices/system/edac/mc/mc0/reset_counters
/sys/devices/system/edac/mc/mc0/reset_counters: Permission denied
login2$ more /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
login2$ more /sys/devices/system/edac/mc/mc0/seconds_since_reset
27759752
login2$ more /sys/devices/system/edac/mc/mc0/size_mb
65536
login2$ more /sys/devices/system/edac/mc/mc0/ue_count
0
login2$ more /sys/devices/system/edac/mc/mc0/ue_noinfo_count
0

Also notice that the system is using Sandy Bridge processors (mc_name). Recall that with newer processors, the memory controller is integrated into the processor. Consequently, each memory controller (mc) is identified by the processor socket it belongs to.

System Administration Recommendations

The EDAC entries in the sysfs filesystem (i.e., /sys/) provide a wealth of information about memory errors. Normally, you wouldn't expect memory errors, either correctable or uncorrectable, to occur very often. However, as a good administrator, you should periodically scan your systems for memory errors.

Writing a simple script to read the file attributes of the memory errors for a system's memory controllers is not difficult, and you can even store these in a simple database if you like.
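
As an example, a minimal sketch of such a script might append a timestamped snapshot of the per-controller counters to a CSV file (the log path and format here are arbitrary choices, not anything EDAC prescribes):

#!/bin/bash
# Append a timestamped snapshot of the EDAC error counters to a CSV file.
LOG=/var/log/edac_counts.csv        # hypothetical location; adjust to taste
for mc in /sys/devices/system/edac/mc/mc*; do
    ce=$(cat $mc/ce_count)
    ue=$(cat $mc/ue_count)
    echo "$(date +%s),$(hostname -s),$(basename $mc),$ce,$ue" >> $LOG
done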

I also found a Nagios plugin [8] that should allow you to check for memory errors, although I haven't tested it. It can be run as a simple standalone script and gives you a bit of information; however, I prefer a script that tells me whether I have any problems and where they are, so I modified it slightly. The result (Listing 8) stays very close to the original.

Listing 8

Modified Nagios Plugin

#!/bin/bash
#
# Original script: https://bitbucket.org/darkfader/nagios/src/c9dbc15609d0/check_mk/edac/plugins/edac?at=default
#
# The best stop for all things EDAC is http://buttersideup.com/edacwiki/ and edac.txt in the kernel documentation.
#
# EDAC memory reporting
if [ -d /sys/devices/system/edac/mc ]; then
  # Iterate over all memory controllers
  for mc in /sys/devices/system/edac/mc/* ; do
    ue_total_count=0
    ce_total_count=0
    # Sum the error counters of all csrows on this controller
    for csrow in $mc/csrow* ; do
        ue_count=$(cat $csrow/ue_count)
        ce_count=$(cat $csrow/ce_count)
        if [ "$ue_count" -gt 0 ]; then
            ue_total_count=$((ue_total_count + ue_count))
        fi
        if [ "$ce_count" -gt 0 ]; then
            ce_total_count=$((ce_total_count + ce_count))
        fi
    done
    echo "   Uncorrectable error count is $ue_total_count on memory controller $mc"
    echo "     Correctable error count is $ce_total_count on memory controller $mc"
  done
fi

I could have just queried the CE and UE memory error counts for the memory controller (mc), but I chose to loop through the individual csrows so I could easily extend the script later, in case I want it to point to the specific DIMM reporting more than zero errors.
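
For example, dropping something like the following inside the outer for mc loop of Listing 8 would name any channel 0 DIMM with a nonzero correctable error count (just a sketch; a dual-channel layout would also need to check the ch1_* files):

for csrow in $mc/csrow*; do
    ce=$(cat $csrow/ch0_ce_count)
    if [ "$ce" -gt 0 ]; then
        echo "     DIMM $(cat $csrow/ch0_dimm_label) reports $ce correctable errors"
    fi
done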

You can modify this script to return the UE and CE count values. High-performance computing people can also put this script into something like Ganglia [9] to track memory error counts. A simple cron job could run this script, although I don't think you would want to run it every minute. Running it once an hour at most or maybe once a day is more reasonable. If you start to notice the correctable error count climb slowly, you might want to run the script more often.
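
For instance, a crontab entry that runs the script once an hour and appends its output to a logfile might look like this (the script path and logfile are placeholders):

# min hour dom mon dow  command
0 * * * * /usr/local/sbin/check_edac.sh >> /var/log/edac_check.log 2>&1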

Notice that I didn't compute "error rates." Some vendors want to know this number, possibly to judge whether the memory is bad, but I will leave this exercise up to you.

Finally, if you see a correctable error, it does not mean the memory DIMM is bad. However, if you see one, keep checking that DIMM, just in case. If the error count keeps rising, you might want to contact your system vendor. Vendors typically do not publish correctable or uncorrectable error rates, but you can call them and discuss what you are seeing on your system, because there might be a threshold at which they will replace the DIMM (they will usually discuss this with their memory vendor).

Infos

  1. ECC Memory: http://en.wikipedia.org/wiki/ECC_memory
  2. Normand, E. Single event upset at ground level. IEEE Transactions on Nuclear Science, 1996;43(6):2742-2750: http://pdf.yuri.se/files/art/2.pdf
  3. Schroeder, B., E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: A large-scale field study. In: Douceur, J.R., Greenberg, A.G., Bonald, T., and Nieh, J. (eds.), Proceedings of the 11th International Joint Conference on Modeling of Computer Systems, SIGMETRICS/Performance 2009 (Seattle, Washington, ACM, 2009), pp. 193-204: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
  4. Bluesmoke: http://bluesmoke.sourceforge.net/
  5. EDAC: http://en.wikipedia.org/wiki/Error_detection_and_correction
  6. EDAC wiki: http://buttersideup.com/edacwiki/Main_Page
  7. EDAC documentation: https://www.kernel.org/doc/Documentation/edac.txt
  8. EDAC Nagios plugin: https://bitbucket.org/darkfader/nagios/src/c9dbc15609d0/check_mk/edac/plugins/edac?at=default
  9. Ganglia: http://ganglia.sourceforge.net/
