Finding and recording memory errors
Amnesia
A recent article in IEEE Spectrum [1] by Al Geist, titled "How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder," reviewed some of the major ways a supercomputer can be killed. The first subject the author discussed was how cosmic rays can cause memory errors, both correctable and uncorrectable. To protect against some of these errors, ECC (error-correcting code) memory [2] can be used.
The general ECC memory used in systems today can detect and correct single-bit errors (changes to a single bit). For example, assume a byte with a value of 156 (10011100) is read from a file on disk; if the second bit from the left is flipped from a 0 to a 1 (11011100), the number becomes 220. A simple flip of one bit in a byte can make a drastic difference in its value. Fortunately, ECC memory can detect and correct the bit flip, so the user does not notice.
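If you want to check the arithmetic yourself, a quick Bash one-liner (purely an illustration, not part of any tool discussed here) XORs the byte with a mask for the flipped bit, which is bit 6 when counting from the least significant bit as bit 0:

$ echo $(( 156 ^ (1 << 6) ))
220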
Current ECC memory can also detect double-bit errors, but it cannot correct them. When a double-bit error occurs, the memory should cause a machine check exception (MCE) [3], which should crash the system. The bad data in memory could be application data, application instructions, or operating system data or instructions. Rather than risk any of these scenarios, the system rightly crashes, indicating the error as best it can.
The Wikipedia article on ECC states that most single-bit flips are due to background radiation, primarily neutrons from cosmic rays. The article reports that error rates measured from 2007 to 2009 varied quite a bit, ranging from 10^-10 to 10^-17 errors per bit-hour, a difference of seven orders of magnitude. The upper figure works out to roughly one error per gigabyte of memory per hour; the lower figure, to roughly one error per gigabyte of memory every thousand years or so.
A Linux kernel module called EDAC [4], which stands for error detection and correction, can report ECC memory errors and corrections. EDAC can capture and report hardware errors in memory or cache, as well as errors related to direct memory access (DMA), fabric switches, thermal throttling, the HyperTransport bus, and others. One of the best sources of information about EDAC is the EDAC wiki [5].
Important Considerations
Monitoring ECC errors and corrections is an important task for system administrators of production systems. Rather than monitoring, logging, and perhaps alarming on the absolute number of ECC errors or corrections, you should monitor the rate of change of errors and corrections. The unit of measure is up to the system administrator, but a commonly used unit, in line with the figures cited from the Wikipedia article, is errors per gigabit of memory per hour.
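As a rough sketch of what such a rate calculation could look like (the one-hour sampling interval, the choice of mc0, and the conversion to errors per gigabit per hour are illustrative assumptions, not part of any standard tool), two readings of ce_count taken some time apart can be turned into a rate with a little Bash and awk:

#!/bin/bash
# Sketch: correctable-error rate for memory controller mc0,
# expressed in errors per gigabit of memory per hour.
MC=/sys/devices/system/edac/mc/mc0
ce_start=$(cat $MC/ce_count); t_start=$(date +%s)
sleep 3600                        # sampling interval (assumed: one hour)
ce_end=$(cat $MC/ce_count); t_end=$(date +%s)
size_mb=$(cat $MC/size_mb)        # memory handled by this controller (MB)
# awk does the floating-point math: (error delta / hours) / gigabits
awk -v d=$((ce_end - ce_start)) -v s=$((t_end - t_start)) -v mb=$size_mb \
    'BEGIN { printf "%.3e CE/Gbit-hr\n", (d / (s / 3600)) / (mb * 8 / 1024) }'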
In a previous article [6], I wrote a general introduction to ECC memory, specifically about Linux and memory errors and how to collect correctable and uncorrectable error counts. Typical systems, such as the one examined in that article, can have more than one memory controller. The example was a two-socket system with two memory controllers, mc0 and mc1. You can get this information with the command:
$ ls -s /sys/devices/system/edac/mc
The memory associated with each memory controller is organized into physical DIMMs, which are laid out in a table of "chip-select" rows (csrow) and channels.
According to the kernel documentation for EDAC [7], typical memory has eight csrows, but it really depends on the layout of the motherboard, the memory controller, and the DIMM characteristics. The number of csrows can be found by examining the /sys entries for a memory controller. For example:
$ ls -s /sys/devices/system/edac/mc/mc0
The elements labeled csrow<X> (where <X> is an integer) are counted to determine the number of csrows for the memory controller (Listing 1). In this case, I had two memory channels per controller and four DIMMs per channel, for a total of eight csrow entries (csrow0 to csrow7).
Listing 1
Attribute Files for mc0
$ ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 csrow1  0 csrow4  0 csrow7   0 reset_counters       0 size_mb
0 ce_noinfo_count  0 csrow2  0 csrow5  0 device   0 sdram_scrub_rate     0 ue_count
0 csrow0           0 csrow3  0 csrow6  0 mc_name  0 seconds_since_reset  0 ue_noinfo_count
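If you just want the count, one way (assuming the same /sys layout) is to count the csrow directories directly:

$ ls -d /sys/devices/system/edac/mc/mc0/csrow* | wc -l
8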
The /sys filesystem entry for each csrow contains a number of files with information about the specific DIMM. Listing 2 shows the csrow0 attributes. Each of these entries either holds information that can be used for monitoring or is a control file. Listing 3 shows the values of some of the mc0 attributes from Listing 1 on the test system. Note that reset_counters is a control file that, unsurprisingly, lets you reset the counters.
Listing 2
Content of csrow0
$ ls -s /sys/devices/system/edac/mc/mc0/csrow0
total 0
0 ce_count      0 ch0_dimm_label  0 edac_mode  0 size_mb
0 ch0_ce_count  0 dev_type        0 mem_type   0 ue_count
Listing 3
Attribute Values of mc0
$ more /sys/devices/system/edac/mc/mc0/ce_count
0
$ more /sys/devices/system/edac/mc/mc0/ce_noinfo_count
0
$ more /sys/devices/system/edac/mc/mc0/mc_name
Sandy Bridge Socket#0
$ more /sys/devices/system/edac/mc/mc0/reset_counters
/sys/devices/system/edac/mc/mc0/reset_counters: Permission denied
$ more /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
$ more /sys/devices/system/edac/mc/mc0/seconds_since_reset
27759752
$ more /sys/devices/system/edac/mc/mc0/size_mb
65536
$ more /sys/devices/system/edac/mc/mc0/ue_count
0
$ more /sys/devices/system/edac/mc/mc0/ue_noinfo_count
0
Three of the entries from Listing 2 are key to a system administrator:
- size_mb: The amount of memory (MB) on this csrow (attribute file).
- ce_count: The total count of correctable errors that have occurred on this csrow (attribute file).
- ue_count: The total count of uncorrectable errors that have occurred on this csrow (attribute file).
From this information, you can compute both the correctable (ce_) and the uncorrectable (ue_) error rates.
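As a simple sketch of such a calculation (not one of the tools developed below; it uses the mc-level attribute files shown in Listing 3 and assumes the counters have not been reset recently), the average rates per memory controller can be computed from the error counts, the memory size, and seconds_since_reset:

#!/bin/bash
# Sketch: average CE and UE rates per memory controller since the last
# counter reset, in errors per gigabit of memory per hour.
for mc in /sys/devices/system/edac/mc/mc* ; do
   ce=$(cat $mc/ce_count)
   ue=$(cat $mc/ue_count)
   mb=$(cat $mc/size_mb)
   secs=$(cat $mc/seconds_since_reset)
   awk -v ce=$ce -v ue=$ue -v mb=$mb -v s=$secs -v name=$(basename $mc) \
       'BEGIN { gbit = mb * 8 / 1024; hrs = s / 3600;
                printf "%s: %.3e CE/Gbit-hr, %.3e UE/Gbit-hr\n",
                       name, ce / (gbit * hrs), ue / (gbit * hrs) }'
done

On the test system in Listing 3, with both counters at zero, these rates are simply zero.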
Time to Code Some Tools!
Good system administrators will write simple code to compute error rates. Great system administrators will create a database of the values so a history of the error rates can be examined. With this in mind, the simple code presented here takes the data entries from the /sys filesystem for the memory controllers and the DIMMs and writes these values to a text file. The text file serves as the "database" of values, from which historical error rates can be computed and examined.
The first of two tools scans the /sys filesystem and writes the values and the time of the scan to a file. Time is written in "seconds since the epoch" [8], which can be converted to any time format desired. The second tool reads the values from the database and creates a list within a list, which prepares the data for analysis, such as plotting graphs and statistical analyses.
Database Creation Code
The values from the /sys filesystem scan are stored in a text file as comma-separated values (CSV format) [9], so they can be read by a variety of tools and imported into spreadsheets. The code can be applied to any system (host), so each data entry begins with the hostname. A shared filesystem is the preferred location for storing the text file. A simple pdsh command can run the script on all nodes of the system and write to the same text file. Alternatively, a cron job created on each system can run the tool at specified times and write to the text file. With central cluster control, the cron job can be added to the compute node instance or pushed to the nodes.
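For example (the script name and shared path here are hypothetical), the scan could be launched across all nodes with pdsh or run at the top of every hour from each node's crontab:

# run the scan on all nodes from a management node
$ pdsh -a /shared/tools/edac_scan.sh

# or, in each node's crontab, run it hourly
0 * * * * /shared/tools/edac_scan.sh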
Before writing any code, you should define the data format. For the sake of completeness, all of the values for each csrow entry (all DIMMs) are written to the text file to allow a deeper examination of the individual DIMMs and to allow the error values to be summed for either each memory controller or the entire host (host-level examination).
The final file format is pretty simple to describe. Each row or line in the file will correspond to one scan of the /sys filesystem, including all of the memory controllers and all of the csrow values. Each row will have the following comma-separated values:
- Hostname
- Time in seconds since the epoch (integer)
- "mc" + memory controller number, 0 to N (e.g., mc0)
- For each DIMM (csrow) in each memory controller, 0 to N:
  - "csrow" + csrow number (e.g., csrow4)
  - Memory DIMM size in megabytes (size_mb)
  - Correctable error count (ce_count) for the particular csrow
  - Uncorrectable error count (ue_count) for the particular csrow
A sample of an entry in the database might be:
login2,1456940649,mc0,csrow0,8192,0,0,csrow1,8192,0,0,...
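Given that format, pulling values back out of the database is straightforward. As a minimal example (not the second tool itself; the hostname and the /tmp/file.txt location simply match the test setup used below), awk can print the time stamp and the ce_count for the first DIMM (field 6, csrow0 of mc0) for one host:

$ awk -F, '$1 == "login2" { print $2, $6 }' /tmp/file.txt
1456940649 0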
Listing 4 is based on sample code from a previous article [6].
Listing 4
/sys Filesystem Scan
#!/bin/bash
#
# Original script:
#   https://bitbucket.org/darkfader/nagios/src/
#   c9dbc15609d0/check_mk/edac/plugins/edac?at=default
# The best stop for all things EDAC is
# http://buttersideup.com/edacwiki/ and
# edac.txt in the kernel doc.
#
# EDAC memory reporting
if [ -d /sys/devices/system/edac/mc ]; then
   host_str=`hostname`
   output_str="$host_str,`date +%s`"
   # Iterate all memory controllers
   i=-1
   for mc in /sys/devices/system/edac/mc/* ; do
      i=$((i+1))
      output_str="$output_str,mc$i"
      ue_total_count=0
      ce_total_count=0
      # Iterate all csrow values
      j=-1
      for csrow in $mc/csrow* ; do
         j=$((j+1))
         output_str="$output_str,csrow$j"
         ue_count=`more $csrow/ue_count`
         ce_count=`more $csrow/ce_count`
         dimm_size=`more $csrow/size_mb`
         # Accumulate per-controller totals
         if [ "$ue_count" -gt 0 ]; then ue_total_count=$((ue_total_count+ue_count)); fi
         if [ "$ce_count" -gt 0 ]; then ce_total_count=$((ce_total_count+ce_count)); fi
         output_str="$output_str,$dimm_size,$ce_count,$ue_count"
      done
      #echo " UE count is $ue_total_count on memory controller $mc "
      #echo " CE count is $ce_total_count on memory controller $mc "
   done
   echo "$output_str" >> /tmp/file.txt
fi
The data is output to a text file in /tmp for testing purposes. This location can be changed; as mentioned earlier, a shared filesystem is recommended.