Finding and recording memory errors
Amnesia
A recent article in IEEE Spectrum [1] by Al Geist, titled "How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder," reviewed some of the major ways a supercomputer can be killed. The first subject the author discussed was how cosmic rays can cause memory errors, both correctable and uncorrectable. To protect against some of these errors, ECC (error-correcting code) memory [2] can be used.
The ECC memory generally used in systems today can detect and correct single-bit errors (a change to a single bit). For example, assume a byte with a value of 156 (10011100) is read from a file on disk; if the second bit from the left is flipped from 0 to 1 (11011100), the number becomes 220. A single flipped bit can thus drastically change a byte's value. Fortunately, ECC memory detects and corrects the bit flip, so the user never notices.
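The example above is easy to reproduce. A small Python sketch, using XOR to toggle bit 6 (the second bit from the left in an 8-bit byte):

```python
# Demonstrate a single-bit flip: toggling the second-from-left bit of
# 156 (0b10011100) produces 220 (0b11011100).
value = 0b10011100           # 156
flipped = value ^ (1 << 6)   # XOR with a mask toggles exactly one bit
print(value, flipped)        # 156 220
```

XOR is the natural tool here because it flips exactly the bits set in the mask and leaves all others untouched.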
Current ECC memory can also detect a double-bit flip, but it cannot correct it. When a double-bit error occurs, the memory should trigger a machine check exception (MCE) [3], which should crash the system. The bad data in memory could belong to application data, application instructions, or the operating system itself. Rather than risk any of these scenarios, the system rightly crashes, reporting the error as best it can.
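On a running Linux system, one place such errors are recorded is the kernel's EDAC sysfs interface, which exposes per-memory-controller counts of correctable and uncorrectable errors. A minimal sketch that reads those counters (the sysfs paths follow the kernel's EDAC layout; availability depends on your hardware and kernel configuration):

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} error counts,
    or an empty dict if EDAC is not available on this system."""
    counts = {}
    base = Path(root)
    if base.is_dir():
        for mc in sorted(base.glob("mc[0-9]*")):
            # ce_count / ue_count are the kernel's cumulative error counters
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
            counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    counts = read_edac_counts()
    if counts:
        for name, (ce, ue) in counts.items():
            print(f"{name}: correctable={ce} uncorrectable={ue}")
    else:
        print("EDAC not available on this system")
```

A steadily growing correctable-error count on one controller is often an early warning that a DIMM is failing, well before an uncorrectable error crashes the machine.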
The Wikipedia article on ECC states that most single-bit flips are due to background radiation, primarily neutrons from cosmic rays. The article reports that error rates measured from 2007 to 2009 varied quite a bit, ranging from 10^-10 to 10^-17 errors per bit-hour, a difference of seven orders of magnitude. The upper rate works out to roughly one error per gigabyte of memory per hour; the lower rate, to roughly one error per gigabyte of memory every millennium.
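A quick back-of-the-envelope conversion shows where those intuitions come from (the per-gigabyte framing, with a gigabyte taken as 8 x 10^9 bits, is an illustrative assumption):

```python
# Convert the reported ECC error-rate bounds (errors per bit-hour)
# into errors per gigabyte of memory.
BITS_PER_GIGABYTE = 8 * 10**9
HOURS_PER_YEAR = 8766            # average year length, including leap years

high_rate = 1e-10 * BITS_PER_GIGABYTE   # errors/hour at the high end
low_rate = 1e-17 * BITS_PER_GIGABYTE    # errors/hour at the low end
years_between_errors = 1 / (low_rate * HOURS_PER_YEAR)

print(f"high end: {high_rate:.1f} errors/hour (about one per hour)")
print(f"low end: one error every {years_between_errors:,.0f} years")
```

At the high end, 10^-10 errors/bit-hour across 8 x 10^9 bits gives about 0.8 errors per hour; at the low end, the mean time between errors stretches to well over a thousand years.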
A Linux kernel