Lead Image © lightwise, 123RF.com

Finding and recording memory errors

Amnesia

Article from ADMIN 32/2016
By
Memory errors are a silent killer of high-performance computers, but you can find and track these stealthy assassins.

A recent article in IEEE Spectrum [1] by Al Geist, titled "How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder," reviewed some of the major ways a supercomputer can be killed. The first subject the author discussed was how cosmic rays can cause memory errors, both correctable and uncorrectable. To protect against some of these errors, ECC (error-correcting code) memory [2] can be used.

The general ECC memory used in systems today can detect and correct single-bit errors (changes to a single bit). For example, assume a byte with a value of 156 (10011100) is read from a file on disk; if the second bit from the left is flipped from a 0 to a 1 (11011100), the number becomes 220. A simple flip of one bit in a byte can make a drastic difference in its value. Fortunately, ECC memory can detect and correct the bit flip, so the user does not notice.
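The arithmetic in that example can be checked with a few lines of Python. This is only a sketch of the bit flip itself, not of anything ECC hardware does; the helper name `flip_bit` is made up for illustration.

```python
def flip_bit(value: int, bit_from_left: int, width: int = 8) -> int:
    """Invert one bit of value; bit_from_left is 1-based, counted from the most significant bit."""
    shift = width - bit_from_left
    return value ^ (1 << shift)


original = 0b10011100              # the byte from the example
corrupted = flip_bit(original, 2)  # flip the second bit from the left

print(f"{original:08b} = {original}")
print(f"{corrupted:08b} = {corrupted}")
```

Flipping the same bit a second time restores the original byte, which is all a correction amounts to once the position of the bad bit is known.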

Current ECC memory can also detect a double-bit flip but cannot correct it. When a double-bit error occurs, the memory should raise a machine check exception (MCE) [3], which should crash the system. The corrupted data could belong to an application, or it could be instructions in an application or the operating system. Rather than risk continuing with corrupted data or code, the system rightly crashes, reporting the error as best it can.
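To make "correct one flipped bit, detect two" concrete, the following toy uses an extended Hamming code that protects four data bits with four check bits. It is an assumption chosen for illustration: real ECC memory protects much wider words, and this sketch is not the code any particular memory controller implements. The function names `encode` and `decode` are hypothetical.

```python
from typing import Optional

DATA_POSITIONS = (3, 5, 6, 7)  # non-power-of-two positions carry the data bits


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    for bit, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> bit) & 1
    # The check bit at position 1, 2, or 4 covers every position whose index has that bit set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    # Position 0 holds an overall parity bit, which enables double-error detection.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list[int]) -> tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(code)
    # The syndrome is the XOR of the indices of all set bits; it is 0 for a valid codeword.
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd overall parity means a single flip; the syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    data = sum(word[pos] << bit for bit, pos in enumerate(DATA_POSITIONS))
    return status, data


word = encode(0b1011)
single = list(word)
single[5] ^= 1   # one flipped bit
double = list(word)
double[2] ^= 1   # two flipped bits
double[6] ^= 1

print(decode(word))
print(decode(single))
print(decode(double))
```

In this sketch the "uncorrectable" result plays the role of the condition that would trigger an MCE: the decoder knows the word is wrong but cannot tell which bits to repair.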

The Wikipedia article on ECC states that most single-bit flips are due to background radiation, primarily neutrons from cosmic rays. The article reports that error rates measured from 2007 to 2009 varied widely, ranging from 10^-10 to 10^-17 errors/bit-hour, a difference of seven orders of magnitude. The upper number is just about one error per gigabyte of memory per hour. The lower number indicates roughly one error every 1,000 years per gigabyte of memory.
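The conversion from errors/bit-hour to something more tangible is simple arithmetic, sketched below. It assumes a decimal gigabyte of 8 x 10^9 bits and a 365-day year, which is the unit for which the quoted rates work out to roughly one error per hour and roughly one per millennium. The function names are made up for illustration.

```python
BITS_PER_GIGABYTE = 8 * 10**9   # assumption: decimal gigabyte
HOURS_PER_YEAR = 24 * 365       # assumption: 365-day year


def errors_per_hour(rate_per_bit_hour: float, bits: int = BITS_PER_GIGABYTE) -> float:
    """Expected number of bit errors per hour for the given amount of memory."""
    return rate_per_bit_hour * bits


def years_between_errors(rate_per_bit_hour: float, bits: int = BITS_PER_GIGABYTE) -> float:
    """Mean time between bit errors, in years, for the given amount of memory."""
    return 1.0 / (errors_per_hour(rate_per_bit_hour, bits) * HOURS_PER_YEAR)


for rate in (1e-10, 1e-17):
    print(f"rate {rate:.0e} errors/bit-hour: "
          f"{errors_per_hour(rate):.2e} errors/hour per GB, "
          f"one error every {years_between_errors(rate):.2e} years per GB")
```

Passing a larger `bits` value scales the estimate to a whole node or cluster; the expected error count grows linearly with the amount of memory installed.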

A Linux kernel

...