Lead Image © Igor Stevanovic, 123RF.com

Error-correcting code memory keeps single-bit errors at bay

Errant Bits

Article from ADMIN 17/2013
By
System memory is extremely important to your applications, which is why many systems use error-correcting code (ECC) memory. ECC memory can typically detect and correct single-bit memory errors, and Linux has a reporting capability that collects this information.

Data protection and checking occur in various places throughout a system, some in hardware and some in software. The goal is to ensure that data is not corrupted (changed), either on its way to or from the hardware or within the software stack. One key technology is ECC memory (error-correcting code memory) [1].

The standard ECC memory used in systems today can detect and correct what are called single-bit errors, and although it can detect double-bit errors, it cannot correct them. A simple flip of one bit in a byte can make a drastic difference in the value of the byte. For example, a byte (8 bits) with a value of 156 (10011100) that is read from a file on disk suddenly acquires a value of 220 if the second bit from the left is flipped from a 0 to a 1 (11011100) for some reason.
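The arithmetic in this example is easy to check with a few lines of Python. The following is only a small illustration of the bit flip described above; the helper name `flip_bit_from_left` is made up for this sketch and is not part of any library.

```python
def flip_bit_from_left(value: int, position: int, width: int = 8) -> int:
    """Flip one bit of `value`, counting `position` from the left, starting at 1."""
    if not 1 <= position <= width:
        raise ValueError("position must be between 1 and width")
    # Position 1 is the most significant bit, so shift from the right end.
    mask = 1 << (width - position)
    return value ^ mask


original = 0b10011100
flipped = flip_bit_from_left(original, 2)

print(f"{original:08b} = {original}")  # 10011100 = 156
print(f"{flipped:08b} = {flipped}")    # 11011100 = 220
```

A single flipped bit near the left end of the byte changes its value by 64, which is why even one upset can matter.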

ECC memory can detect this problem and correct it without the user ever noticing. Note, however, that only one bit in the byte changed and was then corrected. If two bits change, say the second and the seventh from the left, the byte becomes 11011110 (i.e., 222). Typical ECC memory can detect that this "double-bit" error occurred, but it cannot correct it. In fact, when a double-bit error happens, the memory should raise what is called a "machine check exception" (MCE), which should cause the system to crash.

After all, you are using ECC memory because correct data is important; if an uncorrectable memory error occurs, you would probably want the system to stop. Bit flips usually originate in some sort of electrical or magnetic interference inside the system.

This interference can cause a bit to flip at seemingly random times, depending on the circumstances. According to a Wikipedia article [1] and a paper on single-event upsets in RAM

...