![Lead Image © Igor Stevanovic, 123RF.com](/var/ezflow_site/storage/images/archive/2013/17/error-correcting-code-memory-keeps-single-bit-errors-at-bay/123rf_16383358_umbrellas_igorstevanovic_resized.png/98557-1-eng-US/123RF_16383358_umbrellas_IgorStevanovic_resized.png_medium.png)
Lead Image © Igor Stevanovic, 123RF.com
Error-correcting code memory keeps single-bit errors at bay
Errant Bits
Data protection and checking occur in various places throughout a system, some in hardware and some in software. The goal is to ensure that data is not corrupted (changed), either on its way to or from the hardware or within the software stack. One key technology is ECC memory (error-correcting code memory) [1].
The standard ECC memory used in systems today can detect and correct single-bit errors; it can also detect double-bit errors, but it cannot correct them. A simple flip of one bit in a byte can make a drastic difference in the value of the byte. For example, a byte (8 bits) with a value of 156 (10011100) that is read from a file on disk suddenly acquires a value of 220 (11011100) if the second bit from the left is flipped from a 0 to a 1 for some reason.
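The arithmetic in this example is easy to check with a few lines of Python. The sketch below is only an illustration; the helper name `flip_bit` is hypothetical and simply XORs the byte with a one-bit mask, counting positions from the left as the text does.

```python
def flip_bit(value: int, position_from_left: int, width: int = 8) -> int:
    """Flip a single bit; positions count from the left, starting at 1."""
    mask = 1 << (width - position_from_left)
    return value ^ mask


original = 0b10011100                 # 156
corrupted = flip_bit(original, 2)     # second bit from the left

print(f"{original:08b} = {original}")     # 10011100 = 156
print(f"{corrupted:08b} = {corrupted}")   # 11011100 = 220
```

A single flipped bit in a high position changes the value by 64 here, which is why even one errant bit can matter.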
ECC memory can detect this problem and correct it without the user ever noticing. Note, however, that only one bit in the byte has been changed and then corrected. If two bits change (say, both the second and the seventh from the left), the byte becomes 11011110 (i.e., 222). Typical ECC memory can detect that this "double-bit" error occurred, but it cannot correct it. In fact, when a double-bit error happens, the memory should raise what is called a "machine check exception" (MCE), which should cause the system to crash.
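To make the "correct one, detect two" behavior concrete, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code: a Hamming code with one extra overall parity bit, applied to a single byte. It is an assumption-laden illustration, not a description of how any particular memory controller works; real ECC modules typically protect wider words (for example, 64 data bits with 8 check bits) in hardware. The names `encode`, `decode`, and the bit layout are chosen here for illustration only.

```python
from typing import List, Optional, Tuple

# Toy layout (an assumption for this sketch): 13-bit code word.
# Index 0 holds the overall parity bit, indexes 1, 2, 4, 8 hold Hamming
# parity bits, and the remaining indexes hold the 8 data bits, MSB first.
PARITY_POSITIONS = [1, 2, 4, 8]
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]


def encode(byte: int) -> List[int]:
    """Encode one byte into a 13-bit SECDED code word."""
    word = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (byte >> (7 - i)) & 1
    for p in PARITY_POSITIONS:
        # Each parity bit makes its group (indexes with bit p set) even.
        word[p] = sum(word[k] for k in range(1, 13) if k & p) % 2
    word[0] = sum(word[1:]) % 2   # makes the whole word even
    return word


def decode(word: List[int]) -> Tuple[Optional[int], str]:
    """Return (byte, status); byte is None if the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(word[k] for k in range(1, 13) if k & p) % 2:
            syndrome |= p
    overall_even = sum(word) % 2 == 0

    if syndrome == 0 and overall_even:
        status = "ok"
    elif not overall_even and syndrome <= 12:
        # Odd overall parity: treat as a single flipped bit. The syndrome
        # is its index (0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    byte = 0
    for pos in DATA_POSITIONS:
        byte = (byte << 1) | word[pos]
    return byte, status


single = encode(0b10011100)
single[5] ^= 1                      # second data bit from the left
print(decode(single))               # (156, 'corrected')

double = encode(0b10011100)
double[5] ^= 1                      # second data bit from the left
double[11] ^= 1                     # seventh data bit from the left
print(decode(double))               # (None, 'uncorrectable')
```

In the double-bit case the decoder can tell that something is wrong but has no way to know which two bits flipped, which corresponds to the situation where the hardware reports an uncorrectable error rather than returning bad data.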
After all, you are using ECC memory because data correctness is important; if an uncorrectable memory error occurs, you would probably want the system to stop. Bit flips usually originate in some sort of electrical or magnetic interference inside the system.
This interference can cause a bit to flip at seemingly random times, depending on the circumstances. According to a Wikipedia article [1] and a paper on single-event upsets in RAM
...