Finding and Recording Memory Errors

Analysis Tool

The second tool can be as simple or as complex as desired. A basic version would plot the error rate for each DIMM, grouped by host and memory controller, as a function of time. The memory errors can also be summed for each host and the resulting per-host error rate plotted versus time.
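As a rough sketch of the per-host summation described above, the following assumes each record is a tuple of (timestamp, host, memory controller, DIMM, error count). That layout and the sample values are hypothetical and are not taken from the collection tool's actual output; plotting is left out so the example has no external dependencies.

```python
from collections import defaultdict


def sum_errors_per_host(records):
    """Sum per-DIMM error counts into one total per (host, timestamp)."""
    totals = defaultdict(int)
    for timestamp, host, mc, dimm, count in records:
        totals[(host, timestamp)] += count
    return dict(totals)


# Hypothetical sample data:
# (timestamp, host, memory controller, DIMM, correctable errors)
sample_records = [
    (1000, "node01", "mc0", "dimm0", 2),
    (1000, "node01", "mc0", "dimm1", 1),
    (1000, "node02", "mc0", "dimm0", 0),
    (2000, "node01", "mc0", "dimm0", 3),
    (2000, "node01", "mc1", "dimm0", 4),
]

host_totals = sum_errors_per_host(sample_records)
```

The resulting dictionary is keyed by (host, timestamp), so the values for one host, sorted by timestamp, form the series that would be plotted.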

Another useful function of the analysis tool would be to conduct a statistical analysis of the error rates to uncover trends in the historical data. It could be as simple as computing the average and standard deviation of the error rate over time (looking to see if the error rates are increasing or decreasing) or as complex as examining the error rates as functions of time or location in the data center.
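A minimal sketch of the simple end of that range might look like the following. It computes the mean and sample standard deviation of a series of error rates and adds a least-squares slope as one possible way to judge whether the rate is rising or falling. The function name, the units, and the sample values are assumptions made for illustration.

```python
import statistics


def summarize_error_rates(times, rates):
    """Return (mean, sample standard deviation, least-squares slope).

    Requires at least two samples and at least two distinct time values.
    """
    mean_rate = statistics.mean(rates)
    std_rate = statistics.stdev(rates)
    mean_time = statistics.mean(times)
    # Ordinary least-squares slope of rate versus time
    numerator = sum((t - mean_time) * (r - mean_rate)
                    for t, r in zip(times, rates))
    denominator = sum((t - mean_time) ** 2 for t in times)
    slope = numerator / denominator
    return mean_rate, std_rate, slope


# Hypothetical sample data
sample_times = [0.0, 1.0, 2.0, 3.0]   # days since first scan
sample_rates = [1.0, 2.0, 3.0, 4.0]   # errors per day

mean_rate, std_rate, slope = summarize_error_rates(sample_times, sample_rates)
```

A positive slope suggests the error rate is increasing over the sampled period, and a negative slope suggests it is decreasing. With only a few samples, the slope should be read as a hint rather than a conclusion.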

The code in Listing 2 is a very simple Python script that reads the CSV file and creates a list of lists (like a 2D array).

Listing 2: Reading the Scanned Data

#!/usr/bin/env python3
 
import csv
 
# ===================
# Main Python section
# ===================
#
if __name__ == '__main__':
 
    # Open in text mode with newline='', as the csv module recommends
    with open('file.txt', newline='') as f:
        reader = csv.reader(f)
        data_list = list(reader)   # one inner list of strings per CSV row
    # end with
 
    print(data_list)
 
# end if

Although the code is short, it illustrates how easy it is to read the CSV data. From this point, error rates can be computed along with all sorts of statistical analyses and graphing.
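As one possible next step, the sketch below turns a list of lists like data_list into per-host error rates. It assumes each row holds a timestamp in seconds, a hostname, and a cumulative error count, all as strings. That column layout is an assumption for illustration, so the indices will need to be adjusted to match the real file.

```python
def error_rates(data_list):
    """Convert [timestamp, host, cumulative_count] rows into per-host rates.

    Returns {host: [(timestamp, errors_per_second), ...]}.
    """
    by_host = {}
    for row in data_list:
        # csv.reader yields strings, so convert the numeric fields
        timestamp, host, count = float(row[0]), row[1], int(row[2])
        by_host.setdefault(host, []).append((timestamp, count))

    rates = {}
    for host, samples in by_host.items():
        samples.sort()
        # Rate = change in count / change in time between consecutive scans
        rates[host] = [
            (t1, (c1 - c0) / (t1 - t0))
            for (t0, c0), (t1, c1) in zip(samples, samples[1:])
            if t1 > t0
        ]
    return rates


# Hypothetical rows in the assumed layout
sample_rows = [
    ["0", "node01", "0"],
    ["3600", "node01", "36"],
    ["7200", "node01", "36"],
    ["0", "node02", "5"],
    ["3600", "node02", "5"],
]

rates = error_rates(sample_rows)
```

If the counters reset, for example after a reboot, the difference between consecutive counts becomes negative, so a real tool would need to detect and handle that case.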

Parting Words

As mentioned in the article about how to kill a supercomputer, memory errors, whether correctable or uncorrectable, can lead to problems. Tracking error rates over time is therefore an important part of system monitoring.

A huge “thank you” is owed to Dr. Tommy Minyard at the University of Texas Advanced Computing Center (TACC) and to Dr. James Cuff and Dr. Scott Yockel at Harvard University, Faculty of Arts and Sciences Research Computing (FAS RC), for their help with access to systems used for testing.
