Finding and recording memory errors
Amnesia
Analysis Tool
The second tool can be as simple or as complex as desired. At its most basic, it would plot the error rate for each DIMM, grouped by host and memory controller, as a function of time. The memory errors could also be summed per host and the resulting host-level error rate plotted against time.
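The grouping and per-host summation described above could be sketched as follows. The record layout (epoch time, host, memory controller, DIMM, error count) and the sample values are assumptions made for illustration; the real columns depend on how the collection tool wrote its CSV file.

```python
from collections import defaultdict

# Hypothetical records: (epoch_seconds, host, memory_controller, dimm, error_count).
# The layout and the values are illustrative assumptions, not real scan output.
records = [
    (1000, "node1", "mc0", "dimm0", 2),
    (1000, "node1", "mc0", "dimm1", 1),
    (1000, "node2", "mc0", "dimm0", 0),
    (2000, "node1", "mc0", "dimm0", 3),
    (2000, "node1", "mc0", "dimm1", 1),
    (2000, "node2", "mc0", "dimm0", 4),
]


def per_dimm_series(rows):
    """Group error counts into one time series per (host, mc, dimm)."""
    series = defaultdict(list)
    for ts, host, mc, dimm, count in rows:
        series[(host, mc, dimm)].append((ts, count))
    for points in series.values():
        points.sort()  # order each series by time
    return dict(series)


def per_host_totals(rows):
    """Sum the error counts of all DIMMs in a host at each timestamp."""
    totals = defaultdict(lambda: defaultdict(int))
    for ts, host, _mc, _dimm, count in rows:
        totals[host][ts] += count
    return {host: sorted(by_ts.items()) for host, by_ts in totals.items()}


dimm_series = per_dimm_series(records)
host_totals = per_host_totals(records)
```

Each resulting series is a time-ordered list of (time, value) pairs, which is the shape most plotting libraries expect for a line plot.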
Another use of the tool would be to conduct a statistical analysis of the error rates to uncover trends in the historical data. It could be as simple as computing the average and standard deviation of the error rate over time (looking to see if the error rates are increasing or decreasing) or as complex as examining the error rates as functions of time or location in the data center.
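As a minimal sketch of the simple end of that range, the following computes the mean, the sample standard deviation, and a least-squares slope for one series of error rates. The function name summarize and the sample values are hypothetical, and the slope is only one possible way to judge whether a rate is rising or falling.

```python
import statistics


def summarize(samples):
    """Return mean, sample standard deviation, and least-squares slope.

    samples is a list of (time, error_rate) pairs containing at least
    two distinct time values.
    """
    times = [t for t, _ in samples]
    rates = [r for _, r in samples]
    mean_rate = statistics.mean(rates)
    std_rate = statistics.stdev(rates)
    mean_time = statistics.mean(times)
    # Ordinary least-squares slope: positive means the rate is rising.
    numerator = sum((t - mean_time) * (r - mean_rate) for t, r in samples)
    denominator = sum((t - mean_time) ** 2 for t in times)
    slope = numerator / denominator
    return mean_rate, std_rate, slope


# Hypothetical error rates (errors per hour), one sample per day.
samples = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
mean_rate, std_rate, slope = summarize(samples)
```

A slope near zero with a small standard deviation suggests a stable error rate, whereas a persistently positive slope for one DIMM or host is a reason to look more closely at that hardware.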
The code in Listing 5 is a very simple Python script that reads the CSV file and creates a list of lists (like a 2D array). Although the code is short, it illustrates how easy it is to read the CSV data. From this point, error rates can be computed along with all sorts of statistical analyses and graphing.
Listing 5
Reading the Scanned Data
#!/usr/bin/env python3
import csv

# ===================
# Main Python section
# ===================
if __name__ == '__main__':
    # newline='' lets the csv module handle line endings itself
    with open('file.txt', newline='') as f:
        reader = csv.reader(f)
        data_list = list(reader)  # one inner list per CSV row
    print(data_list)
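Starting from the list of rows that Listing 5 produces, an error rate could be derived roughly as shown here. This is a sketch under stated assumptions: the column order (epoch seconds, host, DIMM label, cumulative correctable-error count) is invented for illustration, an in-memory string stands in for the file, and the rows are assumed to belong to a single DIMM with distinct timestamps.

```python
import csv
import io

# Stand-in for the CSV file. The column order (epoch seconds, host, DIMM
# label, cumulative correctable-error count) is an assumption for illustration.
sample_csv = io.StringIO(
    "1000,node1,dimm0,5\n"
    "4600,node1,dimm0,7\n"
    "8200,node1,dimm0,7\n"
)

data_list = list(csv.reader(sample_csv))


def error_rates(rows):
    """Return (epoch_seconds, errors_per_hour) between consecutive scans."""
    # csv.reader yields strings, so convert the numeric fields first.
    points = sorted((int(row[0]), int(row[3])) for row in rows)
    rates = []
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        hours = (t1 - t0) / 3600.0
        rates.append((t1, (c1 - c0) / hours))
    return rates


rates = error_rates(data_list)
```

If the counters are cumulative, a reboot or driver reload would reset them and produce a negative difference, so a production version would need to detect and skip such intervals.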
Parting Words
As mentioned in the article about how to kill a supercomputer, memory errors, whether correctable or uncorrectable, can lead to problems. Tracking error rates over time is therefore an important part of system monitoring.
A huge "thank you" is owed to Dr. Tommy Minyard at the University of Texas Advanced Computing Center (TACC) and to Dr. James Cuff and Dr. Scott Yockel at Harvard University, Faculty of Arts and Sciences Research Computing (FAS RC), for their help with access to systems used for testing.
Infos
- "How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder" by Al Geist: http://spectrum.ieee.org/computing/hardware/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder
- ECC memory: http://en.wikipedia.org/wiki/ECC_memory
- MCE: https://en.wikipedia.org/wiki/Machine-check_exception
- EDAC: http://en.wikipedia.org/wiki/Error_detection_and_correction
- EDAC wiki: http://buttersideup.com/edacwiki/Main_Page
- "Monitoring Memory Errors" by Jeff Layton, ADMIN, issue 17, 2014: http://www.admin-magazine.com/Archive/2013/17/Error-correcting-code-memory-keeps-single-bit-errors-at-bay/%28language%29/eng-US
- EDAC from kernel documentation: https://www.kernel.org/doc/Documentation/edac.txt
- Linux/Unix time: http://www.cyberciti.biz/faq/convert-epoch-seconds-to-the-current-time-date/
- CSV format: https://en.wikipedia.org/wiki/Comma-separated_values