Photo by Kier… in Sight on Unsplash
Log analysis in high-performance computing
State of the Cluster
Gathering logs from distributed systems for manual searching is a typical task performed in high-performance computing (HPC) [1]. Log analysis is important for cybersecurity, understanding HPC cluster behavior, and event and trend analysis. In this article, I address the state of the art in log analysis and how it can be applied to HPC.
Origins
Log analysis can produce information through a variety of functions and technologies, including:
- ingestion
- centralization
- normalization
- classification and logging
- pattern recognition
- correlation analysis
- monitoring and alerts
- artificial ignorance
- reporting
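Of the items above, artificial ignorance is perhaps the least self-explanatory. The general idea is to discard entries you already know to be routine so that whatever remains stands out. The following is only a minimal sketch of that idea: the patterns, the sample log lines, and the function name `unexpected_lines` are all hypothetical and would have to be adapted to what counts as routine on a real cluster.

```python
import re

# Hypothetical patterns for entries considered routine on an imaginary cluster.
ROUTINE_PATTERNS = [
    re.compile(r"session (opened|closed) for user \w+"),
    re.compile(r"CRON\[\d+\]"),
]


def unexpected_lines(lines, routine=ROUTINE_PATTERNS):
    """Return the lines that match none of the routine patterns."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in routine)
    ]


# Invented sample entries, for illustration only.
SAMPLE = [
    "Jan 12 08:01:13 node01 sshd[2211]: session opened for user alice",
    "Jan 12 08:05:00 node01 CRON[2300]: (root) CMD (run-parts /etc/cron.hourly)",
    "Jan 12 08:07:42 node02 kernel: examplemod: module failed to load",
    "Jan 12 09:15:27 node01 sshd[2211]: session closed for user alice",
]

# Print only what the routine patterns did not account for.
for line in unexpected_lines(SAMPLE):
    print(line)
```

With these invented inputs, only the kernel message survives the filter. In practice the list of routine patterns tends to grow over time as more entries are judged uninteresting.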
Logs are great for checking the health of a set of systems and can be used to locate obvious problems, such as kernel modules not loading. They can also be used to find attempts to break into systems through various means, including shared credentials. However, these examples do not take full advantage of the information contained in logs: log analysis can also be used to improve your system administration skills.
When analyzing or just watching logs over a period of time, you can get a feel for the rhythm of your systems. For example: When do people log in and out of the system? What kernel modules are loaded? What, if any, errors occur (and when)? The answers to these questions allow you to recognize events, that is, moments when things don't seem quite right with the systems, that "normal" log analysis might miss. A great question is: Why does user X have a new version of an application? Normal log analysis would not care about this query, but perhaps the user needed the new version, which could indicate that others might need it too, prompting you to build it and make it available to all.
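One simple way to start getting a feel for that rhythm is to tally events by hour of day. The sketch below assumes traditional syslog-style lines ("Mon DD HH:MM:SS host message"); the sample entries, the function name `hourly_counts`, and the keyword it searches for are made up for illustration and are not taken from any particular tool.

```python
import re
from collections import Counter

# Assumed layout: "Mon DD HH:MM:SS host message" (traditional syslog style).
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+(.*)$")


def hourly_counts(lines, keyword):
    """Count lines whose message contains `keyword`, grouped by hour of day."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and keyword in match.group(2):
            counts[int(match.group(1))] += 1
    return counts


# Invented sample entries, for illustration only.
SAMPLE = [
    "Jan 12 08:01:13 node01 sshd[2211]: session opened for user alice",
    "Jan 12 08:40:02 node03 sshd[2950]: session opened for user bob",
    "Jan 12 17:30:45 node01 sshd[2211]: session closed for user alice",
    "Jan 13 03:12:09 node02 sshd[4102]: session opened for user carol",
]

logins = hourly_counts(SAMPLE, "session opened")
for hour in sorted(logins):
    print(f"{hour:02d}:00  {logins[hour]} login(s)")
```

Run over weeks of real logs, a table like this gives a baseline; a login at an hour that is normally quiet is then easy to spot, even though no single entry is an error.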
Developing an intuition of how a system ...