Log analysis in high-performance computing
State of the Cluster
Gathering logs from distributed systems for manual searching is a typical task performed in high-performance computing (HPC) [1]. Log analysis is important for cybersecurity, understanding HPC cluster behavior, and event and trend analysis. In this article, I address the state of the art in log analysis and how it can be applied to HPC.
Origins
Log analysis can produce information through a variety of functions and technologies, including:
- ingestion
- centralization
- normalization
- classification and logging
- pattern recognition
- correlation analysis
- monitoring and alerts
- artificial ignorance
- reporting
Logs are great for checking the health of a set of systems and can be used to locate obvious problems, such as kernel modules not loading. They can also be used to find attempts to break into systems through various means, including shared credentials. However, these examples do not really take advantage of all the information contained in logs: Log analysis can be used to improve system administration skills.
When analyzing or just watching logs over a period of time, you can get a feel for the rhythm of your systems; for example: When do people log in and out of the system? What kernel modules are loaded? What, if any, errors occur (and when)? The answers to these questions allow you to recognize events, that is, moments when things don't seem quite right with the systems, that "normal" log analysis might miss. A great question is: Why does user X have a new version of an application? Normal log analysis would not care about this query, but perhaps the user needed a new version, which could indicate that others might also need it, prompting you to build the newer version and make it available to all.
Developing an intuition for how a system or, in HPC, a set of systems behaves can take a long time and might seem impossible to achieve, but it can be accomplished by watching logs. If you happen to leave or change jobs, a new admin would have to start from scratch to develop this systems intuition. Perhaps log analysis on your HPC systems offers a better way. Before going there, I'll look at the list of technologies presented at the beginning of this article.
Ingestion and Centralization
The ingestion and centralization step is important for HPC systems because of their distributed nature. Larger systems would use methods that ingest the logs to a dedicated centralized server, whereas smaller systems or smaller logs could use a virtual machine (VM). Of course, these logs need to be tagged with the originating system so they can be differentiated.
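As a minimal sketch of the tagging idea, the following Python wraps each raw log line in a record that carries the originating host before it is shipped to the central server. The field names, the tag_record and to_wire helpers, the host name node042, and the sample log line are all assumptions made for illustration; real collectors have their own record formats.

```python
import json
import socket
from datetime import datetime, timezone


def tag_record(line, host=None, source="syslog"):
    """Wrap one raw log line in a dict that records where it came from."""
    return {
        # Fall back to the local hostname when no host is given
        "host": host or socket.gethostname(),
        "source": source,
        # Time the collector saw the line, in UTC
        "received": datetime.now(timezone.utc).isoformat(),
        "message": line.rstrip("\n"),
    }


def to_wire(record):
    """Serialize a tagged record as one JSON line for shipping."""
    return json.dumps(record, sort_keys=True)


# Hypothetical host name and log line, for illustration only
record = tag_record("kernel: example_mod: module failed to load\n", host="node042")
print(to_wire(record))
```

Keeping the host in every record, rather than only in a file name, means the tag survives when entries from many nodes are merged into one stream.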
When you get to the point of log analysis, you really aren't just talking about system logs. Log ingestion also means logs from network devices, storage systems, and even applications that don't always write to /var/log.
A key factor that can be easily neglected in log collection from disparate systems and devices is the time correlation of these logs. If something happens on a system at a particular time, other systems might have been affected, as well. The exact time of the event is needed to search across logs from other systems and devices to correlate all logs. Therefore, enabling the Network Time Protocol (NTP) [2] on all systems is critical.
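To illustrate why synchronized clocks matter, here is a small Python sketch that pulls together entries from different hosts that fall within a few seconds of an event of interest. It assumes the entries have already been parsed into dictionaries with a datetime in the time field; the correlate function, the five-second window, and the sample entries are made up for this example.

```python
from datetime import datetime, timedelta


def correlate(events, anchor_time, window_seconds=5.0):
    """Return events from any host within window_seconds of anchor_time."""
    window = timedelta(seconds=window_seconds)
    hits = [e for e in events if abs(e["time"] - anchor_time) <= window]
    # Sort by time so the sequence across hosts is easy to read
    return sorted(hits, key=lambda e: e["time"])


# Made-up entries for illustration only
events = [
    {"host": "node01", "time": datetime(2024, 1, 1, 12, 0, 0),
     "message": "link down on ib0"},
    {"host": "switch1", "time": datetime(2024, 1, 1, 12, 0, 2),
     "message": "port 17 flapping"},
    {"host": "node02", "time": datetime(2024, 1, 1, 12, 30, 0),
     "message": "user login"},
]

related = correlate(events, datetime(2024, 1, 1, 12, 0, 1))
for e in related:
    print(e["host"], e["time"].isoformat(), e["message"])
```

If the clocks on node01 and switch1 drifted by more than the window, the two related entries would not be matched, which is the practical argument for running NTP everywhere.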
Normalization
Normalization is just what it sounds like: the process of converting all or parts of the log content to the same format – most importantly for items such as the time stamp of the log entry and the IP address.
The tricky bit of normalization is automation. Some logging tools understand formats from various sources, but not all of them. You might have to write a bit of code to convert the log entries into a format suitable for the analysis. The good news is that you only have to do this once for each source of log entries, of which there aren't many.
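The kind of conversion code mentioned above might look like the following Python sketch, which normalizes two common time stamp layouts to ISO 8601 in UTC and puts IP addresses into canonical form. The function names are hypothetical, and the sketch makes several assumptions: time stamps without a time zone are treated as UTC, classic syslog time stamps (which carry no year) get the current year unless one is supplied, and month abbreviations are in English.

```python
import ipaddress
from datetime import datetime, timezone


def normalize_timestamp(text, year=None):
    """Convert a few common time stamp layouts to ISO 8601 in UTC."""
    text = text.strip()
    try:
        # ISO 8601, e.g. 2024-03-05T14:07:09+01:00
        parsed = datetime.fromisoformat(text)
    except ValueError:
        # Classic syslog layout, e.g. "Mar  5 14:07:09", which has no year
        if year is None:
            year = datetime.now().year
        parsed = datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")
    if parsed.tzinfo is None:
        # Assumption: time stamps without a zone are already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def normalize_ip(text):
    """Return the canonical string form of an IPv4 or IPv6 address."""
    return str(ipaddress.ip_address(text.strip()))


print(normalize_timestamp("2024-03-05T14:07:09+01:00"))
print(normalize_timestamp("Mar  5 14:07:09", year=2024))
print(normalize_ip("2001:0DB8:0000:0000:0000:0000:0000:0001"))
```

A converter like this would be written once per log source and run at ingestion time, so every entry reaching the analysis stage shares the same time and address formats.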