Log analysis in high-performance computing
State of the Cluster
Monitoring and Alerts
Log analysis tools usually include the ability to notify you about events that might require human intervention. These events can be tied to alerts in various forms, such as email or dashboards, so you are notified promptly. A simple example is the loss of a stream of events to a system log, which usually requires a human to find out why the stream stopped.
A second example is environmental properties in a data center going beyond their normal levels. Early in my career, a small group of engineers wanted to have a meeting in a data center, so they turned off all of the air conditioning units because they were too loud. As the ambient temperature rose above a critical threshold, I got several email messages and a beeper page. (That shows you how long ago it was.)
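As a minimal sketch of the lost-stream example, the following Python checks when each stream last produced an entry and flags any that have been silent too long. The function name, the 10-minute limit, and the host names are assumptions for illustration, not taken from any particular tool.

```python
from datetime import datetime, timedelta


def find_silent_streams(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the names of log streams that have been quiet for too long.

    last_seen maps a stream name (here, a hostname) to the timestamp of
    the most recent entry received from that stream.
    """
    return sorted(
        name for name, stamp in last_seen.items()
        if now - stamp > max_silence
    )


# Hypothetical data: node02 stopped sending entries 45 minutes ago.
now = datetime(2024, 1, 1, 12, 0)
last_seen = {
    "node01": now - timedelta(minutes=1),
    "node02": now - timedelta(minutes=45),
}

for host in find_silent_streams(last_seen, now):
    print(f"ALERT: no log entries from {host} in over 10 minutes")
```

The same pattern, comparing a measured value against a limit, would apply to an environmental reading such as ambient temperature. A real tool would send the alert by email or to a dashboard rather than printing it.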
Artificial Ignorance
Because hardware problems are fairly rare overall, most log entries describe a system that is operating normally. Log analysis tools have therefore implemented what they call "artificial ignorance," which uses a machine learning (ML) approach to discard log entries that have been defined as "uninteresting." Commonly, these entries are typical of a normally operating system and don't provide any useful information (however "useful" is defined). The idea is to save log storage by ignoring and even deleting this uninteresting information.
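To make the idea concrete, here is a deliberately simplified sketch. It swaps the ML approach for a fixed list of regular expressions that mark entries as uninteresting, and it separates the entries rather than deleting them. The patterns, the function name, and the sample log lines are all hypothetical.

```python
import re

# Hypothetical patterns for entries considered routine on this system.
IGNORE_PATTERNS = [
    re.compile(r"session (opened|closed) for user"),
    re.compile(r"CRON\[\d+\]"),
]


def split_entries(lines, patterns=IGNORE_PATTERNS):
    """Partition log lines into (interesting, ignored) lists."""
    interesting, ignored = [], []
    for line in lines:
        if any(p.search(line) for p in patterns):
            ignored.append(line)
        else:
            interesting.append(line)
    return interesting, ignored


sample = [
    "Jan  1 12:00:01 node01 CRON[2211]: (root) CMD (run-parts /etc/cron.hourly)",
    "Jan  1 12:00:07 node01 sshd[901]: pam_unix(sshd:session): session opened for user alice",
    "Jan  1 12:01:13 node01 kernel: EDAC MC0: 1 CE memory read error",
]

interesting, ignored = split_entries(sample)
for line in interesting:
    print(line)
```

Returning the ignored entries instead of dropping them leaves the decision about archiving or deleting them to the administrator.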
My opinion, with which you can agree or disagree, is that I would not enable artificial ignorance until a system has been running for a long time. The uninteresting logs can provide information about how the system typically runs. For example, when do people log in to the system? When do users typically run jobs? A lot of day-to-day activities are important to know but are lost when artificial ignorance is used.
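As a small illustration of what routine entries can tell you, this sketch counts logins by hour of day. It assumes classic syslog-style lines whose third whitespace-separated field is the time (HH:MM:SS) and that a login is marked by the text "session opened for user"; both are assumptions about the log format, and the sample lines are made up.

```python
from collections import Counter


def logins_per_hour(lines, marker="session opened for user"):
    """Count login entries by hour of day (0-23).

    Assumes the third whitespace-separated field of each line is a
    time of the form HH:MM:SS, as in classic syslog output.
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            hour = int(line.split()[2].split(":")[0])
            counts[hour] += 1
    return counts


# Hypothetical login-node entries.
sample = [
    "Jan  1 08:02:11 login1 sshd[901]: pam_unix(sshd:session): session opened for user alice",
    "Jan  1 08:40:52 login1 sshd[950]: pam_unix(sshd:session): session opened for user bob",
    "Jan  1 17:15:03 login1 sshd[901]: pam_unix(sshd:session): session closed for user alice",
    "Jan  2 09:05:30 login1 sshd[1200]: pam_unix(sshd:session): session opened for user carol",
]

for hour, count in sorted(logins_per_hour(sample).items()):
    print(f"{hour:02d}:00  {count} login(s)")
```

If these entries had been discarded as uninteresting, this kind of baseline could not be built.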
Although keeping or watching everyday activities might seem like Groundhog Day, I feel it is important for understanding the system. Dumping this data before you have a chance to develop this understanding is premature, in my opinion.
Ignoring the uninteresting data could also hinder your understanding of an event. As discussed in the previous section on correlation analysis, when an event occurs, you want to gather as much information around that event as possible. Artificial ignorance might ignore log entries that are related to the event but appear to be uninteresting. Those entries could even be deleted, handicapping your understanding of the event. Lastly, this data can be important for other tasks or techniques in log analysis, as a later section will illustrate.
Reporting
Beyond notifications and alerts, most log analysis tools can also create a report of their analysis. The report is customized to your system, and probably to your management, because you will want to see all system problems or, at the very least, a summary view of the events.
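A summary view of this kind can be sketched in a few lines. The example below assumes events have already been parsed into (host, severity, message) tuples; that structure, the severity names, and the sample events are assumptions for illustration, not the output format of any specific tool.

```python
from collections import Counter

SEVERITIES = ("critical", "error", "warning", "info")


def summarize(events):
    """Build a plain-text summary from (host, severity, message) tuples."""
    by_severity = Counter(sev for _, sev, _ in events)
    problem_hosts = Counter(
        host for host, sev, _ in events if sev in ("critical", "error")
    )

    lines = ["Event summary", "============="]
    for sev in SEVERITIES:
        lines.append(f"{sev:<10}{by_severity.get(sev, 0):>5}")
    lines.append("")
    lines.append("Hosts with error or critical events")
    for host, count in problem_hosts.most_common():
        lines.append(f"{host:<10}{count:>5}")
    return "\n".join(lines)


# Hypothetical parsed events.
events = [
    ("node01", "info", "job 1234 started"),
    ("node02", "error", "EDAC MC0: 1 CE memory read error"),
    ("node02", "critical", "ambient temperature above limit"),
    ("node03", "warning", "disk usage at 91%"),
]

print(summarize(events))
```

A management-facing report would likely keep only the counts, whereas an administrator's version might append the individual messages for each problem host.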
Reports can also be your answer to compliance requests, with all the information needed to prove compliance (or show non-compliance). This feature is critical in Europe, where compliance with the General Data Protection Regulation (GDPR) is very important. If a request to remove specific data has been made, you must be able to prove that it was done. A log analysis tool should be able to confirm this in a report.