Log analysis in high-performance computing
State of the Cluster
Visualizing the Data
Humans are great pattern recognition engines. We don't do well with text scrolling past on the screen, but we can pick out patterns in visual data. To help, the Kibana [10] tool ties into the Elastic Stack to provide visualization. At a high level, it can build charts, graphs, and dashboards from pretty much any of the log data you have indexed.
Installing Kibana (Figure 1) is easiest from your package manager. If you read through the configuration documentation, you will see a large number of options. I recommend starting with the defaults [11] before changing any of them. The documentation begins with alert and action settings, followed by many other configuration options.
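As a concrete example, the following is a minimal sketch for a Debian or Ubuntu system, assuming the Elastic package repository was already added when you installed Elasticsearch (use dnf or zypper as appropriate on other distributions):

# Install Kibana from the Elastic package repository
sudo apt-get update && sudo apt-get install kibana

# Start Kibana now and on every boot
sudo systemctl enable --now kibana.service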
ELK Stack Wrap Up
Other components plug into the ELK stack. Fortunately, they have been developed and tested together, so they should "just work." Most importantly, they cover the components of a log analysis system mentioned earlier: log collection, log data conversion and formatting, log search and analysis, and visualization.
AI
The earlier discussion of the technologies used in log analysis touched on machine learning, in particular artificial ignorance, which uses machine learning to ignore, and possibly discard, log entries before searching the data. Pattern recognition also uses machine learning. Although I don't know the details of how these machine learning models make their decisions, I am not a big fan of discarding data just because it looks normal. Such data might be very useful in training deep learning models.
A deep learning model uses neural networks to transform input into some sort of output. These networks are trained on sample inputs paired with the matching expected output(s). To train such a model adequately, you need a very large amount of data that spans a wide range of conditions.
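To make the training loop concrete, the following is a minimal sketch in Python with PyTorch; the network size and the random stand-in data are placeholders, not values from a real system:

import torch
import torch.nn as nn

# Stand-in data: sample inputs paired with the matching expected outputs.
# A real model would need a much larger, representative dataset.
n_samples, n_features, n_classes = 10_000, 20, 10
X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_classes, (n_samples,))

# A small feed-forward network: input -> hidden layer -> class scores
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_classes),
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # compare predictions with expected outputs
    loss.backward()              # compute gradients
    optimizer.step()             # update the network weights
    print(f"epoch {epoch}: loss {loss.item():.4f}")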
For non-language models, the data should be balanced across the desired outputs. For example, if you want to classify images into 10 defined classes, the training data should be fairly evenly distributed across those 10 classes. You can see the effect of a bad distribution when you run test images through the model and it has difficulty assigning images to a particular class. It might identify a cat as a dog if you don't have enough data, enough data in each class, or a broad enough dataset.
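Before training, a quick way to spot this kind of imbalance is simply to count the samples per class. This sketch assumes integer class labels in a NumPy array (the labels here are random placeholders):

import numpy as np

labels = np.random.randint(0, 10, size=50_000)  # placeholder labels

# Count how many samples fall into each of the 10 classes
counts = np.bincount(labels, minlength=10)
for cls, n in enumerate(counts):
    print(f"class {cls}: {n} samples ({100 * n / labels.size:.1f}%)")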
If you are interested in images or events that happen only very rarely, developing an appropriate dataset is difficult, which in turn makes it difficult to create an adequately trained model. An example is fraud detection, where the model is supposed to identify when a fraudulent transaction happens; however, these are very rare events, despite what certain news agencies say.
If you take data from transaction processing, virtually the entire dataset will be filled with non-fraudulent data. Perhaps only a single-digit number of fraudulent transactions are in the dataset, so you have millions of non-fraudulent transactions and maybe three or four fraudulent ones – most definitely a very unbalanced dataset.
For these situations, you invert the problem by creating a model that learns what non-fraudulent transactions look like. Now you have a dataset that is useful for training: you throw millions of normal transactions at the model, and anything that deviates strongly from the patterns it has learned is flagged, giving you basically one output: Is the transaction fraudulent? A fraudulent transaction then becomes relatively easy for the model to detect. Of course, other approaches will work, but this is a fairly common way of training models to detect rare events. Therefore, you shouldn't discard the non-interesting data, because it can be used to train a very useful neural network model.
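One common concrete form of this inversion, sketched below, is an autoencoder trained only on normal transactions: because the network has never learned to reconstruct fraud, a large reconstruction error flags a suspicious transaction. The features, network size, and threshold here are hypothetical placeholders:

import torch
import torch.nn as nn

n_features = 16

# Placeholder "normal" transactions; real features would come from
# your transaction (or log) data.
normal = torch.randn(100_000, n_features)

# Autoencoder: compress to a small bottleneck, then reconstruct
autoencoder = nn.Sequential(
    nn.Linear(n_features, 4),
    nn.ReLU(),
    nn.Linear(4, n_features),
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(normal), normal)  # learn to reproduce normal data
    loss.backward()
    optimizer.step()

# Score new transactions: a large reconstruction error means "unusual"
def anomaly_score(x):
    with torch.no_grad():
        return ((autoencoder(x) - x) ** 2).mean(dim=1)

threshold = 1.5  # hypothetical cutoff, tuned on held-out data
new_batch = torch.randn(5, n_features)
print(anomaly_score(new_batch) > threshold)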
In the case of HPC systems, a trained deep learning model can augment your own observations of the system's "normal" behavior: if something doesn't seem typical, the model can notify you quickly. Atypical events on HPC systems are not everyday occurrences, but they do happen; for example, a user starts logging in to the system later than usual or starts running different applications. Therefore, the dataset for an HPC system could contain a fair number of such events, and you don't have to "invert" the model to look for non-events.