Log analysis in high-performance computing
State of the Cluster
NVIDIA Morpheus
I want to mention a new tool from NVIDIA designed for cybersecurity. In the interest of full disclosure, I work for NVIDIA as my day job, but I'm not endorsing the product, nor do I have any inside knowledge of it, so I will rely only on publicly available links. That said, NVIDIA has a new software development kit (SDK) that addresses cybersecurity with neural network models. The SDK, called Morpheus [12], is "… an open application framework that enables cybersecurity developers to create optimized AI pipelines for filtering, processing, and classifying large volumes of real-time data." It provides real-time inference (not training) on cybersecurity data.
Log analysis looks for many of the same kinds of events that cybersecurity tools do, including actual cybersecurity events. The product web page lists some possible use cases:
- digital fingerprinting
- sensitive information detection
- crypto mining malware detection
- phishing detection
- fraudulent transaction and identity detection
- ransomware
The digital fingerprinting use case "Uniquely fingerprint(s) every user, service, account, and machine across the enterprise data center – employing unsupervised learning to flag when user and machine activity patterns shift." This capability is useful for spotting break-ins on HPC systems, but it can also watch for shifting usage patterns in general. Instead of developing a fingerprint of user behavior, Morpheus could be used to create a fingerprint of application versions.
For example, a user might start out using a specific version of an application, but as time goes on, they might move to a newer version. This shift signals the administrator to install and support the new version and to consider deprecating old versions. This scenario is especially likely in deep learning applications because frameworks develop quickly, new frameworks are introduced, and other frameworks stop being developed.
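To make the version-shift idea concrete, here is a minimal sketch in plain Python. It does not use the Morpheus API; it simply counts which version of an application is used most each month and reports when the most-used version changes. The record layout (month, user, application, version) and the sample data are assumptions for illustration, standing in for whatever your module-load or job-accounting logs provide.

```python
from collections import Counter, defaultdict

# Hypothetical records parsed from logs: (month, user, application, version).
# The field layout and the values are assumptions for illustration only.
records = [
    ("2023-01", "alice", "pytorch", "1.13"),
    ("2023-01", "bob", "pytorch", "1.13"),
    ("2023-01", "carol", "pytorch", "1.13"),
    ("2023-02", "alice", "pytorch", "2.0"),
    ("2023-02", "bob", "pytorch", "1.13"),
    ("2023-02", "carol", "pytorch", "1.13"),
    ("2023-03", "alice", "pytorch", "2.0"),
    ("2023-03", "bob", "pytorch", "2.0"),
    ("2023-03", "carol", "pytorch", "1.13"),
]


def version_share(records, application):
    """Return {month: {version: fraction of uses}} for one application."""
    counts = defaultdict(Counter)
    for month, _user, app, version in records:
        if app == application:
            counts[month][version] += 1
    shares = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[month] = {v: n / total for v, n in counter.items()}
    return shares


def dominant_version_changes(shares):
    """List (month, old, new) each time the most-used version changes."""
    changes = []
    previous = None
    for month, share in shares.items():
        current = max(share, key=share.get)
        if previous is not None and current != previous:
            changes.append((month, previous, current))
        previous = current
    return changes


shares = version_share(records, "pytorch")
changes = dominant_version_changes(shares)
for month, old, new in changes:
    print(f"{month}: most-used version moved from {old} to {new}")
```

A trained model would pick up subtler shifts than this simple majority count, but even a report like this tells you when it is time to think about supporting a new version.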
Another example might be job queues that grow longer than usual, perhaps indicating that more resources are needed. A model that detects this event and reports the likely cause would be extremely useful.
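As a rough illustration of "longer than usual," the following sketch flags queue-length samples that sit far above the recent average. It is a simple rolling z-score, not a neural network and not part of Morpheus. The function name, the window size, the threshold, and the sample data are all assumptions you would tune for your own scheduler logs.

```python
from statistics import mean, stdev


def flag_long_queues(samples, window=7, threshold=3.0):
    """Return indices of samples far above the preceding window's average.

    A sample is flagged when it exceeds the window mean by more than
    `threshold` sample standard deviations.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any increase as unusual.
            if samples[i] > mu:
                flagged.append(i)
        elif (samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of pending jobs (made-up numbers).
queue_lengths = [12, 15, 11, 14, 13, 12, 16, 14, 13, 45, 15]
unusual_days = flag_long_queues(queue_lengths)
print("Unusually long queue on day index:", unusual_days)
```

This sketch only detects the event. Explaining why the queue grew (e.g., a burst of large jobs from one group or nodes that are down) takes correlation with other logs, which is where a trained model could earn its keep.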
The framework could also be used to watch data storage trends. Storage use will certainly grow, but the users who consume the most space or have the most files can be identified and watched in case something unusual is going on (like downloading too many KC and The Sunshine Band mp4s to the HPC). Morpheus has a wide range of possible uses in HPC, especially for log analysis in general.
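The storage example can be sketched the same way. Assuming you have two per-user usage snapshots (say, from periodic quota reports), the code below lists the largest consumers and the users whose usage grew by some factor between snapshots. The function names, the growth factor, and the numbers are hypothetical, and nothing here calls Morpheus.

```python
# Hypothetical per-user storage usage in GiB from two snapshots.
previous = {"alice": 120, "bob": 300, "carol": 80}
current = {"alice": 130, "bob": 310, "carol": 400, "dave": 20}


def top_consumers(usage, n=3):
    """Return the n (user, usage) pairs with the largest usage."""
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:n]


def fast_growers(previous, current, factor=2.0):
    """Return (user, before, now) where usage grew by at least `factor`.

    Users with no earlier (or zero) usage are skipped because a growth
    ratio cannot be computed for them.
    """
    growers = []
    for user, now in current.items():
        before = previous.get(user)
        if before and now / before >= factor:
            growers.append((user, before, now))
    return growers


print("Top consumers:", top_consumers(current, n=2))
print("Fast growers:", fast_growers(previous, current))
```

The same pattern works for file counts instead of capacity; just swap in a different snapshot.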
Summary
Log analysis is a very useful tool for many administration tasks. You can use it for cybersecurity, understanding how an HPC cluster normally behaves, identifying events and trends within the cluster, spotting the need for more resources, or anything else you want to learn about the cluster. Whether to use log analysis in your HPC systems is up to you, because it can mean adding several servers and a fair amount of storage to your system. However, don't think of log analysis as just a cybersecurity tool. You can use it for many HPC-specific tasks, and it can greatly improve the administration of the system. Plus, it can make pretty reports, which management always loves.
Log analysis will probably morph into an AI-based tool that takes the place of several of the technologies used today. Instead of a single trained AI, a federated set of trained networks will probably be used. Other models will likely go back and review past logs, either for training or to create a description of how the cluster behaves. This area of HPC technology has lots of opportunities.
Infos
- [1] "Log Management" by Jeff Layton: https://www.admin-magazine.com/HPC/Articles/Log-Management
- [2] NTP: https://en.wikipedia.org/wiki/Network_Time_Protocol
- [3] Splunk: https://www.splunk.com/en_us/blog/learn/log-management.html
- [4] Splunk alternatives: https://www.mezmo.com/blog/5-splunk-alternatives-for-logging-their-benefits-shortcomings-and-which-one-to-choose
- [5] Elastic: https://www.elastic.co/
- [6] Logstash: https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html
- [7] Elasticsearch: https://en.wikipedia.org/wiki/Elasticsearch
- [8] Lucene: https://lucene.apache.org/
- [9] Elasticsearch defaults: https://www.elastic.co/guide/en/elasticsearch/reference/current/settings.html
- [10] Kibana: https://en.wikipedia.org/wiki/Kibana
- [11] Kibana defaults: https://www.elastic.co/guide/en/kibana/current/settings.html
- [12] Morpheus: https://developer.NVIDIA.com/morpheus-cybersecurity