Lead Image © Gerard Boissinot, Fotolia.com

System logging for data-based answers

Log Everything

Article from ADMIN 43/2018
To be a good HPC system administrator for today's environment, you need to be a lumberjack.

Oh, I'm a lumberjack, and I'm okay,

I sleep all night and I work all day.

– from "Lumberjack" by Monty Python

Can't you just imagine yourself in the wilds of British Columbia swinging your ax, breathing fresh air, sleeping under the stars?!!! I can't either, but Monty Python's "Lumberjack" song has a strong message for admins, particularly HPC admins – Log Everything.

Why log everything? Doesn't that require a great deal of work and storage? The simple answer to these questions is yes. In fact, you might need to start thinking about a small logging cluster in conjunction with the HPC computational cluster. In return, such a setup gives you the data you need to answer questions.

Answering questions is the cornerstone of running HPC systems. These questions include those from users such as, "Why is my application not running?" or "Why is my application running slow?" or "Why did I run out of space?" They also include system administrator questions such as, "What commands did the user run?" or "What nodes was the user allocated during their run?" or "Is the user storing a bunch of Taylor Swift videos?"

If you haven't read about the principle of Managing Up [1], you should. One of the keys of this dynamic is anticipating questions your manager might ask, such as something seemingly as simple as "How's the cluster running?" or something with a little more meat to it such as "Why isn't Ms. Johnson's application running?" or perhaps the targeted question, "How could you screw up so badly?" Implicit in these questions are questions from your manager's manager, and on up the chain. Managing up means anticipating the questions or situations that might be encountered up the management chain (in effect, answering the Bobs' question about what you actually do). More than likely, management is not being abusive, but several people have taken responsibility for spending a great deal of money on your fancy cluster, and they want to know how it's being utilized and whether it's worth the investment.

The way to answer these questions is to have data. Data-based answers are always better than guesses or suppositions. What's the best way to have data? Be a lumberjack and log everything.

Logging

Regardless of what you monitor, you need to be a lumberjack and log it. HPC systems can be running a few nodes or tens of thousands of nodes. The metrics for each node need to be monitored and logged.

The first step in logging is deciding how to write the logs. For example, you could write the logs as a non-root user to a file located somewhere on a common cluster filesystem. A simple way to accomplish this is to create a special user, perhaps lumberjack, and have this user write logs to their /home directory that is mounted across the cluster.

Each node should write to its own logfile, with the node name in the file name, so you can determine the source of the messages. You should also put a time stamp on each log entry so that you can get a time history of events.
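As a minimal sketch of this idea, the helper below appends one time-stamped line to a per-node file. The function name lumberjack_log and the LUMBERJACK_DIR location are assumptions made for illustration; adjust them to wherever the lumberjack user's shared directory is mounted at your site.

```shell
# Hypothetical helper: append one time-stamped entry to a per-node logfile.
# LUMBERJACK_DIR is an assumed directory on the shared cluster filesystem.
LUMBERJACK_DIR="${LUMBERJACK_DIR:-/home/lumberjack/logs}"

lumberjack_log() {
    # Short node name (strip any domain part) identifies the source.
    lj_node="$(uname -n)"
    lj_node="${lj_node%%.*}"
    # UTC time stamp in ISO 8601 form sorts correctly as plain text.
    lj_stamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    mkdir -p "$LUMBERJACK_DIR" || return 1
    # One file per node, so nodes never write to the same file.
    printf '%s %s %s\n' "$lj_stamp" "$lj_node" "$*" >> "$LUMBERJACK_DIR/$lj_node.log"
}
```

A call such as lumberjack_log "job 1234 started" then adds a line to the file named after the node it ran on. Using UTC keeps entries from different nodes comparable even if their local time zones differ.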

Another good option relative to writing logs to a user directory is to use the really cool Linux tool logger [2], which allows a user to write a message to the system logs. For example, you could easily run the command

$ logger "Just a notification"

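to write a message to the standard system log on each node (/var/log/syslog on Debian-based distributions, /var/log/messages on Red Hat-based ones). By default, the entry is recorded with a time stamp. Note that the -f <file> option does not select the destination log; it tells logger to log the contents of <file> instead of a message given on the command line. To steer messages to a different log, set a facility and priority with the -p option (e.g., -p local0.info) and add a rule for that facility to the syslog configuration on the node.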
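The logger command can also tag each message and set its facility and priority, which makes the entries easier to filter later. The wrapper below is only a sketch: the name log_event, the hpc-admin tag, and the local0.notice priority are illustrative choices, and the LOGGER_CMD variable exists solely so the wrapper can be dry-run with a stand-in command.

```shell
# Hypothetical wrapper around logger(1).
#   -t sets the tag that appears in each log entry
#   -p sets facility.priority (local0.notice is an illustrative choice)
# LOGGER_CMD is an assumed hook for dry runs; by default logger is called.
log_event() {
    ${LOGGER_CMD:-logger} -t hpc-admin -p local0.notice -- "$*"
}
```

For example, log_event "node001 fan failure" sends the message to the local syslog daemon; where local0 messages end up depends on the syslog configuration on the node.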

If you haven't noticed yet, logger writes the messages to the local logs, so each node has its own log. However, you really want to collect the logs in a single location to parse them together; therefore, you need a way to gather the logs from all of the nodes to a central location.

A key to good logging habits is to copy or write logs from remote servers (compute nodes) to a central logging server, so you have everything in one place, making it easier to make sense of what is happening across the cluster. You have several ways to accomplish this, ranging from easy to a bit more difficult.

Remote Logging the Easy Way

The simple way to perform remote logging comes with some risk: Configure a cron job on every node that periodically copies the node system logs to the centralized log server. The risk is that logs are copied only at the interval specified in the cron job, so if a node fails between copies, anything logged since the last copy never reaches the log server.

A simple script for copying the logs would likely use scp to copy the logs securely from the local node to the log server. You can copy whatever logs or files you like. A key consideration is what you name the files on the log server. Most likely, you will want to put the node name in the name of the logfiles. For example, the name might be node001-syslog, which allows you to store the logs without worrying about overwriting files from other nodes.

Another key consideration is to include the time stamp when the log copy occurs, which, again, lets you keep files separate without fear of overwriting and makes the creation of a time history much easier.
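The naming scheme described above might be sketched as follows. The names LOGHOST, REMOTE_DIR, remote_log_name, and copy_log are hypothetical placeholders, and the script assumes key-based SSH access to the log server so that scp can run unattended from cron.

```shell
# Hypothetical settings: replace with your log server and target directory.
LOGHOST="${LOGHOST:-loghost}"
REMOTE_DIR="${REMOTE_DIR:-/var/log/cluster}"

# Build "<node>-<logname>-<time stamp>" so copies never overwrite each other.
#   $1 = node name, $2 = path of the local log, $3 = time stamp
remote_log_name() {
    printf '%s-%s-%s\n' "$1" "$(basename "$2")" "$3"
}

# Copy one local log to the log server under a node- and time-specific name.
#   $1 = path of the local log
copy_log() {
    cl_node="$(uname -n)"
    cl_node="${cl_node%%.*}"
    cl_stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    scp -q "$1" "$LOGHOST:$REMOTE_DIR/$(remote_log_name "$cl_node" "$1" "$cl_stamp")"
}
```

Calling copy_log /var/log/syslog from a cron job on node001 would produce a file on the log server with a name of the form node001-syslog-<time stamp>. Putting the time stamp last keeps all copies from one node grouped together and in chronological order in a directory listing.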

Remote Logging with rsyslog

Another popular option is rsyslog [3], an open source tool (the name stands for "rocket-fast system for log processing") for forwarding log messages to a central server over an IP network. It is highly configurable through the /etc/rsyslog.conf file and the files in the /etc/rsyslog.d/ directory. Because the tool is so configurable and flexible, be sure to read the man pages very carefully.

You can get started fairly easily with rsyslog by using the defaults. On the remote host that collects the logs, you begin by editing the /etc/rsyslog.conf file, uncommenting the following lines:

$ModLoad imtcp
$InputTCPServerRun 514

These lines load the TCP input module and tell rsyslog to listen for incoming messages on TCP port 514, the port commonly used for syslog traffic. After the change, restart the rsyslog service.

On every node that is to send its logs to the logging node, you need to make some changes. First, in the file /etc/rsyslog.d/loghost.conf, make sure you have a line such as

*.* @@<loghost>:514

where <loghost> is the name of the logging host (use either the IP address or a resolvable hostname), the *.* refers to all logging facilities and priorities, the @@ portion tells rsyslog to use TCP for log transfer (an @ alone would tell it to use UDP), and 514 is the TCP port. After this change is made, restart the rsyslog service on each node that is to send its logs to the logging server.
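On a large cluster you will probably want to script this step rather than edit the file by hand on every node. The function below is a minimal sketch that writes the forwarding rule shown above into a drop-in file; the name write_forward_rule and its optional directory argument are assumptions for illustration.

```shell
# Hypothetical helper: write the TCP forwarding rule into a drop-in file.
#   $1 = log host (IP address or resolvable hostname)
#   $2 = drop-in directory (optional; defaults to /etc/rsyslog.d)
write_forward_rule() {
    wf_dir="${2:-/etc/rsyslog.d}"
    # "*.*" selects all facilities and priorities; "@@" selects TCP.
    printf '*.* @@%s:514\n' "$1" > "$wf_dir/loghost.conf"
}
```

Run as root on a node, write_forward_rule loghost01 creates /etc/rsyslog.d/loghost.conf, after which the rsyslog service still has to be restarted (on systemd-based distributions, typically with systemctl restart rsyslog). A quick check is to run logger on the node and look for the message on the logging server.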

In the logfiles on the logging server, the hostname of the node will appear, so you can differentiate logs on the basis of hostnames.

You can use either of these approaches, or one that you create, to store all of the system logs in a central location (a logging server). Linux comes with standard logs that can be very useful [4]; alternatively, you might want to think about creating your own logs. In either case, you can log whatever information you feel is needed. The next few sections present some options you might want to consider.
