Monitoring HPC Systems: What Should You Monitor?
I have to admit that monitoring is one of my favorite HPC Admin topics. I started out in HPC a long time ago and very quickly moved into (Beowulf) clusters. I became a cluster administrator around 1998, with some serious administration beginning around 2000, at a large aerospace company. The company had replaced some large SGI Origin systems with clusters, and the pressure was on to make them work well, because they were part of a production-level HPC installation. The clusters were about a 40x price/performance improvement over the Origins, so everyone wanted to see just how well these systems could perform. Performance and reliability were two big metrics being watched by management. As a consequence, I spent a great deal of time monitoring the system for performance and to see whether any of the nodes were down and for how long (why they went down was another story). Therefore, I became quite the “nut” about cluster monitoring, and I still am to this day.
If you put a number of cluster admins in a room together (e.g., the BeoBash), and you ask, “What is the best way to monitor a cluster?” you will have to duck and cover pretty quickly from the huge number of opinions and the great passion behind the answers. With many ways to monitor and many aspects of the system to monitor, you can get many opinions on this subject. Having so many options and opinions is not a bad thing, but how do you sort through the ideas to find something that works for you and your situation? Here, and in the next several articles, I present my views on the subject as they exist today. (I reserve the right to change my mind, and I reserve the right to be completely wrong.)
In this first article, I talk about monitoring from the perspective of understanding what is happening in the system. Later, I will discuss monitoring with an eye toward improving application performance (profiling). However, how do you profile across a cluster, and how do you make it part of a “job”? I think this is a developing area of monitoring, and one that is ripe for innovation. (Hint: It sounds like a “Big Data” problem.) Along the way, I’ll also point to different monitoring tools, and I hope to provide a few examples.
There is no such thing as a “best” monitoring tool; rather, the best tool is the one you can use and understand and the one that answers your questions (or at least helps answer them). If you have a tool or tools that you use or like, I encourage you to write about them and explain what you monitor and why the tool suits your needs. In this and future articles, I will try to stay neutral and focus on the technology and ideas. I’m not out to write a new monitoring system but rather to use existing tools and possibly modify them to help answer my questions.
With that, I’ll begin by asking the most fundamental and important question around monitoring: “What should I be monitoring?”
What to Monitor
Rather than trying different tools to see what they can do, I like to start by asking the simple question: “What am I interested in knowing?” I’m not immune to trying tools first before asking the hard questions, but I want to focus on the subject with a little more scrutiny.
One of the first things I like to know is whether a node is up or down. This sounds simple, but it’s not. Has the node crashed with a kernel panic (doesn’t respond to pings)? Does the node respond to pings, yet you can’t log into it? Or can you log into the node but can’t start a job? Is the central storage system not mounting on the node? Finally, what exactly constitutes a node being up or down? The answer to that question is really up to you based on the situation; however, my opinion is that if a node can’t run a job (an application), that node is “down.”
However, not being able to run a job encompasses many system aspects, such as network connectivity, storage availability, user authentication, and a running OS. Developing a way to capture all of this in a single metric is not easy, but many simpler options can determine whether a node is up or down. Techniques such as pinging a node or running a small script on the node can inform the master whether the node is alive. An alternative is to have the master node run a simple command over ssh (e.g., uname -r); if the command completes successfully, the node is “alive.” This approach can also flag slow-running nodes that have some issue and take a long time to complete the command (e.g., a node that is swapping). Although it won’t capture every possible way a node can be incapable of running a job, it captures quite a bit of information.
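As an illustration, here is a minimal Python sketch of the ssh approach, assuming passwordless ssh from the master node to the compute nodes; the node names and timeout values are placeholders you would replace with your own.

```python
#!/usr/bin/env python3
# Minimal sketch of an ssh-based "is the node alive?" check.
# Assumes passwordless ssh from the master to each compute node;
# the node names and timeouts below are placeholders.

import subprocess

NODES = ["n001", "n002", "n003"]   # hypothetical node list
TIMEOUT = 15                        # seconds before declaring a node unresponsive

def node_alive(node):
    """Return 'up', 'slow/down', or 'down' for a node."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", node, "uname", "-r"],
            capture_output=True, timeout=TIMEOUT,
        )
        return "up" if result.returncode == 0 else "down"
    except subprocess.TimeoutExpired:
        # ssh connected (or hung) but the command never finished --
        # often a sign of a swapping or otherwise wedged node
        return "slow/down"

if __name__ == "__main__":
    for node in NODES:
        print(f"{node}: {node_alive(node)}")
```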
Another approach is to create a short job that can be submitted to nodes between user jobs. It can be more than a simple piece of code; for example, it could do some housekeeping tasks before the next user job launches. You can even couple this approach with the simpler methods mentioned earlier.
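If you like the idea of a small health-check “job,” the following sketch shows the flavor of what it might verify before declaring a node fit to run work. The specific checks, mount points, and thresholds are assumptions for illustration, not a prescription.

```python
#!/usr/bin/env python3
# Sketch of a between-jobs health check that could be submitted like any
# other job (or run from a scheduler prolog/epilog). The checks, paths,
# and thresholds are examples only -- substitute whatever "able to run a
# job" means on your cluster.

import os
import shutil
import sys

def swap_used_fraction():
    """Fraction of swap in use, read from /proc/meminfo."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, val = line.split(":")
            values[key] = int(val.split()[0])
    total = values.get("SwapTotal", 0)
    return 0.0 if total == 0 else 1.0 - values["SwapFree"] / total

def checks():
    yield "scratch mounted", os.path.ismount("/scratch")          # hypothetical mount point
    yield "central storage writable", os.access("/home", os.W_OK)
    yield "compiler present", shutil.which("gcc") is not None
    yield "not swapping hard", swap_used_fraction() < 0.5

if __name__ == "__main__":
    failed = [name for name, ok in checks() if not ok]
    if failed:
        print("node NOT healthy:", ", ".join(failed))
        sys.exit(1)
    print("node healthy")
```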
After determining whether a node is up or down, I like to monitor resource usage, because clusters can be thought of as a set of resources: processors, memory, network, local disks, central storage, software. When a cluster is designed and purchased, you estimate what combination of resources are needed for the applications you intend to run. Being able to monitor resource usage allows you to understand whether or not you selected the right resource combination. In other words, you get feedback on system design, possibly helping to justify the expenditure and helping in the design of future systems.
Because storage monitoring is a subject unto itself, I separate it out from other monitoring aspects. At this point, then, I’m monitoring the following attributes:
- Node up or down?
- Node resources:
- Processor usage
- Memory usage
- Network usage
This list looks fairly simple, but perhaps it’s not. Just as you have to define what is meant by a node being up or down, you need to define what is meant by “usage” for each of these attributes and maybe even develop metrics for them.
Processor Usage
Processors commonly used in HPC have more than one core per socket, so do you want to gather information about the usage of each core, or are you more interested in the overall usage of the processors? The answer to this question depends on how you use your cluster.
Resource managers, commonly called job schedulers, can be configured in different ways. They can be configured so that a user gets all of the cores on a node (i.e., the node is dedicated to that user’s job), or so that a job doesn’t necessarily get all of the cores on a node, leaving the remaining cores for other jobs (i.e., the node is shared by jobs, possibly from different users).
At a high level, I would like to see the overall CPU usage of the node. This can be in the form of a total level of usage or a “load,” such as what you would see with the uptime or top commands. I use this information for a high-level overview of the processors. However, I also like to examine the load on each core in a node. In the case of a user “owning” a node, this information can tell you whether or not the user is actually using all of the cores. In the case of users sharing the nodes, this information tells you how much processor time the applications are using. Having all of this data might allow you to tweak the scheduler configuration, or in the future, it might allow you to buy processors with fewer cores or more cores.
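To make this concrete, here is a small Python sketch that reports the node-level load averages from /proc/loadavg and a per-core busy percentage computed from two samples of /proc/stat. It isn’t tied to any particular monitoring tool, and the one-second sample interval is an arbitrary choice.

```python
#!/usr/bin/env python3
# A minimal sketch (not a full monitoring tool): report node-level load
# averages from /proc/loadavg and per-core busy percentages from two
# samples of /proc/stat. The 1-second sample interval is arbitrary.

import time

def load_averages():
    """Return the 1-, 5-, and 15-minute load averages."""
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def cpu_times():
    """Return {core: (idle_time, total_time)} from the per-core lines of /proc/stat."""
    times = {}
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("cpu") and line[3].isdigit():   # cpu0, cpu1, ...
                fields = line.split()
                values = [int(v) for v in fields[1:]]
                idle = values[3] + values[4]                    # idle + iowait
                times[fields[0]] = (idle, sum(values))
    return times

def per_core_usage(interval=1.0):
    """Return {core: percent busy} over the sample interval."""
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    usage = {}
    for cpu, (idle1, total1) in after.items():
        idle0, total0 = before[cpu]
        delta_total = (total1 - total0) or 1
        busy = delta_total - (idle1 - idle0)
        usage[cpu] = 100.0 * busy / delta_total
    return usage

if __name__ == "__main__":
    print("load averages (1/5/15 min):", load_averages())
    for cpu, pct in sorted(per_core_usage().items()):
        print(f"{cpu}: {pct:.1f}% busy")
```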
Memory Usage
Measuring memory usage is a little trickier, depending on what you want to measure. For example, do you want to measure the memory usage of each user application running on the node, or do you want an overall view of memory usage on the node? How you answer questions such as these greatly influences what you measure and how you measure it. Complicating matters, the design of Linux makes it difficult to measure memory usage by individual applications.
In general, Linux likes to grab memory for buffers and caches that are not necessarily being used directly by user applications, although user applications benefit from them. Moreover, sharable objects, which are generally shared libraries, are loaded into memory once and used by multiple applications. How should their memory usage be “shared” across the processes that use them? All of these aspects complicate the problem of determining how much memory user applications are actually using.
In short, measuring memory usage is not a trivial task, and determining how much memory is being used by user applications can be very difficult. To get a better understanding of what you can measure and how to measure it, you should read about the various tools that report memory usage. Here is a short list of articles that might help:
- Understanding Memory Usage on Linux – A good first article.
- Understanding free Command in Linux/Unix – A good article for understanding what the free command tells you.
- Find Memory or RAM and Swap Usage in Linux – A short article on what information free gives about memory usage.
- Linux Ate My RAM
- Experiments and Fun with the Linux Disk Cache
My advice is that you can get a reasonable idea of how much memory the node is using, not including buffers and cache, by running the free command and subtracting some numbers. The Understanding free Command in Linux/Unix article helps you understand what you can measure easily with standard Linux commands. At the very least, you’ll be able to measure how much memory is being used minus buffers and caches, which includes all user applications, root applications, and shared libraries. It might not be exactly what you want, but getting anything more detailed or granular requires a great deal more work.
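For example, a small Python sketch along these lines can report “used minus buffers and cache” directly from /proc/meminfo, which is where free gets its numbers. The fallback arithmetic for older kernels is an assumption you should check against your own distribution.

```python
#!/usr/bin/env python3
# Sketch: report memory used "minus buffers and cache" by parsing
# /proc/meminfo -- the same source the free command works from.
# Older kernels lack MemAvailable, so fall back to the classic
# MemFree + Buffers + Cached arithmetic in that case (an assumption
# to verify on your own systems).

def meminfo():
    """Return /proc/meminfo values (in kB) as a dict."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])   # strip the trailing 'kB'
    return info

def used_minus_buffers_cache():
    m = meminfo()
    total = m["MemTotal"]
    if "MemAvailable" in m:                     # modern kernels
        used = total - m["MemAvailable"]
    else:                                       # classic free-style arithmetic
        used = total - m["MemFree"] - m["Buffers"] - m["Cached"]
    return total, used

if __name__ == "__main__":
    total, used = used_minus_buffers_cache()
    print(f"total: {total} kB, used (minus buffers/cache): {used} kB")
```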
Network Usage
Monitoring network usage can be a double-edged sword. Generally, what you are interested in is making sure the network is not an unplanned bottleneck for application performance, but that is hard to measure directly. The common answer is to monitor network usage as a proxy: if usage stays at a very large percentage of capacity over an extended period of time, the network has likely become a bottleneck.
Remember that the design of the network itself can also create bottlenecks. For example, if you use oversubscription at certain points in the network topology, you create a bottleneck by design. Ideally, these bottlenecks don’t unduly affect application performance, so when you monitor your network, you should expect those portions of the network to be somewhat saturated a reasonable amount of the time.
A large number of tools can monitor network performance. All of them report usage in some form for packets transmitted (sent from the node to other nodes) and received (sent to the node from other nodes). Many of the tools report network usage in packets per second for both transmits and receives, and some can also report bandwidth in both directions. These tools primarily exist for Ethernet networks; equivalent methods exist for gathering network statistics on InfiniBand networks, and I’m sure proprietary networks have similar methods.
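As a rough illustration for Ethernet interfaces, the following Python sketch computes per-interface transmit and receive rates (bytes and packets per second) from two samples of /proc/net/dev. The interfaces and sample interval are whatever your nodes actually have; InfiniBand counters would need a different source.

```python
#!/usr/bin/env python3
# Sketch: per-interface transmit/receive rates (bytes and packets per
# second) from two samples of /proc/net/dev. The 1-second interval is
# arbitrary, and interface names depend on your nodes.

import time

def read_counters():
    """Return {iface: (rx_bytes, rx_packets, tx_bytes, tx_packets)}."""
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:          # skip the two header lines
            iface, data = line.split(":", 1)
            fields = [int(v) for v in data.split()]
            # fields 0,1 = rx bytes/packets; fields 8,9 = tx bytes/packets
            counters[iface.strip()] = (fields[0], fields[1], fields[8], fields[9])
    return counters

def rates(interval=1.0):
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    for iface, new in after.items():
        old = before.get(iface, new)
        yield iface, [(n - o) / interval for n, o in zip(new, old)]

if __name__ == "__main__":
    for iface, (rxb, rxp, txb, txp) in rates():
        print(f"{iface}: rx {rxb:.0f} B/s ({rxp:.0f} pkt/s), "
              f"tx {txb:.0f} B/s ({txp:.0f} pkt/s)")
```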
Frequency of Data Gathering
Donald Becker, one of the originators of Beowulf clusters, created a very innovative cluster tool called BProc. I remember an email exchange on the Beowulf mailing list in which Don was talking about monitoring, and he made a point that I think a few people missed. His point was: Why do you need very frequent updates of a monitoring metric when that metric doesn’t change or changes very little? His example was getting updates once a second on a 15-minute load average for a node. Wouldn’t it be better to get an update on that metric every few minutes or even every 15 minutes?
I think Don’s point is well taken. In addition to defining monitoring attributes and metrics that are important to you, you also need to think very carefully about how frequently you get updates on those metrics. More than likely you’d like to store the metrics for some period of time so you can collect a historical account of the node. Keeping lots of metric values updated very frequently means you have to store and process a great deal of information.
Reviewing the original metrics, I’ll discuss the frequency at which to gather each:
- Is the node alive or dead? (up or down)
- Node resources:
- Processor usage (per node and per core)
- Memory usage (total and possibly per process or per user)
- Network usage (per Ethernet interface and/or InfiniBand)
Assume you’re going to store values associated with these metrics for a long period of time (at least a year). The frequency at which to capture data really depends on your situation and your user base and workloads. For example, if your user base typically runs applications that execute for hours or even days, then you might be able to capture data about whether a node is alive or dead infrequently, perhaps every 5-15 minutes, or even longer if you like. Checking it once every 3 seconds seems a bit like overkill to me.
If you have users with jobs that execute in less than an hour or in a few minutes, you might want to get more frequent information about the state of nodes. For jobs that run in less than 15 minutes, then, you might want updates on the node status every 10-60 seconds. If you have a large number of jobs that run in less time, you might want to consider getting information even more frequently. However, don’t try to capture node data too fast, because it will increase your storage requirements and also put more pressure on your network.
The same approach is good for the node resource metrics: processor usage, memory usage, and network usage. If the jobs run fairly quickly, then update these metrics fairly frequently. If the jobs run longer, you can gather these metrics less frequently.
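One way to act on this advice is to give each metric its own collection interval. The sketch below is only a skeleton: the collector functions are stand-ins for the earlier examples, and the intervals are placeholders you would tune to your workloads.

```python
#!/usr/bin/env python3
# Skeleton of a collector that samples different metrics at different
# intervals -- the point being that a slowly changing metric does not
# need to be gathered every second. Collector functions and intervals
# below are placeholders, not recommendations.

import time

def check_alive():    return "up"            # stand-in for the ssh check above
def sample_cpu():     return "cpu sample"    # stand-in for the /proc/stat sketch
def sample_memory():  return "mem sample"    # stand-in for the /proc/meminfo sketch
def sample_network(): return "net sample"    # stand-in for the /proc/net/dev sketch

# metric name -> (collector, interval in seconds)
SCHEDULE = {
    "alive":   (check_alive,    300),   # every 5 minutes
    "cpu":     (sample_cpu,      60),
    "memory":  (sample_memory,   60),
    "network": (sample_network,  30),
}

def run():
    last_run = {name: 0.0 for name in SCHEDULE}
    while True:
        now = time.time()
        for name, (collector, interval) in SCHEDULE.items():
            if now - last_run[name] >= interval:
                print(f"{time.strftime('%H:%M:%S')} {name}: {collector()}")
                last_run[name] = now
        time.sleep(1)

if __name__ == "__main__":
    run()
```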
Other factors affect the frequency at which you collect these metrics, including the number of nodes in the system. The more nodes you have, the more data you collect, so if all of the nodes send data at approximately the same time, the receiving node will have to deal with lots of incoming data. This has to be taken into account when designing the system or determining the frequency at which you gather metrics.
Complementary to the number of nodes in the system is the network itself, which can become an obvious limitation if you push too much monitoring data onto it too quickly. The design of the network limits how much monitoring data you can push to the master node (which I assume is doing the monitoring). For example, if you only have a Fast Ethernet (100Mbps) network for your monitoring data, you will be limited in how many metrics you can gather, how frequently you can gather them, and how many nodes can push data to the collection node. A shared GigE (1,000Mbps) network used for application communications (MPI), provisioning, and monitoring can also limit how much data you can collect and how often. As previously mentioned, oversubscribed networks are another source of bottlenecks, so you need to be careful.
That said, many of these obstacles have been fairly well conquered by admins with tools that gather monitoring metrics, allowing them to keep an eye on what is happening within the cluster. However, the number of nodes and the number of cores per node are rapidly increasing, straining data collection and gathering capabilities. Moreover, if you start layering on additional data gathering (e.g., storage information and application performance information), you can quickly overwhelm your monitoring capability. My warning is to be cautious and to really think about where your systems are heading and what sort of monitoring information might be needed.
One last word of advice has nothing to do with clusters, but I learned this a long time ago when I first became an admin. My advice is to think of the phrase “manage up.” That is, think about the information your manager or his manager will want, then focus on being able to deliver that information. I found that being able to tell my manager classic metrics, such as the percentage of the system that was running jobs at any point in time or how long nodes were down and not running jobs, went a long way in making their lives better. On the other hand, my manager was not too interested in how much memory was used unless it affected the ability to run jobs. At the same time, you have to be responsive to your users, because they are your customers. If they aren’t happy, they will push the issue up their management chain, which will come over to your management chain and down to you. Having processor, memory, and network usage data will definitely help when these problems come about (and they will come about).
Software Monitoring
So far I’ve only focused on hardware monitoring – specifically, the processor, memory, and network. At some level, however, you need to think about software monitoring. As a starting point, two primary things you should be monitoring are the resource manager (job scheduler) and the tools (software packages/versions) users are using.
Monitoring the jobs running on the cluster and those in the queue can tell you a great deal, including how long a job ran, who ran it, when it was submitted, when it started (and how long it sat in the queue), how many jobs are in the queue at any one time, which queues have the most jobs waiting (if you have multiple queues), and the most popular day of the week and time of day for submitting jobs. The job scheduler provides a great deal of information you can use to understand what your system is doing and how to improve its performance.
A simple example of job scheduler monitoring is a blog post from Harvard University’s Faculty of Arts and Sciences Research Computing Group (FAS RC) that explains how they mined the logs of their resource manager and created some very interesting plots showing how the number of jobs submitted increased over time, which day of the week had the most job submissions (Thursday, with Saturday having the fewest), and the time of day people submitted jobs (the peak was around 4:00pm, and the low point was around 6:00-7:00am). Just imagine having that information about your cluster. If you use a charge-back model, you could charge more to submit jobs at certain times and less at others. In some ways, this becomes something like an electrical utility that charges different rates at different times of the day.
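You don’t need anything elaborate to start down this path. The sketch below, in the spirit of the FAS RC analysis, tallies job submissions by weekday and hour of day; it assumes you have already exported one submit timestamp per job (ISO format, one per line) from your resource manager’s accounting records, so the file name and format are hypothetical.

```python
#!/usr/bin/env python3
# Sketch: tally job submissions by weekday and hour of day. Assumes a
# hypothetical file with one ISO-format submit timestamp per line,
# dumped from your resource manager's accounting records.

from collections import Counter
from datetime import datetime

def tally(path="submit_times.txt"):
    by_weekday, by_hour = Counter(), Counter()
    with open(path) as f:
        for line in f:
            t = datetime.fromisoformat(line.strip())   # e.g., 2013-06-20T16:05:32
            by_weekday[t.strftime("%A")] += 1
            by_hour[t.hour] += 1
    return by_weekday, by_hour

if __name__ == "__main__":
    weekdays, hours = tally()
    print("submissions by weekday:", dict(weekdays))
    print("submissions by hour:", dict(sorted(hours.items())))
```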
The ability to monitor what software packages users are utilizing is also very powerful. HPC systems typically use Environment Modules, which let users change their environment to suit their needs, such as selecting a particular compiler and set of libraries for building an application. This means users can experiment with different tools while running their jobs. The same can be done for applications, allowing different versions of the same application to be available for use. Environment Modules also make it possible to monitor which packages and tools users are using (and I’ve written about a possible method for doing this). Again, it is based on what Harvard’s FAS RC team does (see “Scientific Software as a Service Sprawl” part 1 and part 2). This type of information is also invaluable in understanding what is happening on your system.
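If you do log module loads somehow (for example, with a wrapper or hook that writes one line per load), a tally like the following sketch is about all it takes to see which packages are popular. The log path and line format here are assumptions, not something Environment Modules provides out of the box.

```python
#!/usr/bin/env python3
# Sketch: tally module usage from a hypothetical log with one line per
# load in the form "<timestamp> <user> <module/version>". The path and
# format are assumptions for illustration only.

from collections import Counter

def module_counts(path="module_loads.log"):
    by_module, by_user = Counter(), Counter()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue                     # skip malformed lines
            _timestamp, user, module = parts[0], parts[1], parts[2]
            by_module[module] += 1
            by_user[user] += 1
    return by_module, by_user

if __name__ == "__main__":
    modules, users = module_counts()
    for module, count in modules.most_common(10):
        print(f"{module}: {count} loads")
```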
Pat Your Head and Rub Your Stomach at the Same Time
The fun part is taking all of this data and creating information from it. How can you mix resource usage data from the nodes with application and tool usage information? How can you mix in data about job queues as well? What will this information tell you? What kind of knowledge can you gain? These are the types of questions you should be interested in around HPC systems (again, it sounds like a Big Data problem).
I hope to expand on this introduction to monitoring in a few more articles that talk about what metrics to gather, what tools to use, and how to put it all together. If you have any suggestions, I’d love to hear about them.
The Author
Jeff Layton has been in the HPC business for almost 25 years (starting when he was 4 years old). He can be found lounging around at a nearby Fry’s enjoying the coffee.