Monitoring HPC Systems: Process, Network, and Disk Metrics
In previous articles, I talked about cluster monitoring metrics and determining what you should monitor, then I looked at monitoring processor and memory metrics. In this article, I discuss three more classes of metrics – process, network, and disk – accompanied by simple code samples that implement these metrics.
Processes
As I mentioned previously, I use top quite a bit for quick glances into the status of my systems. Up to this point I have covered the processor and memory metrics that top uses, but I would also like to get some of the information it provides about processes. Ideally, I would like to know the number of running processes, as well as how many are “sleeping.” Fortunately, the Python system utilities module psutil has this capability. A simple function, process_iter(), returns an iterator (Python-speak) that yields a Process class instance for each running process on the local machine. You can then use this information to count the running and sleeping processes, which is the information I care about. The fairly simple script in Listing 1 is based on a top/htop clone script. (Note that the originating script is under a BSD-style license and is copyrighted by Giampaolo Rodolà, the author of psutil.)
Listing 1: Count Running and Sleeping Processes
#!/usr/bin/python

import sys

try:
    import psutil
except ImportError:
    print "Cannot import psutil module - this is needed for this application."
    print "Exiting..."
    sys.exit()

# ===================
# Main Python section
# ===================

if __name__ == '__main__':

    procs = []
    procs_status = {}

    # Walk all running processes and count them by status
    for p in psutil.process_iter():
        try:
            p.dict = p.as_dict(['username', 'get_nice', 'get_memory_info',
                                'get_memory_percent', 'get_cpu_percent',
                                'get_cpu_times', 'name', 'status'])
            try:
                procs_status[p.dict['status']] += 1
            except KeyError:
                procs_status[p.dict['status']] = 1
        except psutil.NoSuchProcess:
            pass
        else:
            procs.append(p)
        # end try
    # end for

    # Sort the processes by CPU usage (highest first)
    processes = sorted(procs, key=lambda p: p.dict['cpu_percent'], reverse=True)

    print "Total number of processes ", len(processes)
    print "   Number of Sleeping procs ", procs_status["sleeping"]
    print "   Number of Running procs ", procs_status["running"]
# end main
Running the script is pretty easy and produces output like this:
[laytonjb@home4 1]$ ./proc_test.py
Total number of processes  261
   Number of Sleeping procs  260
   Number of Running procs  1
If you need more detailed process information, I suggest you refer back to the memory metrics and use the ps_mem.py script to gather that information. Alternatively, you could refer to the top-like example code in psutil.
Network
Although network metrics are common, because every server is connected to some sort of network, the type, topology, and even number of networks vary from system to system. From an admin’s perspective, I typically use network monitoring to understand whether my systems are functioning; if they are functioning, they should be responding to network traffic. Moreover, I want to know what’s going on with my networks in general and, more specifically, whether they are performing well (i.e., whether there have been any errors). Additionally, I would like to be able to get a time history of the network interface performance, so I can correlate it with processor data. Although many HPC systems now use InfiniBand, I don’t have any IB hardware for testing (I’m always open to donations, though), so I will focus on Ethernet measurements you can use to capture network metrics.
Fortunately, Linux puts a great deal of Ethernet information in either /proc or /sys that can be parsed, which can make your life easier or harder depending on your perspective: you have access to a wealth of data, but you also run the risk of information overload. As I mentioned in the first article, you need to pay attention to what you monitor and how you monitor it.
For networks, the metrics that I want to measure are:
- The send and receive throughput for the node
- The send and receive packet rate for the node
- For each interface:
- bytes sent (total), as well as bandwidth (bytes per second)
- bytes received (total), as well as bandwidth (bytes per second)
- number of packets sent (pkts-sent), total, and packets per second (rate)
- number of packets received (pkts-recv), total, and packets per second (rate)
- errors, including:
- dropped packets (send and receive)
- bad packets (send and receive)
- packet collisions
- other errors
- multicast packets received (which helps you understand whether multicast is being used)
The above is quite a bit of information, especially if you have multiple interfaces, but this data can help you understand whether the network is being used, how heavily, and whether it is experiencing any problems.
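Most of these counters are visible directly in /proc/net/dev, even before bringing in any Python libraries. As a minimal sketch (field positions follow the standard /proc/net/dev layout, with the receive columns first and the transmit columns second; the error, drop, and multicast counters sit in the columns not printed here), the following few lines read the per-interface byte and packet totals:

#!/usr/bin/python
# Minimal sketch: read per-interface byte and packet totals from /proc/net/dev.
# The receive fields come first and the transmit fields second; interface
# names will, of course, differ from system to system.

with open("/proc/net/dev") as f:
    lines = f.readlines()[2:]               # skip the two header lines

for line in lines:
    iface, data = line.split(":", 1)
    fields = data.split()
    rx_bytes, rx_packets = int(fields[0]), int(fields[1])
    tx_bytes, tx_packets = int(fields[8]), int(fields[9])
    print("%-8s recv: %14d bytes %10d pkts   sent: %14d bytes %10d pkts"
          % (iface.strip(), rx_bytes, rx_packets, tx_bytes, tx_packets))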
psutil for Network Metrics
The first network metric monitoring script uses the psutil library that I’ve used throughout these articles. The library has a function, net_io_counters(), that is particularly useful for monitoring the overall Ethernet network use of a server. In Listing 2, I use that function to gather information about the Ethernet interfaces. I based the script on a real-time network statistics script that measures network usage. (Note: the originating script uses a BSD-style license and is copyrighted by Giampaolo Rodolà, the author of psutil.)
Listing 2: Ethernet Interfaces
#!/usr/bin/python

import sys

try:
    import time
except ImportError:
    print "Cannot import time module - this is needed for this application."
    print "Exiting..."
    sys.exit()

try:
    import psutil
except ImportError:
    print "Cannot import psutil module - this is needed for this application."
    print "Exiting..."
    sys.exit()

try:
    import re    # Needed for regex
except ImportError:
    print "Cannot import re module - this is needed for this application."
    print "Exiting..."
    sys.exit()


def bytes2human(n):
    # From the psutil sample scripts
    """
    >>> bytes2human(10000)
    '9.8 K'
    >>> bytes2human(100001221)
    '95.4 M'
    """
    symbols = ('K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')
    prefix = {}
    for i, s in enumerate(symbols):
        prefix[s] = 1 << (i + 1) * 10
    for s in reversed(symbols):
        if n >= prefix[s]:
            value = float(n) / prefix[s]
            return '%.2f %s' % (value, s)
    return '%.2f B' % (n)
# end def


#
# Routine to add commas to a float string
#
def commify3(amount):
    amount = str(amount)
    amount = amount[::-1]
    amount = re.sub(r"(\d\d\d)(?=\d)(?!\d*\.)", r"\1,", amount)
    return amount[::-1]
# end def commify3


# ===================
# Main Python section
# ===================

if __name__ == '__main__':

    # Counters before the sampling interval
    tot_before = psutil.net_io_counters()
    pnic_before = psutil.net_io_counters(pernic=True)

    # Sleep some interval so we can compute rates
    interval = 0.2
    time.sleep(interval)

    # Counters after the sampling interval
    tot_after = psutil.net_io_counters()
    pnic_after = psutil.net_io_counters(pernic=True)

    # Start output:
    print "total bytes:"
    print "   sent: %-10s" % (bytes2human(tot_after.bytes_sent))
    print "   recv: %-10s" % (bytes2human(tot_after.bytes_recv))

    nic_names = list(pnic_after.keys())
    nic_names.sort(key=lambda x: sum(pnic_after[x]), reverse=True)
    print "Interface:"
    for name in nic_names:
        stats_before = pnic_before[name]
        stats_after = pnic_after[name]
        print name
        # Divide the counter deltas by the sampling interval to get true
        # per-second rates
        print("   Bytes-sent: %15s (total) %15s (Per-Sec)"
              % (bytes2human(stats_after.bytes_sent),
                 bytes2human((stats_after.bytes_sent - stats_before.bytes_sent) / interval) + '/s'))
        print("   Bytes-recv: %15s (total) %15s (Per-Sec)"
              % (bytes2human(stats_after.bytes_recv),
                 bytes2human((stats_after.bytes_recv - stats_before.bytes_recv) / interval) + '/s'))
        print("   pkts-sent:  %15s (total) %15s (Per-Sec)"
              % (commify3(stats_after.packets_sent),
                 commify3(int((stats_after.packets_sent - stats_before.packets_sent) / interval)) + '/s'))
        print("   pkts-recv:  %15s (total) %15s (Per-Sec)"
              % (commify3(stats_after.packets_recv),
                 commify3(int((stats_after.packets_recv - stats_before.packets_recv) / interval)) + '/s'))
    # end for
# end main
This simple script gathers data for the entire node, as well as for the separate interfaces. An example of the output is below.
[laytonjb@home4 1]$ ./network_test1.py
total bytes:
   sent: 2.15 M
   recv: 39.10 M
Interface:
eth0
   Bytes-sent:          2.15 M (total)        0.00 B/s (Per-Sec)
   Bytes-recv:         39.10 M (total)      120.00 B/s (Per-Sec)
   pkts-sent:           22,334 (total)             0/s (Per-Sec)
   pkts-recv:           68,018 (total)             2/s (Per-Sec)
lo
   Bytes-sent:          2.55 K (total)        0.00 B/s (Per-Sec)
   Bytes-recv:          2.55 K (total)        0.00 B/s (Per-Sec)
   pkts-sent:               44 (total)             0/s (Per-Sec)
   pkts-recv:               44 (total)             0/s (Per-Sec)
eth1
   Bytes-sent:          0.00 B (total)        0.00 B/s (Per-Sec)
   Bytes-recv:          0.00 B (total)        0.00 B/s (Per-Sec)
   pkts-sent:                0 (total)             0/s (Per-Sec)
   pkts-recv:                0 (total)             0/s (Per-Sec)
The output doesn't contain everything I want to know, but it does capture the higher-level metrics for each network interface.
Sysfs Information for Networks
Sysfs is a virtual filesystem like /proc, but it focuses on the devices and drivers in the kernel device model. Because network interfaces are devices with drivers, sysfs contains a great deal of information about them, and you can exploit this fact to gather all kinds of useful information about a server’s Ethernet interfaces.
I was encouraged by an article by Dan Nanni that discussed how to use Bash to parse statistics for Ethernet interfaces in /sys. I wrote a very simple Python script that does roughly the same thing, but it is fairly long, primarily because of the large number of possible statistics and the amount of output it generates (the complete code is online; a minimal sketch of the approach appears after Listing 3). The sample output in Listing 3 shows only one interface because the others aren’t that interesting. Note that the script loops over all interfaces, including the loopback device (lo).
Listing 3: Parsed sysfs Network Statistics
[laytonjb@home4 2]$ ./network_test2.py
Interface: eth0
   total bytes:              sent: 16.90 G       recv: 114.04 M
   total packets:            sent: 12,243,355    recv: 1,467,914
   bytes sent:               total: 16.90 G      per-sec: 3.22 M/s
   bytes recv:               total: 114.04 M     per-sec: 16.31 K/s
   pkts sent:                total: 12,243,355   per-sec: 2,309
   pkts recv:                total: 1,467,914    per-sec: 233
   compressed pkts sent:     total: 0            per-sec: 0
   compressed pkts recv:     total: 0            per-sec: 0
   dropped pkts sent:        total: 0            per-sec: 0
   dropped pkts recv:        total: 0            per-sec: 0
   Bad pkts sent:            total: 0            per-sec: 0
   Bad pkts recv:            total: 0            per-sec: 0
   FIFO overrun errors sent:      total: 0       per-sec: 0
   FIFO overrun errors recv:      total: 0       per-sec: 0
   Frame alignment errors recv:   total: 0       per-sec: 0
   Frame alignment errors recv:   total: 0       per-sec: 0
   Recv missed errors:            total: 0       per-sec: 0
   Recv over errors:              total: 0       per-sec: 0
   Trans aborted errors:          total: 0       per-sec: 0
   Trans carrier errors:          total: 0       per-sec: 0
   Trans heartbeat errors:        total: 0       per-sec: 0
   Trans window errors:           total: 0       per-sec: 0
   Collisions while sending:      total: 0       per-sec: 0
   multicast pkts recv:           total: 192     per-sec: 0
...
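The complete script is online only, but the following minimal sketch shows the basic approach it takes: loop over the interfaces and read the raw counter files under /sys/class/net/<interface>/statistics. (The exact set of counter files depends on the driver, so treat the names that appear as examples rather than a fixed list.)

#!/usr/bin/python
# Minimal sketch: dump the raw counters under /sys/class/net/<iface>/statistics.
# Common files include rx_bytes, tx_bytes, rx_packets, tx_packets, rx_dropped,
# tx_dropped, rx_errors, tx_errors, collisions, and multicast, but the exact
# set varies by driver.

import os

base = "/sys/class/net"
for iface in sorted(os.listdir(base)):
    stats_dir = os.path.join(base, iface, "statistics")
    if not os.path.isdir(stats_dir):
        continue
    print("Interface: %s" % iface)
    for stat in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, stat)) as f:
            print("   %-20s %s" % (stat, f.read().strip()))

Computing the per-second rates shown in Listing 3 is then just a matter of reading the counters twice, a known interval apart, and dividing the differences by that interval.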
Disk Metrics
Originally, I wasn’t going to write much about disk metrics, since HPC systems usually have either a single disk for the OS or no disk at all (i.e., diskless); however, the distinction between traditional HPC systems, Big Data systems, and Hadoop systems is blurring. As a result, some nodes in the cluster could have local storage systems that are important and need to be monitored. Because you will need metrics for these systems, I want to spend a little time looking at some simple disk metrics.
Although you have a large number of local storage metrics to choose from, I’m going to focus on just a few; then, I’ll illustrate some other metrics that you can get for “free,” courtesy of the tools I’m using. Good high-level metrics for local storage are pretty simple:
- Amount of storage used, free storage available, and total space (i.e., capacity metrics)
- Read and write bandwidth (MBps)
- Number of read and write operations per second (IOPS)
A number of other metrics are interesting for storage, such as the number of I/O requests that are merged and the merge rate – a “merge” meaning individual I/O requests waiting in the queue that can be combined into a single larger request. Merges can help improve bandwidth, but they can also hurt latency.
In addition to merge statistics, it is also useful to monitor the number of I/O requests completed and the rate at which they are completed. You can compare these completion counts and rates to the number of requests issued, and their rates, to understand how efficiently requests are being completed.
I’d also like to see some of these metrics on a per-filesystem basis and some on a per-device basis. You have to be careful with “per-device” metrics, though, because if you have a large-ish number of devices in your node, you will get a great deal of information in return (i.e., information overload). Nevertheless, I’ll take the risk and gather the metrics on a per-device basis where possible.
psutil for Disk Metrics
In the first metric example, I’m going back to the psutil library I’ve used throughout this series of articles. The code uses the functions
- psutil.disk_partitions()
- psutil.disk_io_counters()
as well as the bytes2human function from the psutil example code. The code is a little long for this article, so I’ve posted it online (a minimal sketch of the core psutil calls appears after the discussion of Listing 4). The output from the code on my poor desktop is shown in Listing 4.
Listing 4: Disk Metrics
[laytonjb@home4 1]$ ./1test1.py
Device: /dev/sda1   Mount point: /       FS Type: ext4   Options: rw
   Total Space 108.17 G   Used Space: 17.21 G    Free Space: 85.46 G
Device: /dev/md0    Mount point: /home   FS Type: ext4   Options: rw
   Total Space 2.69 T     Used Space: 295.82 G   Free Space: 2.26 T
Device: /dev/sdb1   Mount point: /data   FS Type: ext4   Options: rw
   Total Space 110.03 G   Used Space: 59.62 M    Free Space: 104.39 G

sdc1 :
   Number of reads:  151,724   Number of bytes: 16.60 G    Read Rate: 5.75 M/s
   Amount of time reading: 141,290 ms
   Number of writes: 17,177    Number of bytes: 148.16 M   Write Rate: 0.00 B/s
   Amount of time writing: 172,592 ms
sdb1 :
   Number of reads:  320       Number of bytes: 1.49 M     Read Rate: 0.00 B/s
   Amount of time reading: 247 ms
   Number of writes: 1         Number of bytes: 4.00 K     Write Rate: 0.00 B/s
   Amount of time writing: 0 ms
sdd1 :
   Number of reads:  1,544     Number of bytes: 77.75 M    Read Rate: 0.00 B/s
   Amount of time reading: 12,477 ms
   Number of writes: 18,263    Number of bytes: 148.16 M   Write Rate: 0.00 B/s
   Amount of time writing: 223,137 ms
sda2 :
   Number of reads:  382       Number of bytes: 1.49 M     Read Rate: 0.00 B/s
   Amount of time reading: 40 ms
   Number of writes: 1         Number of bytes: 1.89 G     Write Rate: 0.00 B/s
   Amount of time writing: 5 ms
sda1 :
   Number of reads:  28,621    Number of bytes: 509.46 M   Read Rate: 0.00 B/s
   Amount of time reading: 17,258 ms
   Number of writes: 5,963     Number of bytes: 45.60 M    Write Rate: 0.00 B/s
   Amount of time writing: 5,845 ms
md0 :
   Number of reads:  173,034   Number of bytes: 16.67 G    Read Rate: 5.75 M/s
   Amount of time reading: 0 ms
   Number of writes: 37,734    Number of bytes: 147.37 M   Write Rate: 0.00 B/s
   Amount of time writing: 0 ms
As you can see, I have three filesystems (from the first part of the listing), and the output lists the capacity (total, used, and free) for each. The next section shows the performance of six devices, of which five are physical devices and the last is an md device for a RAID-1 group.
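The full script is posted online; the sketch below shows the two psutil calls it is built around, without the bytes2human formatting or rate calculations (getting the rates shown in Listing 4 requires sampling disk_io_counters() twice over a known interval and dividing the differences by it):

#!/usr/bin/python
# Minimal sketch: capacity per mounted filesystem plus raw per-device I/O
# counters, using only psutil. All formatting (commas, human-readable sizes,
# rates) is left out to keep the sketch short.

import psutil

for part in psutil.disk_partitions():
    usage = psutil.disk_usage(part.mountpoint)
    print("Device: %-12s Mount: %-8s FS: %-5s total: %d  used: %d  free: %d"
          % (part.device, part.mountpoint, part.fstype,
             usage.total, usage.used, usage.free))

for dev, io in psutil.disk_io_counters(perdisk=True).items():
    print("%-6s reads: %d (%d bytes, %d ms)  writes: %d (%d bytes, %d ms)"
          % (dev, io.read_count, io.read_bytes, io.read_time,
             io.write_count, io.write_bytes, io.write_time))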
/proc/diskstats
The Linux kernel puts a great deal of statistics in /proc/diskstats. If you cat /proc/diskstats, you will see various devices followed by a few numbers (11 fields in the kernel I’m using). Now, it’s just a simple matter of writing a script to parse the data and print it. The code is too long for publishing here, so you can get it online.
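As a rough sketch of what the script does, the core parsing loop looks something like the following. The field positions follow the layout described above (three identification fields followed by the statistics); newer kernels append additional fields, which the sketch simply ignores, and the per-second rates in Listing 5 come from reading the file twice over a known interval.

#!/usr/bin/python
# Minimal sketch: parse /proc/diskstats into a dictionary keyed by device name.

def read_diskstats():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            dev = fields[2]
            stats[dev] = {
                "reads_completed":  int(fields[3]),
                "reads_merged":     int(fields[4]),
                "sectors_read":     int(fields[5]),
                "ms_reading":       int(fields[6]),
                "writes_completed": int(fields[7]),
                "writes_merged":    int(fields[8]),
                "sectors_written":  int(fields[9]),
                "ms_writing":       int(fields[10]),
                "ios_in_progress":  int(fields[11]),
                "ms_doing_io":      int(fields[12]),
            }
    return stats

if __name__ == '__main__':
    for dev, s in sorted(read_diskstats().items()):
        print("%-8s reads: %d (merged %d)  writes: %d (merged %d)  io time: %d ms"
              % (dev, s["reads_completed"], s["reads_merged"],
                 s["writes_completed"], s["writes_merged"], s["ms_doing_io"]))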
The virtual filesystem provides statistics for a very large number of block devices. For example, on my desktop, it will give you data on ram0-15; loop0-7; sda, sda1, and sda2; sdb and sdb1; sdd and sdd1; sdc and sdc1; sr0; and md0.
The statistics for all of these devices are not necessarily useful because I’m not doing any I/O to most of them. Therefore, I’m only going to list a few devices in the output from the script (Listing 5).
Listing 5: /proc/diskstats Parsed
...
Device: sdc
   Reads Issued count:   153,337      Rate: 46 ops/s
   Reads Merged count:   11,285       Rate: 0 ops/s
   Read Sectors count:   35,177,038   Rate: 11,776 ops/s
   Time spent reading:   142,284 ms
   Writes completed:     11,285       Rate: 0 ops/s
   Writes Merged count:  11,285       Rate: 0 ops/s
   Write Sectors count:  304,974      Rate: 0 ops/s
   Time spent writing:   173,489 ms
   IO count in progress: 0
   IO Time:              228,383 ms
Device: sdc1
   Reads Issued count:   153,146      Rate: 46 ops/s
   Reads Merged count:   11,261       Rate: 0 ops/s
   Read Sectors count:   35,175,318   Rate: 11,776 ops/s
   Time spent reading:   142,206 ms
   Writes completed:     11,261       Rate: 0 ops/s
   Writes Merged count:  11,261       Rate: 0 ops/s
   Write Sectors count:  304,974      Rate: 0 ops/s
   Time spent writing:   173,489 ms
   IO count in progress: 0
   IO Time:              228,306 ms
...
You can see that device sdc, as a whole, is listed first, followed by the disk partitions. In my case, the disk only has one partition (sdc1).
iostat
As I mentioned at the beginning of the article, a number of toolkits provide lots of metric information. One of the most popular is sysstat. I decided to write a simple disk script that uses one of its tools, iostat, to measure I/O statistics. The script runs iostat, records the results to a file, then reads the file and writes the filtered statistics to stdout. The reason I don’t use the iostat output directly is that I wanted to filter out a fair amount of it, and writing a script to do so was fairly easy. The script is a bit long, so I will post it soon for download (check back).
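As a rough sketch of the approach (the real script writes to and reads from an intermediate file; this sketch reads the iostat output directly), the following runs iostat once and keeps a handful of columns. The column names vary between sysstat versions, so treat the names in the wanted list as assumptions and adjust them to match the header your iostat actually prints.

#!/usr/bin/python
# Minimal sketch: run "iostat -dxk" once and print a filtered set of columns.
# Column names (rrqm/s, r/s, rkB/s, await, ...) differ between sysstat
# versions; columns that are not found are silently skipped.

import subprocess

output = subprocess.check_output(["iostat", "-dxk"]).decode()

header = None
for line in output.splitlines():
    if not line.strip():
        continue
    if line.startswith("Device"):
        header = line.split()                     # column names
        continue
    if header is None:
        continue                                  # skip the banner line before the header
    fields = line.split()
    row = dict(zip(header[1:], fields[1:]))       # map column name -> value
    wanted = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s", "await"]
    pretty = "  ".join("%s=%s" % (k, row[k]) for k in wanted if k in row)
    print("%-6s %s" % (fields[0], pretty))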
I want to grab a specific set of statistics from the system. If you are running an older copy of iostat, its output might not match my script, so I recommend upgrading to the latest version of sysstat, which is not too difficult to build and install. In the output, I print information about reads and writes (the number of operations merged and completed, and the rates for both). Listing 6 is an example of the script’s output from my desktop computer.
Listing 6: Parsed iostat Output
[laytonjb@home4 3]$ ./3test1.py
sda :
   Read reqs merged:   0.07/s    Read reqs completed:  2.34/s
   Write reqs merged:  1.42/s    Write reqs completed: 0.98/s
   Read BW:  0.05 MB/s           Write BW: 0.13 MB/s
   Avg sector size issued  108.81   Avg queue length 0.00
   Avg wait time for reqs  0.66 ms  Avg Service time for reqs 0.29 ms
sdb :
   Read reqs merged:   0.03/s    Read reqs completed:  0.02/s
   Write reqs merged:  0.00/s    Write reqs completed: 0.00/s
   Read BW:  0.00 MB/s           Write BW: 0.00 MB/s
   Avg sector size issued  9.39     Avg queue length 0.00
   Avg wait time for reqs  0.78 ms  Avg Service time for reqs 0.78 ms
sdc :
   Read reqs merged:   0.68/s    Read reqs completed:  0.59/s
   Write reqs merged:  3.76/s    Write reqs completed: 2.12/s
   Read BW:  0.01 MB/s           Write BW: 0.02 MB/s
   Avg sector size issued  25.28    Avg queue length 0.02
   Avg wait time for reqs  7.45 ms  Avg Service time for reqs 5.01 ms
sdd :
   Read reqs merged:   0.53/s    Read reqs completed:  0.08/s
   Write reqs merged:  3.78/s    Write reqs completed: 2.10/s
   Read BW:  0.00 MB/s           Write BW: 0.02 MB/s
   Avg sector size issued  23.78    Avg queue length 0.02
   Avg wait time for reqs  8.87 ms  Avg Service time for reqs 5.99 ms
md0 :
   Read reqs merged:   0.00/s    Read reqs completed:  1.86/s
   Write reqs merged:  0.00/s    Write reqs completed: 5.69/s
   Read BW:  0.01 MB/s           Write BW: 0.02 MB/s
   Avg sector size issued  9.82     Avg queue length 0.00
   Avg wait time for reqs  0.00 ms  Avg Service time for reqs 0.00 ms
Summary
Sys admins need to understand what is happening on their systems so that problems can be corrected quickly. Monitoring data helps you understand how the system is being utilized, so you can tune it or design the next-generation system to better conform to how the current system is actually used (i.e., you can justify a new system to management using real data and not just gut feelings or grandiose desires from users).
For cluster monitoring in particular, you should begin at the beginning, and that beginning is determining the critical metrics you want to monitor and what they mean to you. I tend to think of the metrics I like to watch as a pyramid-like hierarchy. The top of the pyramid is very simple: Are the nodes up or down? Can they run jobs (applications), or have they crashed?
The next layer down is understanding, at a gross level, how the cluster resources are being used. How heavily are the processors being used? How much memory is being used? How much of the network and disk resources are being used? It’s very useful to be able to say that, on average, 85% of the available processor hours were used last month.
Moving down the pyramid adds more information, and hopefully insight, into how the cluster is running. As you move further down the pyramid, you accumulate more data, requiring that you think seriously about how and where to store it, how to retrieve it, and how to make sense of it all.
To help with information overload, you need to consider how frequently you want to measure your metrics. Do you really need to know the 15-minute load average every couple of seconds? Should you be monitoring the one-minute load average every few seconds for an application that runs for 12 hours? You need to think logically about how often you need to monitor your metrics because it directly affects the amount of storage you need and the amount of data you have to manipulate to understand your cluster.
Only after you seriously consider these questions, and possibly a few more, should you start writing code. I considered these questions in the first article, so in these last two articles, I focused on some simple code for measuring the metrics I’m interested in. I used Python as the coding language because it’s easy to use, it has lots of add-on capability, and many monitoring toolkits can work with it.
From the coding examples, I hope you learned a bit about how much information Linux already provides and how easy it is to gather that information. I also hope you realize how much information these scripts can provide: most of them don't return just a single measure; rather, they report on many aspects of the system.
I intentionally focused on monitoring metrics for a single system. If I can get a handle on monitoring a single system, then I can adapt the scripts to a monitoring framework that allows me to monitor multiple systems (i.e., a cluster). In fact, that’s the subject of the next article.
The Author
Jeff Layton has been in the HPC business for almost 25 years (starting when he was 4 years old). He can be found lounging around at a nearby Fry's enjoying the coffee.