Top Top-Like Tools
The Tops
One of the first lessons I learned when I became an admin was that you don't always have a nice GUI console to servers, particularly if the server is misbehaving (i.e., not acting normally). Problems that crop up usually mean no X Window system or any other sort of GUI access to the server. Often, this also means that monitoring tools such as Ganglia [1] aren't giving you much or any information.
Typically, all you can manage is a simple SSH login, a crash cart connected to the server, or a KVM (Keyboard, Video, Mouse) connection. Moreover, most of the time in the HPC world, the compute nodes don't have a graphics card suitable for running a GUI. Therefore, you are left with a simple ASCII terminal window.
What tools can help you? Fortunately, Linux and other *nix operating systems come with some command-line tools that can help you diagnose the problems.
Interestingly, these common *nix tools have spawned the development of similar tools with added capability or slightly different features. Although the original *nix tools are really useful, many of these lookalike tools are outstanding.
If I only have terminal access to a misbehaving server, either through an SSH login or maybe a crash cart plugged into the server, the first thing I do is run the command top. In this article I want to cover what Top does and what other Top-like tools are available. Some of these tools may be familiar and some may be new, but I've found them to be very useful and sometimes wildly creative.
top
When I get a login to the server, the first tool I run is Top, because I get a quick summary of the status of the system. Let me explain with an example. Figure 1 is a screenshot of my desktop when I was running Python code test3.py (a long-running processor- and memory-intensive piece of Python code).
At the top of the image is a summary area of five lines. The first line, shown in Figure 2, presents a quick status of the system overall. The first item is the current time (10:47:03). Next come how long the system has been up (28 minutes), how many users are logged in (12 users, which in my case is just the number of open terminals), and the 1-minute, 5-minute, and 15-minute load averages.
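On Linux, the uptime and load averages in that first line are also exposed in /proc/uptime and /proc/loadavg. The following is a minimal sketch, not how Top itself is implemented, of pulling the same numbers out of the text of those files. The function names and the sample strings are my own inventions for illustration (the uptime sample is chosen to resemble the 28 minutes in the example).

```python
def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def format_uptime(text):
    """Turn /proc/uptime text (seconds up, seconds idle) into 'H:MM'."""
    seconds = int(float(text.split()[0]))
    hours, remainder = divmod(seconds, 3600)
    return f"{hours}:{remainder // 60:02d}"


# Made-up sample contents; on a live Linux system you would instead pass
# open("/proc/loadavg").read() and open("/proc/uptime").read().
sample_loadavg = "1.02 0.85 0.51 3/273 4211"
sample_uptime = "1683.42 12345.67"

print(parse_loadavg(sample_loadavg))
print(format_uptime(sample_uptime))
```

Keeping the parsing separate from the file reading makes it easy to try the functions on text captured from a remote node.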
The second summary line (Figure 3) lists the number of total tasks (273), the number of running tasks (3), the number of tasks sleeping (270), the number of tasks stopped (0), and the number of zombie tasks (0).
The third summary line (Figure 4) presents CPU information. Moving left to right, the first number here is the percent CPU time spent in userspace (%us, i.e., user applications), which is 13.4% in my example. The second number is the percent CPU time spent in the system, or kernel (0.3%sy), and the next is the percent CPU time spent running user processes that have been "niced" [2] (0.0%ni). After that, Top lists percent overall CPU time idle (86.3%id; four real cores and four hyper-threading cores on this system) followed by percent overall CPU time waiting for I/O (0.0%wa), percent overall CPU time spent servicing hardware IRQs (0.0%hi), percent time servicing soft IRQs (0.0%si), and percent overall CPU steal time (0.0%st).
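These percentages are differences between two samples of cumulative counters. As a rough sketch of the idea, assuming the counter order of the aggregate cpu line in Linux's /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal), the arithmetic looks something like the following. The function names are hypothetical, and the two sample lines are invented so that the result matches the numbers in my example.

```python
# Top's labels, in the same order as the counters on the "cpu" line.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:9]]


def cpu_percentages(before, after):
    """Convert two counter samples into percentages of the elapsed CPU time."""
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: round(100.0 * delta / total, 1)
            for name, delta in zip(FIELDS, deltas)}


# Invented samples, taken "a moment apart"; real ones come from /proc/stat.
before = parse_cpu_line("cpu  1000 0 100 8000 0 0 0 0 0 0")
after = parse_cpu_line("cpu  1134 0 103 8863 0 0 0 0 0 0")
print(cpu_percentages(before, after))
```

Because the counters cover all cores together, one core running flat out on an eight-thread system shows up as only about one eighth of the total, which is why the idle figure stays so high in my example.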
The fourth summary line (Figure 5) is devoted to physical memory statistics. Reading from left to right, the first number is the total amount of memory (32,811,624KB, or about 32GB). The second number is the amount of memory used (3,196,192KB, or about 3GB). Next is the amount of free memory (29,615,432KB, or about 29GB), and the last number is the amount of memory used by kernel buffers in the system (66,004KB, or about 66MB).
The fifth summary line (Figure 6) focuses on the swap space in the system. From left to right in my example are the total amount of swap space (1,986,000KB, or about 2GB), the amount of swap space used (0KB), the amount of swap space free (1,986,000KB, or about 2GB), and the amount of cached memory used (942,660KB, or about 1GB). Note that the cached figure, although it appears on the swap line, describes physical memory used for the page cache, not swap.
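The memory and swap lines can be reproduced from /proc/meminfo on Linux. Here is a small sketch under the assumption that "used" is simply total minus free, which is consistent with the numbers in my example but is not necessarily how every version of Top computes it. The function names are hypothetical, and the sample text is trimmed to the six fields that matter, filled in with the values from Figures 5 and 6.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (in KB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def summarize(values):
    """Build the numbers shown on Top's fourth and fifth summary lines."""
    return {
        "mem_total": values["MemTotal"],
        "mem_used": values["MemTotal"] - values["MemFree"],
        "mem_free": values["MemFree"],
        "buffers": values["Buffers"],
        "swap_total": values["SwapTotal"],
        "swap_used": values["SwapTotal"] - values["SwapFree"],
        "swap_free": values["SwapFree"],
        "cached": values["Cached"],
    }


# Trimmed sample using the values from the example screenshots.
sample = """MemTotal:       32811624 kB
MemFree:        29615432 kB
Buffers:           66004 kB
Cached:           942660 kB
SwapTotal:       1986000 kB
SwapFree:        1986000 kB
"""

for name, kilobytes in summarize(parse_meminfo(sample)).items():
    print(f"{name:>10}: {kilobytes:>10} KB")
```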
After the summary section is the process section. In this section, all of the running processes that can fit on the screen are ordered by CPU usage, from largest to smallest. The columns here correspond to the list in Table 1, which also indicates the values for the columns associated with my application, test3.py.
Table 1
Process Section of Top Output
Heading | Description | Application Value | Note |
---|---|---|---|
PID | Process identifier [3] | 4211 | |
USER | Username of the owner of the process | laytonjb | |
PR | Priority of the process | 20 | |
NI | Nice value of the process | 0 | That is, the application isn't niced |
VIRT | Amount of virtual memory used by the process | 1405m | 1,405MB |
RES | Amount of physical memory used by the process | 1.0g | 1.0GB |
SHR | Shared memory used by the process | 12m | 12MB |
S | Status of process | R | S = sleeping, R = running, Z = zombie |
%CPU | Percent CPU being used by the process on a per-CPU basis | 100.0% | That is, it is using 100% of the core |
%MEM | Percent memory used by the process | 3.3% | Percentage of the total memory available |
TIME+ | Total CPU time used by the process | 0:51.77 | 51.77 seconds, meaning the application had just started |
COMMAND | Name of the process | test3.py | |
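Most of the columns in Table 1 can be traced back to a single line of text in /proc/<pid>/stat on Linux. The sketch below shows one way to pick a few of them out, using the field positions documented in the proc man page. The function name is hypothetical, the defaults of 100 clock ticks per second and 4,096-byte pages are assumptions (os.sysconf reports the real values), and the sample line is invented so that it reproduces the values in the table.

```python
def parse_pid_stat(text, clk_tck=100, page_size=4096):
    """Pull a few top-style columns out of the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid_text, _, command = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state) in proc man page terms
    cpu_ticks = int(rest[11]) + int(rest[12])  # utime + stime
    return {
        "PID": int(pid_text),
        "COMMAND": command,
        "S": rest[0],
        "PR": int(rest[15]),
        "NI": int(rest[16]),
        "VIRT_MB": int(rest[20]) // (1024 * 1024),             # vsize is in bytes
        "RES_MB": int(rest[21]) * page_size // (1024 * 1024),  # rss is in pages
        "TIME_S": cpu_ticks / clk_tck,
    }


# Invented sample line; a real one comes from open(f"/proc/{pid}/stat").read().
SAMPLE = ("4211 (test3.py) R 4000 4211 4000 34816 4211 4194304 1000 0 0 0 "
          "5100 77 0 0 20 0 1 0 12345 1473249280 262144")

print(parse_pid_stat(SAMPLE))
```

The %CPU and %MEM columns are not stored anywhere directly. They have to be derived by comparing CPU ticks between two samples and by dividing resident memory by total memory.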
Top outputs a great deal of information in a small amount of space, which is exactly why I use it: to get a quick overview of what is happening on the server. In this particular case, nothing critical is happening, because the server isn't misbehaving.
In the case of a misbehaving server or application, I determine the state with a quick look at a few key values, such as the load average on the top line to see if the load is much higher than expected. If the load is greater than the number of cores, you might suspect that an application is running away or perhaps swapping.
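The load-versus-cores comparison is easy to script if you want to check many nodes. This is only a sketch of the rule of thumb above: the function names and the threshold of 1.0 load per core are my own assumptions, and os.cpu_count() counts hyper-threading cores as well as real ones.

```python
import os


def load_per_core(load, cores):
    """Normalize a load average by the number of cores."""
    return load / cores


def is_overloaded(load, cores, threshold=1.0):
    """Apply the rule of thumb: a load above the core count is suspicious."""
    return load_per_core(load, cores) > threshold


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    try:
        one_minute = os.getloadavg()[0]
    except (OSError, AttributeError):
        # Load averages are not available on every platform.
        one_minute = None
    if one_minute is not None:
        print(f"1-minute load {one_minute:.2f} on {cores} cores:",
              "overloaded" if is_overloaded(one_minute, cores) else "ok")
```

A high load by itself does not say why the node is busy, so treat it as a prompt to look at the remaining summary lines.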
On the second line, I look for zombie processes (never a good thing), and I look at the number of applications (total tasks). If the data on that line looks logical, you probably don't need to worry too much.
On the third line, I examine the system (%sy), nice (%ni), and I/O wait time (%wa) percentages to see if the system is having issues. In particular, I'm looking at the I/O wait time. If it is too high, it means the application is waiting on I/O and that the I/O is possibly a bottleneck. At the same time, I like to watch the system CPU percentage. If it is larger than normal, I know something is going on with the system. For example, if the node starts swapping, this CPU percentage will go up.
On the fifth summary line, the metric I focus on is the used swap space. A high number indicates that the system is swapping, which can be a root cause of a system running really slowly. I also like to look at the amount of free memory in the fourth line. If this number is really low and the buffer and cached numbers are low, the system might be running out of memory.
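The two memory checks I just described can be written down as a couple of comparisons. Treat the following as a sketch only: the function name, the dictionary keys, and both thresholds (a quarter of swap in use, less than 5% of memory free or reclaimable) are arbitrary assumptions of mine, not values that Top defines.

```python
def memory_warnings(mem_kb, swap_fraction=0.25, low_fraction=0.05):
    """Return a list of warnings for top-style memory figures given in KB."""
    warnings = []
    # Heavy swap use is a common cause of a node crawling.
    if mem_kb["swap_total"] > 0:
        if mem_kb["swap_used"] / mem_kb["swap_total"] > swap_fraction:
            warnings.append("heavy swap use")
    # Low free memory matters most when buffers and cache are low as well,
    # because then there is little left for the kernel to reclaim.
    available = mem_kb["free"] + mem_kb["buffers"] + mem_kb["cached"]
    if available < low_fraction * mem_kb["total"]:
        warnings.append("free, buffer, and cached memory all low")
    return warnings


# The healthy desktop from the example screenshots.
healthy = {"total": 32811624, "free": 29615432, "buffers": 66004,
           "cached": 942660, "swap_total": 1986000, "swap_used": 0}
print(memory_warnings(healthy))
```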
In the process section, I look first at the top few applications to see whether any system applications are near the top, perhaps indicating a problem. I also like to look at the %CPU and %MEM for all user applications.
I've used Top enough that I can scan these numbers and quickly note potential sources of problems. Top has saved my bacon more than once.
htop
People have created various versions of Top, and one of the better ones is called htop [4]. Htop is a bit more interactive than Top, but it provides very similar information. The screenshot in Figure 7 is htop running on my desktop while running the test3.py code.
Htop uses ncurses [5] for the interface but reads the data from /proc as Top does.
Building htop is fairly easy. I downloaded the latest version (1.0.3) from the htop web page and then followed the usual rules of ./configure; make; make install. Htop installs by default in /usr/local/bin; make sure this is in your path if you want to use htop without specifying the entire path to the command.
Htop has some advantages relative to Top. The website [6] lists the following differences:
- In htop, you can scroll the process list vertically and horizontally to see all processes and full command lines.
- In Top, you are subject to a delay for each unassigned key you press (especially annoying when multikey escape sequences are triggered by accident).
- Htop starts faster, whereas Top seems to collect data for a while before displaying anything.
- In htop, you don't need to type the process number to kill a process – in Top, you do.
- In htop, you don't need to type the process number or the priority value to re-nice a process – in Top, you do.
- In htop, you can kill multiple processes at once.
- Top is older, hence more used and tested.
I've found htop to be more useful in filtering and displaying details of what is running than Top. Figure 8 shows what happens when you hit F2 after htop starts. By using the arrow keys, you can add and remove certain metrics (Meters) from the display.
Another feature I use often is the filter option. By pressing F3, you can filter processes by UID. In Figure 9, I filtered the processes by laytonjb. The output is a nice tree of processes that you can collapse by pressing F6, as shown in Figure 10.
Htop gives you much the same information as Top, but it is more flexible, allowing you to customize your view of what's happening on the system. You can also do many things from within htop without having to open another shell or leave the tool. For example, if I see a process that I need to kill, I can do that from within htop. If the process can't be killed or shouldn't be killed, I can just nice the process, allowing other processes to have a higher priority.
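Renicing comes down to changing a process's nice value, which Python exposes on Unix-like systems through os.setpriority. The sketch below is a hypothetical helper of my own, not something htop provides, and it only demonstrates the call on the script's own process. Raising a nice value (lowering priority) normally works for your own processes, whereas lowering it usually requires root.

```python
import os


def renice(pid, nice_value):
    """Set a process's nice value and return the value now in effect.

    Higher nice values mean lower scheduling priority; 19 is the usual maximum.
    """
    os.setpriority(os.PRIO_PROCESS, pid, nice_value)
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    me = os.getpid()
    current = os.getpriority(os.PRIO_PROCESS, me)
    # Be one step "nicer" than before, without exceeding the maximum.
    print("nice value is now", renice(me, min(current + 1, 19)))
```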
atop
Another really good effort in creating an enhanced Top tool is called atop [7]. Although it retains Top's concept of a system summary at the top and processes listed below, it rearranges the interface a bit and makes some additions. These extra items really make it more of a general ASCII monitoring tool than a Top-like tool. Nonetheless, I consider it to be in the Top family.
As with many tools, you can use lots of options with atop (Figure 11), but by default, you get the split-screen view. The top portion summarizes the state of the system, and the lower half lists the processes.
The summary at the top covers processor, memory, disk, and network information. Atop adjusts the information according to the screen size, which can be a little disconcerting, because some information might not be where you expect it if the window size changes; however, it is very handy because it gives you useful information regardless of screen size (within reason).
One place I've found atop to be very useful is if you are using low-resolution displays to log in to a node. For example, I've used my seven-inch tablet to SSH into misbehaving servers, and I've found atop to be extremely handy, because I can get lots of useful information in a small screen – although the silly on-screen keyboard drives me nuts, requiring me to use a Bluetooth keyboard.
I won't cover the screen information of atop in detail because there is a lot of it. I do recommend reading the man page [8] for more details. When I started using atop, I was a little overwhelmed by the man page, so I just ran a variety of applications and used atop to monitor them. In this way, I was able to get a feel for what aspects I naturally watched for various types of applications (e.g., an application with lots of I/O, lots of network communication, or lots of memory bandwidth). This type of experimentation helps you develop habits with any given tool.
Atop has several options you can use to change the information that appears in the bottom half of the screen. The atop page [9] has some screenshots that show what happens when you invoke some options. For example, you can use the s key to look at scheduling details, the m key to see memory usage, d to look at disk usage, v to look at the "variable" information, c to show the command line for various processes, p to see accumulated information on a per-process basis (e.g., CPU consumption, memory consumption), u to see system information on a per-user basis (one of my favorites), and n to see network information on a per-process or per-thread basis if you use an optional module called netatop.