Command-line tools for the HPC administrator
Line Items
whereis and which
The $PATH variable in Linux and *nix lists the directories that the shell searches, in order, when looking for a command. If you run the command voodoo and the result is an error message like can't find voodoo, but you know it is installed on your system, you might have a $PATH problem.
You could look at your $PATH variable with the env command, but I like to use the simple whereis command, which searches $PATH along with a set of standard system directories and reports where a command's binary, source, and man pages are located. For example, when I looked for perl (Figure 3), the output told me where the man pages were located, as well as the binary.
Think about a situation in which your $PATH is munged, and all of a sudden, you can't run simple commands. An easy way to discover the problem is to use whereis. If the command is not in your $PATH, you can then use find to locate it, provided it's on the system.
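As a rough illustration of this kind of check, the sketch below prints each $PATH entry on its own line and reports whether a command resolves through the current search path. The function names path_entries and in_path are made up for this example, and voodoo is the placeholder command from the text.

```shell
# Print each $PATH entry on its own line so a bad or missing directory stands out.
path_entries() {
    printf '%s\n' "$PATH" | tr ':' '\n'
}

# Report whether a command can be resolved by the current shell.
# command -v is specified by POSIX; on success it prints how the name resolves.
in_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found in \$PATH"
        return 1
    fi
}

path_entries
in_path ls
in_path voodoo || echo "next step: whereis voodoo, or find / -name voodoo -type f 2>/dev/null"
```

The find command is only printed as a suggestion here, because a search from / can take a long time on a large filesystem.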
Another useful command, which, is very helpful for determining which version of a command will run when you execute it, because it reports the first match found in $PATH. For example, assume you have more than one GCC compiler on your system. How do you know which one will be used? The simple way is to use which, as shown in Figure 4.
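To see why one GCC wins over another, it can help to list every match in search order. Many which implementations accept an -a option for this; the sketch below does the same walk by hand in plain POSIX shell, so the lookup order is explicit. The function name all_matches is hypothetical.

```shell
# List every executable named $1 found in $PATH, in search order.
# The first line printed is the one the shell would normally run.
all_matches() (
    set -f          # do not glob-expand $PATH entries
    IFS=:           # split $PATH on colons (the subshell keeps this change local)
    for dir in $PATH; do
        candidate="${dir:-.}/$1"   # an empty entry means the current directory
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
)

all_matches gcc
```

If the first line is not the compiler you expected, the order of the directories in $PATH is the place to look.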
I use which quite a bit when I create new modules for lmod. On more than one occasion, I have damaged my $PATH so that the command for which I'm writing a module isn't in the $PATH variable. If which cannot find the command after the module is loaded, I know I managed to munge something in the module.
I promise you that if you are a system administrator for any kind of *nix system, HPC or otherwise, at some point, whereis and which are going to help you solve a problem. My favorite war story is about a user who managed to erase their $PATH completely on a cluster and could do nothing. The problem was in the user's .bashrc file, where they had overwritten $PATH in an attempt to add a new path.
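The mistake in that story usually comes down to assigning to $PATH instead of extending it. The following is a minimal sketch of a safer pattern for a .bashrc file; the directory /opt/newtool/bin is a hypothetical example, not something from the original incident.

```shell
# Hypothetical directory used only for illustration.
new_dir="/opt/newtool/bin"

# Wrong: this replaces the whole search path, so ls, cat, and friends are no longer found.
#   export PATH="$new_dir"

# Safer: keep the existing entries and append only if the directory is not already listed.
case ":$PATH:" in
    *":$new_dir:"*) ;;                  # already present, nothing to do
    *) export PATH="$PATH:$new_dir" ;;
esac
```

Even with a broken $PATH, commands can still be run by their absolute path, which is often enough to open .bashrc in an editor and fix the line.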
lsblk
When I get on a new system, one of the first things I want to know is how the storage is laid out. Also, in the wake of a filesystem issue (e.g., it's not mounted), I want a tool to discover the problem. The simple lsblk command can help in both cases.
As the name suggests, lsblk (ls plus blk) will "list all block devices" on the system (Figure 5). This is not the same as listing all mounted filesystems, which is accomplished with the mount command (which lists network filesystems, as well).
The default "tree" output shows the partitions of each block device. The block device sizes, in human-readable format, are also shown, as are their mount points (if applicable). A useful option is -f, which adds filesystem information, such as the filesystem type, label, and UUID, to the lsblk output (Figure 6).
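For the "it's not mounted" case, one possible approach is to ask lsblk for key="value" pairs (-P) and filter for devices that carry a filesystem but have no mount point. The sketch below assumes the util-linux lsblk with its NAME, SIZE, TYPE, FSTYPE, and MOUNTPOINT columns; the helper name unmounted_filesystems is made up for this example.

```shell
# Filter "lsblk -P" output down to devices that have a filesystem but no mount point.
# Note: members of LVM or RAID sets also show up here, because they are never mounted directly.
unmounted_filesystems() {
    grep -v 'FSTYPE=""' | grep 'MOUNTPOINT=""'
}

if command -v lsblk >/dev/null 2>&1; then
    # Pick the columns explicitly instead of relying on the default set.
    lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT || echo "lsblk could not read the block devices"
    lsblk -P -o NAME,FSTYPE,MOUNTPOINT | unmounted_filesystems || echo "no unmounted filesystems found"
fi
```

Anything the filter prints is a candidate for a missing /etc/fstab entry or a failed mount.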
kill
Sometime in your administrative career, you will have to use the kill [12] command, which sends a signal to a process; by default, the signal is SIGTERM, which asks the process to terminate. In fact, you can send a host of signals to applications (Table 2). These signals can accomplish a number of objectives with applications, but the most useful is SIGKILL.
Table 2
Process Signals
SIGHUP | SIGUSR2 | SIGURG |
SIGINT | SIGPIPE | SIGXCPU |
SIGQUIT | SIGALRM | SIGXFSZ |
SIGILL | SIGTERM | SIGVTALRM |
SIGTRAP | SIGSTKFLT | SIGPROF |
SIGABRT | SIGCHLD | SIGWINCH |
SIGIOT | SIGCONT | SIGIO and SIGPOLL |
SIGFPE | SIGSTOP | SIGPWR |
SIGKILL | SIGTSTP | SIGSYS |
SIGUSR1 | SIGTTIN | |
SIGSEGV | SIGTTOU | |
I call SIGKILL the "extreme prejudice" option. If you have a process that just will not die, it's time to use SIGKILL:
$ kill -9 [PID]
Theoretically, this should end the process specified, but if for some crazy reason the process won't die (for example, because it is stuck in an uninterruptible I/O wait), and you need it to die, the only other action I know to take is to shut down the system. Many times this can result in a compromised configuration when the system is restarted, but you might not have much choice.
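Because SIGKILL gives a process no chance to clean up, a common pattern is to try SIGTERM first and escalate only if the process is still around after a grace period. The following is a minimal sketch of that idea, not a replacement for kill -9; the function name terminate_then_kill and the five-second default are assumptions made for this example.

```shell
# Ask a process to exit with SIGTERM, wait up to $2 seconds (default 5),
# and only then fall back to SIGKILL.
terminate_then_kill() {
    pid=$1
    grace=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0      # already gone, or not ours to signal
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0     # signal 0 only checks that the PID exists
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null                  # the "extreme prejudice" option
    return 0
}
```

A call such as terminate_then_kill 12345 10 (with a made-up PID) would give the process 10 seconds to exit on its own before SIGKILL is sent.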
As with whereis and which, I can promise that you will have to use kill -9 to stop a process. Sometimes, the problem is the result of a wayward user process, and one way to find that process is to use the commands mentioned in this article. For example, you can use the watch command to monitor the load on the system. If the system is supposed to be idle but watch -n 1 uptime shows a reasonably high load, then you might have a hung process taking up resources. Also, you can use watch in a script to find user processes that are still running on a node that isn't accessible to users (i.e., it has been taken out of production). In either case, you can then use kill -9 to end the process(es).
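Before reaching for kill -9, you need the PID of the offending process. One way to find leftover user processes on a node is to list all processes sorted by CPU usage and keep only those owned by regular users. The sketch below assumes the procps version of ps (for --sort) and treats UIDs of 1000 and above as regular users, which is a common default but not universal; the function name user_processes is hypothetical.

```shell
# Keep only processes whose numeric UID (second column) is at or above a threshold.
# 1000 is an assumption: many distributions start regular user accounts there.
# The header line is skipped.
user_processes() {
    awk -v min="${1:-1000}" 'NR > 1 && $2 >= min'
}

if command -v ps >/dev/null 2>&1; then
    # Highest CPU consumers first; the first column is the PID to hand to kill.
    ps -eo pid,uid,user,stat,pcpu,etime,comm --sort=-pcpu | user_processes
fi
```

Accounts such as nobody often have a very high UID and will pass this filter, so check the user column before killing anything.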