GNU Parallel
GNU Parallel brings the benefits of multicore processing to the command line.
Recent trends in computing are toward more cores doing more tasks at once. These days, you are likely to have a dual- or quad-core CPU in your laptop, and perhaps 4, 6, 12, or 16 cores in your server and desktop machines.
With all these CPU cores floating around your home network, it might seem hard for you to make the most effective use of your machines at the command line. After all, the command line normally processes things in serial – one command at a time.
To help take advantage of modern multicore machines, CPU-intensive programs have been modified so they can run many subtasks in parallel. For example, instead of using bzip2 to compress a tarball, pbzip2 allows you to use many cores at once for the compression. Likewise, video and audio encoding can normally take advantage of more than one CPU core during the encode process.
GNU parallel [1] helps you take advantage of the CPU cores on your local machine, and you can even distribute the load to many machines on the network. Remote command execution uses SSH to get access to other machines, and Control Master connections [2] can help you avoid frequently spawning new SSH connections.
If you are already familiar with the xargs command [3], you can use GNU parallel as a drop-in replacement and start executing many things at once right away.
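For example, to compress a batch of files (the *.log pattern here is just a placeholder), switching from xargs to parallel is nothing more than a change of command name:

find . -name '*.log' -print | xargs gzip

find . -name '*.log' -print | parallel gzip

The xargs version compresses the files one after another; the parallel version can compress several at once.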
Surprisingly, the parallel command was not packaged for Fedora 13 or the Fedora Rawhide development version at the time of writing this article. Luckily, parallel itself is a Perl script, and it can be installed easily with the source tarball and the ./configure; make; sudo make install command sequence. Documentation is available in both HTML format and as a man page.
Getting into It …
At its core, GNU Parallel is about executing programs on files. In the most common usage, parallel takes a list of files as its standard input and a command to execute on each file.
To use Parallel this way, you normally run the find command to get the list of files, perhaps filtering with find's options to remove the files you don't want, and then pipe that file list to Parallel. Parallel takes a file name from the list and executes the command on it, then moves to the next file name and executes the command on that file, and so on.
Because some commands are very quick to execute, it can take just as long or longer to start the program as it takes for the program to process the file you pass it. To make such cases faster, you can pass the -m command line option to Parallel to tell it that the command can accept multiple file names at the same time. When using -m, Parallel will gather up multiple file names and pass them to your command.
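As a quick sketch, again with placeholder *.log files, the following starts gzip with a batch of file names per invocation rather than one file per invocation:

find . -name '*.log' -print | parallel -m gzip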
If you run Parallel with many, many file names, at some stage the command line will become too long for your system to process. Before that happens, Parallel executes the command and then starts over with a new invocation, building the command line again. Therefore, for 40,000 files, your program might execute four times, each time processing 10,000 file names. To see some of these limits for your Linux installation, run parallel with the --show-limits option.
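Redirecting standard input from /dev/null makes parallel print its limits and exit immediately, instead of waiting for a list of files:

parallel --show-limits < /dev/null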
Instead of taking a list of files on standard input, Parallel can also take a list of complete commands to run. In this way, you can use Parallel to execute many commands at once. To do so, just put all the commands in a file and send it to parallel, as shown in Listing 1.
Listing 1: Running Many Commands at Once
01 $ cat commands
02 date
03 hostname
04 echo foo | md5sum
05
06 $ cat commands | parallel
07 Thu Jul 8 10:37:20 EST 2010
08 desktop-machine
09 d3b07384d113edec49eaa6238ad5ff00 -
One reason to use Parallel in this way is that the output of each command is grouped, so multiple commands do not mix their output streams together. For more information, see the grouping options later in this article.
The most basic and common use of parallel is shown in Listing 2.
Listing 2: Executing Parallel on All Files
01 $ find . -type f -print | parallel echo {}
02 ./xterm_32x32.xpm
03 ./xterm_48x48.xpm
04 ./xterm-color_32x32.xpm
05 ./xterm-color_48x48.xpm
06 ./xsane.xpm
To begin, the find command is used to locate all the files in which you are interested, then the result is piped to the parallel command to do something with each file.
Using Many Cores at One Time
For testing, I used an Intel Q6600 quad core on the desktop and an AMD 620 quad core as the server machine.
The whole point of parallel is to execute many of these commands at once so that your system remains loaded and ultimately finishes the work sooner. The --jobs/-j option tells Parallel to run multiple tasks at once. It takes an explicit number of tasks to run simultaneously, a number with a leading + or - sign to use as an offset from the number of CPU cores available, or a percentage of CPU cores to use. For example,
parallel -j +0
tells parallel to try to run exactly the number of tasks as there are CPU cores in the machine. Using
parallel -j 50%
will only half-load the machine, even if the hardware is changed to something with six cores instead of four. If you do not supply a -j option, the default of nine jobs will be run at once.
As an example, if you have five files in the current directory and add the directory itself to the output of find to get six entries, the first command in Listing 3 will pause for about one second and then print all of its output before completing, because the six entries are fewer than the default of nine jobs allowed to start at once.
Listing 3: Processing the Files in Parallel
01 $ find . | parallel -v 'echo $(ls -1 /proc/self/task); sleep 1s; echo {}'
02 echo $(ls -1 /proc/self/task); sleep 1s; echo .
03 10943
04 .
05 ...
06 $ find . | parallel -v -j +0 'echo $(ls -1 /proc/self/task); sleep 1s; echo {}'
07 ?
The second command will pause for a second, print the first four files, then pause for another second before printing the remaining two. The delays come from the sleep instruction, which is included purely for illustration.
The for loop written in the shell script at the top of Listing 4 will recompress every gzip file in the directory into a bzip2 file.
Listing 4: Recompressing Files in Parallel
01 $ time for if in *gz;
02 do
03 zcat $if | bzip2 -9 > $if.bz2;
04 done
05 real 0m27.005s
06 user 0m11.745s
07 sys 0m14.623s
08
09 $ time find . -name "*.gz" -print | parallel -j +0 'zcat {} | bzip2 -9 > {.}.bz2'
10
11
12 real 0m14.921s
13 user 0m15.955s
14 sys 0m24.478s
The parallel command shown in line 9 of Listing 4 does the same thing but will execute four bzip2 processes at once. Although I have four CPU cores, I only get closer to twice the overall speed advantage by using Parallel in this case. In the parallel command, I have used the substitution brackets, {}, to insert the file name twice. The second time, the use of {.} means that Parallel will remove the file name extension before substituting it, making this quite a clean way to trim off the .gz postfix and append .bz2.
While you are getting to know Parallel, you might want to use the --verbose/-t and -v options, which print each command to standard error (-t) or standard output (-v) before executing it. Of course, echo is also your friend here, so you can make the command not actually perform anything, just in case the result is not what you expected.
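For example, prefixing echo turns a potentially destructive command into a harmless preview of what would run (rm and the *.bak pattern are only placeholders):

find . -name '*.bak' -print | parallel echo rm {}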
If you want to see what is happening along the way, the --progress option will show you how many tasks are being executed and where. If you forget to specify --progress and want to see what is happening, send a SIGUSR2 to Parallel while it is running to turn on progress reporting. The --eta option implies --progress and gives an “estimated time of arrival,” indicating approximately when all the tasks will complete.
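A simple way to watch --eta at work is a batch of sleep jobs of varying lengths; the job count here is arbitrary:

seq 1 10 | parallel --eta 'sleep {}'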
When executing multiple commands in Parallel, each task might output information on its standard output and error streams during the course of its execution. Normally this output would just be mixed together, presenting the information on the terminal all jumbled, with no way to know which task each line belongs to.
The --group/-g option tells Parallel to collect the standard output and error streams from each task and only print them when the task has completed. In this way, the streams are not mixed together; each task's output is presented in a contiguous block on Parallel's output.
When you start more than one job with the -j option, it is reasonable to assume that each job might not take exactly the same amount of time to complete. If you care about seeing the output in the order that file names were presented to Parallel (instead of when they completed), use the --keeporder option.
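The following sketch shows --keeporder, assuming Parallel runs each job under a Bash-compatible shell; the random sleep merely makes the completion order unpredictable:

seq 1 8 | parallel -j 4 --keeporder 'sleep $((RANDOM % 3)); echo {}'

Despite the scrambled completion times, the numbers 1 through 8 are printed in order.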
With the --number-of-cpus and --number-of-cores command-line options, you can see how many physical CPUs and CPU cores Parallel thinks your machine has.
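Both options simply print a number and exit:

parallel --number-of-cpus
parallel --number-of-cores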
If you are embedding the substitution into another string to form a file name, you might want to use -X instead of -m. As you can see in Listing 5, the use of -X causes the text that surrounds the {} to be repeated for each substitution, whereas with -m, all the file names are substituted and the xxx prefix and yyy postfix are used only once.
Listing 5: Multiple Files with Embedded { }
01 $ find . -type f -print | parallel -m echo xxx{}yyy
02 xxx./xterm_32x32.xpm ./xterm_48x48.xpm
03 ./xterm-color_32x32.xpm ./xterm-color_48x48.xpm ./xsane.xpmyyy
04
05 $ find . -type f -print | parallel -X echo xxx{}yyy
06 xxx./xterm_32x32.xpmyyy xxx./xterm_48x48.xpmyyy
07 xxx./xterm-color_32x32.xpmyyy xxx./xterm-color_48x48.xpmyyy xxx./xsane.xpmyyy
Assimilate Your Network Too …
Although compression and transcoding can often be parallelized on a single machine with a specially written binary, such as pbzip2, Parallel's capabilities let you send tasks over the network too.
GNU Parallel handles remote host connections and authentication with SSH. Each machine you want to connect to is specified with the --sshlogin/-S option, of which there can be many.
The argument to --sshlogin can be as simple as the host to connect to, optionally prefixed with the number of jobs to send to that machine (n/host). Of course, you can also include complex SSH invocations or use another SSH tool if you like.
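For example, to allow up to eight simultaneous jobs on the serverer machine from my test network and two on the local host (the special sshlogin : means the local machine, no SSH involved):

seq 1 20 | parallel -S 8/serverer,2/: 'echo {} $(hostname)'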
If your network is reasonably static, you might want to specify these hosts in a file and use --sshloginfile to select groups of hosts for processing via a single argument. GNU Parallel reads default command-line options from the environment variable PARALLEL and the file ~/.parallelrc, so you might want to nominate some workhorse hosts in your .parallelrc to keep command lines short.
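As a minimal sketch with a hypothetical hosts.txt, where each line of the file is one sshlogin argument:

$ cat hosts.txt
8/serverer
:
$ seq 1 10 | parallel --sshloginfile hosts.txt 'echo {} $(hostname)'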
When working with a collection of hosts, the question of where data comes from and where it goes becomes a little more complex. If you have a shared NFS filesystem that all hosts can access at the same path, there is no problem. However, what if you want to recompress your /var/log directory on a given server?
To handle files that are only available on the local host, parallel has options to --transfer files to the remote host, --return a file from the remote host, and --cleanup some files on the remote host after processing.
The --transfer option will transfer the files that {} will be substituted with to the remote host. If you want to transfer additional files, use --basefile/-B, which lets you explicitly nominate the files to send. You can have many --basefile arguments.
The --cleanup option will remove files sent for all --transfer and --basefile options. The --basefile option can be handy if you want to transfer a small script that is run on every file, instead of specifying all of the commands explicitly when running Parallel.
Note that a --basefile is only transferred to the remote host once for the whole Parallel invocation, so if you have a large dataset in CSV that your parallel script needs, it will only be sent over the network once, even if you process 1,000 files using it.
Because the transfer, return, and cleanup options are likely to be used together, you also have the --trc option, which transfers, returns, and cleans up in one go. The --trc option takes a single argument: the name of the file to return from the remote host.
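For example, a sketch of the recompression job from Listing 4, run on the serverer machine, with each source file transferred out, the .bz2 result returned, and both cleaned up afterward:

find . -name '*.gz' -print | parallel -S serverer --trc {.}.bz2 'zcat {} | bzip2 -9 > {.}.bz2'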
To get things rolling, I’ll look at an example that executes on four local and four remote CPU cores, as shown in Listing 6.
Listing 6: Eight Cores; Four on the Local Machine
01 $ seq 1 10 \
02     | parallel -j +0 --sshlogin :,serverer \
03     'echo {} $(hostname) '
04 5 desktop
05 6 desktop
06 7 desktop
07 8 desktop
08 10 desktop
09 9 desktop
10 1 serverer
11 2 serverer
12 3 serverer
13 4 serverer
The command processes the numbers from 1 to 10 inclusive, running as many jobs on each machine as there are CPU cores. The --sshlogin option tells Parallel to use localhost (:) and the serverer machine for execution; the command echoes each number back and reports the host it was executed on.
The ~/.parallelrc file has a slightly terse format, in that each "group" of options has to go on its own line. I also had a bit of trouble getting the long command-line options to work, so I wound up using -S:,serverer instead of --sshlogin. The resulting file and the more compact parallel command are shown in Listing 7.
Listing 7: A More Compact Command
01 $ cat ~/.parallelrc
02 -j+0
03 -S:,serverer
04 $ seq 1 10 | parallel 'echo {} $(hostname) '
05 5 desktop
06 ...
A word of caution when using --transfer or --trc: relative file names are handled relative to your home directory. So if you are in /tmp/foo and operate on ./bar.txt, and Parallel sends this file to server zz, then you will be processing ~/bar.txt on zz.
What makes things worse for the unwary is that files are clobbered by default. Therefore, it is much better to use dedicated users on the remote machines, created just to service GNU Parallel jobs. That way, nothing important lives in those users' home directories, and you will not lose anything by accident.
It would be nice to be able to tell Parallel to always use a temporary directory of your choosing, such as /tmp/parallel-sourcehost-sourcepid, as the base directory on remote hosts. That would not only protect the home directory but also place scratch files on a filesystem that the system administrator probably set up explicitly for that purpose.
Also, you might add --controlmaster/-M to your command line to avoid the overhead of creating a new SSH connection for each job run on remote hosts.
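For example, assuming the version of Parallel described in this article, where -M selects SSH's ControlMaster feature:

seq 1 10 | parallel -M -S serverer 'echo {} $(hostname)'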
Wrap Up
GNU parallel makes processing multiple files at once on a multicore machine easy. Having the output, if any, grouped so that each process doesn't mix its output with the others takes the headache out of running tasks that produce output as they run.
When distributing to machines on the network, --trc makes simple send, process, and send-back batch commands easy to formulate. But consider whether you are transmitting large files to the remote system only to execute non-CPU-intensive operations on the data and return it. For example, when calculating MD5 checksums for files, you are likely to saturate the disk I/O using only a quad-core local desktop machine.
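For example, this local run will usually be disk-bound rather than CPU-bound:

find . -type f -print | parallel -j +0 md5sum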
Info
[1] GNU parallel homepage:
http://savannah.gnu.org/projects/parallel
[2] ssh_config manual (see the ControlMaster option):
http://www.openbsd.org/cgi-bin/man.cgi?query=ssh_config
[3] GNU xargs:
http://www.gnu.org/software/findutils/manual/html_node/find_html/Invoking-xargs.html
[4] GNU pexec homepage:
http://www.gnu.org/software/pexec/
Author
Ben Martin has been working on filesystems for more than 10 years. He completed his doctorate and now offers consulting services focused on the libferris suite, filesystems in general, and Qt/C++ development.