Parallel and Encrypted Compression
Compression tools are underutilized. I base this statement on the observation that data, once created, is seldom accessed again. To me, this seems like a great opportunity to use compression tools to reduce the used storage capacity and perhaps save a little money.
In a previous article, I presented some tried and true compression tools (e.g., gzip) and some newer ones (e.g., zstd). Most were serial tools, yet with all the wonderful cores available, where are the parallel compression tools?
While researching compression tools, I discovered that some of them are capable of encryption. In this article, I want to mention briefly which tools can encrypt. Disclaimer: I’m not a security expert, so I can’t judge the strength of the encryption.
In this article, I also offer some advice with respect to compression, because I’m sure some readers will be asking, “What is the best tool?” (perhaps without defining “best”). My advice is based purely on my experience, so I have no test data to back it up.
Multithreaded (Parallel) Compressors
If you have ever compressed a large file, you know that it can take a long time to complete. Besides the speed of the compression algorithm itself, a major reason is that many compression utilities use only one thread (serial). Fortunately, many of these utilities have become multithreaded.
In this article, I use the term “parallel” to indicate that the application is multithreaded but not multinode; that is, it is parallel within a single symmetric multiprocessing (SMP) node. As you will see, though, one compression utility uses the Message Passing Interface (MPI) standard, so it’s parallel across distributed nodes. I will specifically call out this tool.
The general approach for any of the multithreaded utilities is to break the file into chunks, each compressed by a single thread. In general, when all chunks are finished, they are concatenated into the final compressed file in the file format of the compression tool.
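The chunk-and-concatenate approach can be sketched in a few lines of Python with only the standard library. This is a minimal illustration, not how pigz or any other tool is actually implemented: the function name parallel_gzip and the 128KB CHUNK_SIZE are hypothetical choices made for the example. It relies on the fact that concatenated gzip members form a valid gzip stream. In CPython, the underlying zlib calls generally release the global interpreter lock, so the threads can overlap.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # hypothetical chunk size; real tools choose their own


def parallel_gzip(data: bytes, threads: int = 4, level: int = 9) -> bytes:
    """Compress fixed-size chunks in worker threads and join the results."""
    # Split the input into chunks; an empty input still yields one empty chunk.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order, so members line up with chunks.
        members = list(pool.map(lambda c: gzip.compress(c, compresslevel=level), chunks))
    # Concatenated gzip members form a valid multi-member gzip stream.
    return b"".join(members)
```

The result can be decompressed by any gzip-compatible reader, for example gzip.decompress() in Python or the serial gunzip on a file containing these bytes.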
Parallel Compression Utilities
Several of the compression utilities discussed in the previous article are themselves multithreaded, whereas others have separate threaded versions of the corresponding serial tools. In a couple of cases, more than one threaded tool exists for a serial tool. In both categories, the tools are compatible with the file formats of the serial tools. In the following discussion, I organize the tools by the file format used. For example, gzip and bzip2 are each specific categories.
Parallel gzip
The tools mentioned here are capable of parallel (multithreaded) compression with the gzip file format. This means the serial gzip is fully capable of decompressing files compressed with the multithreaded tools.
pigz
Probably one of the most popular multithreaded compression utilities is pigz, a threaded implementation of gzip that has saved my bacon, pun intended, on more than one occasion, because serial compression would take too long.
The pigz utility has options very similar to gzip, but I haven’t really exercised all of them. One option particular to pigz is -p # (where # is the number of threads). Without this option, pigz tries to use all of the available processors. For example, in the command
$ pigz -9 -v -p 8 file.dat
the flags -9 and -v indicate maximum compression and verbose output.
Unfortunately, the decompression utility, unpigz, is not parallelized. Keep this in mind if decompression time is a factor. In addition to unpigz, you can decompress a gzip-compressed file with gunzip.
I tried another gzip-compatible tool, pgzip, which is supposed to have both threaded compression and decompression capabilities. It is written in Go and appears to be just a library of routines that you use in your own Go code. Given that I know zero Go language, I tried a few quick hacks, but I could never get anything to work correctly.
A third multithreaded gzip-compatible tool is pugz. Although I could build pugz, it core dumped when compressing a simple 200-line text file. For the time being, then, you can use pigz for parallel compression of files into the gzip format and either gunzip or unpigz to decompress them; however, both decompression tools are serial.
Parallel xz
The threaded utilities mentioned in this section all use the XZ file format. The compression utility xz is, itself, threaded. The -T # flag specifies the number of threads (e.g., -T 4 uses four threads). Note the space between the switch and the number of threads. A downside to xz is that the decompression command, unxz, is not multithreaded. You can still decompress .xz files with unxz, but it is run serially rather than in parallel.
pixz
The pixz utility is a parallel version of xz and uses the XZ file format. By default, it uses all possible cores. However, you can specify the number of threads with a command-line argument:
$ pixz -p 4 file.data
Notice the space between the switch and the number of threads. This example uses four threads for compression.
Unlike xz, pixz has a parallel decompress capability with the -d flag:
$ pixz -d -p 4 file.data.xz
As an aside, I built pixz from source. I had to make sure that libarchive and the associated dev components were installed, as well as liblzma and its associated dev components. An important safety tip is always to read the list of dependencies before compiling.
pxz
The threaded compression utility pxz is also compatible with xz. You can specify the number of threads with the -T# option, where # is the number of threads (Listing 1). Notice that no space falls between the -T and the number of threads.
Listing 1: pxz
$ pxz -v -T4 package-list.txt
context size per thread: 25169920 B
package-list.txt -> 4/1 threads: [0 ] 11626 -> 4444 38.225%
Unfortunately, I did not see a decompress option, so you would likely have to use unxz or even pixz to decompress.
Parallel zstd
The next group of utilities use the ZSTD file format. The zstd tool has a multithread capability with the -T# command-line option, where # is the number of threads. (Note that no space falls between the switch and the number of threads.) Both the compression and decompression capabilities should be multithreaded.
pzstd
The parallel version of zstd is pzstd. On my Ubuntu 20.04 system, pzstd was also installed when zstd was installed. However, because zstd itself is multithreaded, I’m not sure how it differs from pzstd.
Compressing a file with pzstd is done in the same manner as with zstd. The command in Listing 2 tells pzstd to use four threads (-p 4).
Listing 2: pzstd
$ pzstd -v -p 4 package-list.txt
Chosen frame size: 8388608
package-list.txt : 41.97% ( 11626 => 4880 bytes, package-list.txt.zst)
Unlike zstd, pzstd uses a space between the flag and the number of threads. To decompress the file, you can use either unzstd or:
$ pzstd -d -p 4 package-list.txt.zst
Note that the decompress command can use multiple threads, and you don’t have to use the same number of threads to decompress as were used to compress.
Parallel lrzip
The lrzip utility has multithread capability. You can specify the number of threads with the command-line argument -p #, where # is the number of threads. Notice the space between the switch and the number of threads.
Parallel bzip2
The next set of utilities is multithreaded and uses the BZIP2 file format. The first parallel utility is lbzip2.
lbzip2
The threaded compression tool lbzip2 is compatible with bzip2. Multiple threads are specified with the option -n #, where # is the number of threads and a space falls between it and the switch (Listing 3).
Listing 3: lbzip2
$ lbzip2 -v -n 4 package-list.txt
lbzip2: compressing "package-list.txt" to "package-list.txt.bz2"
lbzip2: "package-list.txt": compression ratio is 1:2.803, space savings is 64.33%
You can then decompress the file with bzip2, or it can be decompressed with lbzip2:
$ lbzip2 -d -n 64 package-list.txt.bz2
The -d command-line option decompresses the file. Note that you do not have to use the same number of threads to decompress as you did to compress a file. I used four threads to compress and 64 threads to decompress.
pbzip2
Another threaded compression tool compatible with bzip2 is pbzip2, which uses pthreads as the threading mechanism. Because it is fully compatible with bzip2, you could compress with pbzip2 and decompress with bzip2 if you wanted.
You use pbzip2 just like any of the other threaded compression utilities, specifying the number of threads with the -p# option (no space between the switch and the number of threads). The example in Listing 4 includes the verbose (-v) flag, maximum compression on the file chunks (-9), and four threads.
Listing 4: pbzip2 Compression
$ pbzip2 -9 -p4 -v package-list.txt
Parallel BZIP2 v1.1.13 [Dec 18, 2015]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward
# CPUs: 4
BWT Block Size: 900 KB
File Block Size: 900 KB
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 1
Input Name: package-list.txt
Output Name: package-list.txt.bz2
Input Size: 11626 bytes
Compressing data (no threads)...
Output Size: 4151 bytes
-------------------------------------------
Wall Clock: 0.002442 seconds
As mentioned previously, you can decompress the file with bzip2 or pbzip2, although only pbzip2 can do multithreaded decompression. The next example uses pbzip2 with the -d command-line flag to decompress the file (Listing 5).
Listing 5: pbzip2 Decompression
$ pbzip2 -p4 -v -d package-list.txt.bz2
Parallel BZIP2 v1.1.13 [Dec 18, 2015]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward
# CPUs: 4
Maximum Memory: 100 MB
Ignore Trailing Garbage: off
-------------------------------------------
File #: 1 of 1
Input Name: package-list.txt.bz2
Output Name: package-list.txt
BWT Block Size: 900k
Input Size: 4151 bytes
Decompressing data (no threads)...
-------------------------------------------
Wall Clock: 0.001038 seconds
Parallel lzip
The utility plzip is a multithreaded version of lzip and is compatible with lzip, so you can decompress a file with lzip that you compressed with plzip. To specify the number of threads, use:
$ plzip -v -9 -n 32 package-list.txt
package-list.txt: 2.640:1, 37.88% ratio, 62.12% saved, 11626 in, 4404 out.
The -n 32 option tells plzip to use 32 threads to perform the compression. I also included the -9 option for maximum compression and the -v option for verbose output.
To decompress a file, you just add the -d option. It, too, can be a threaded operation with -n #:
$ plzip -v -9 -n 32 -d package-list.txt.lz
package-list.txt.lz: done
Note that you do not have to use the same number of threads to decompress as was used to compress the file. However, if the file was compressed with lzip, specifying more than one thread won’t improve the decompression time, because only one thread will be used.
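One way to picture why a serially compressed file gains nothing from extra threads is to count the independently compressed members in a stream. The sketch below is only an analogy: Python's standard library has no lzip module, so it uses gzip members through zlib as a stand-in, and the helper name split_members is hypothetical. A stream written in one pass has a single member and therefore offers nothing to hand out to additional threads, whereas a chunked stream has one member per chunk.

```python
import gzip
import zlib


def split_members(blob: bytes) -> list[bytes]:
    """Decompress each gzip member of a stream separately, in order."""
    parts = []
    while blob:
        member = zlib.decompressobj(31)  # wbits=31: expect gzip header and trailer
        parts.append(member.decompress(blob) + member.flush())
        blob = member.unused_data  # whatever follows this member's trailer
    return parts
```

For example, len(split_members(gzip.compress(data))) is 1, whereas a stream built from two separately compressed halves yields two parts that could, in principle, be decompressed by two workers.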
Parallel and Multinode
The compression tools mentioned so far are multithreaded, allowing you to use all of the cores on the system, but are limited to use on one node. Given the huge number of cores per socket available today, this can be extremely useful. However, one compression tool that I’m aware of goes beyond multithreading with the use of MPI as the communication protocol, allowing multinode compression, which could give you large amounts of computational capability for compressing really large files.
mpibzip2
The mpibzip2 utility is exactly as it sounds: the bzip2 algorithm and file format, with MPI handling the communication between the processes that compress the chunks. It usually doesn’t come prebuilt for Linux distributions, so you need to download the source and build it with the mpic++ wrapper that usually comes with MPI implementations.
For this article, I installed the prebuilt Open MPI binaries for building mpibzip2 and then just typed make to build the code. You run mpibzip2 like any other MPI code:
$ mpirun -np 2 ./mpibzip2 -9 file.txt
For this example, I just used the local node and didn’t test on a distributed system (hence no host file). I chose to use two processes (-np 2) and maximum compression (-9).
One gotcha with mpibzip2 is that you need to specify at least two processes (-np 2); otherwise, it will error out. In reality, if you are only using one process, you could just run bzip2, so this “restriction” makes sense.
I’m also assuming that for mpibzip2 to work correctly, you need some sort of shared filesystem if you run the command in a distributed manner. Although this assumption seems logical, it was never explicitly stated.
The mpibzip2 utility is compatible with bzip2, so if you compressed a file with mpibzip2, you can decompress it with bzip2. If decompression performance is important, you can use mpibzip2 to decompress a file with the -d (decompress) flag.
Encryption
Some of the compression tools mentioned in this article and the previous article can also encrypt files as they are being compressed. Not all utilities have this ability, although you can use them with gpg to encrypt while compressing files.
I am not an encryption expert, so I cannot judge the strength of the encryption algorithms used. I suggest reading up on the algorithms and judging for yourself.
A general method for encrypting compressed files uses the standard *nix pipes, as the example of gzip used along with gpg shows in Listing 6.
Listing 6: gzip and gpg
$ gzip -9 -c package-list.txt | gpg --symmetric --cipher-algo aes256 -o secure.gz.gpg
gpg: keybox '/home/laytonjb/.gnupg/pubring.kbx' created
The command line pipes the output of gzip, the compressed data, to gpg, which does the encryption. gpg prompts twice for a passphrase, so don’t forget the passphrase or you won’t be able to recover the file (unless you can crack it). The -c flag matters here: It makes gzip write the compressed data to stdout and leave package-list.txt untouched. Without -c, gzip replaces the base file with package-list.txt.gz and sends nothing down the pipe, so gpg has no compressed data to encrypt into secure.gz.gpg.
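The same pipe pattern can be driven from a script. The following is a minimal sketch under stated assumptions: the helper name gzip_into_command is hypothetical, and it simply streams gzip-compressed data into the standard input of whatever command you give it, so no unencrypted compressed copy is left on disk.

```python
import gzip
import shutil
import subprocess


def gzip_into_command(src_path: str, command: list[str]) -> int:
    """Stream a gzip-compressed copy of src_path into command's stdin."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with open(src_path, "rb") as src:
        # GzipFile writes compressed bytes straight into the pipe.
        with gzip.GzipFile(fileobj=proc.stdin, mode="wb", compresslevel=9) as gz:
            shutil.copyfileobj(src, gz)
    # GzipFile does not close the underlying pipe, so close it explicitly
    # to signal end-of-input to the child process.
    proc.stdin.close()
    return proc.wait()
```

With the gpg options from Listing 6, a call might look like gzip_into_command("package-list.txt", ["gpg", "--symmetric", "--cipher-algo", "aes256", "-o", "secure.gz.gpg"]). I am assuming gpg asks for the passphrase through its own prompt rather than stdin, as it did in the listing.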
zip and Encryption
The zip compression utility can encrypt a file as part of compression with the -e (--encrypt) flag. When it encrypts the file, it will prompt you for a password, or you can include the password on the command line with the -P <password> (--password <password>) flag. Note that specifying the password on the command line is insecure, so it is not recommended (please read the man page for zip).
The zip command in Listing 7 compresses and encrypts a file. Notice that it prompts twice for a password.
Listing 7: zip: Compress and Encrypt
$ zip package-list.txt.zip -v -e package-list.txt
Enter password:
Verify password:
adding: package-list.txt (in=11626) (out=4522) (deflated 61%)
total bytes=11626, compressed=4534 -> 61% savings
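The encryption in zip has drawn criticism: The zip man page itself describes the standard encryption provided by zipfile utilities as relatively weak and recommends stronger tools where security is truly important. I would suggest reading about this to understand what algorithms are used in the current zip tool.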
7-Zip and Encryption
The 7z utility encrypts with a 256-bit Advanced Encryption Standard (AES) cipher. The encryption works for both the files themselves and the file hierarchy in the archive. (Recall that 7z can create archives containing multiple files and file hierarchies.) Encrypting both provides more security than just encrypting the files.
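To create an archive of compressed files with 7z that are encrypted, use the command in Listing 8. The -p flag indicates that a password should be used to encrypt, -mhe also encrypts the archive headers (the file hierarchy), and -mx=9 selects the highest compression level. Notice that 7z prompts twice for a password.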
Listing 8: 7z Encrypted Archive
$ 7z a -p -mx=9 -mhe data.7z package-list.txt
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,64 CPUs AMD Ryzen Threadripper 3970X 32-Core Processor (830F10),ASM,AES-NI)
Scanning the drive:
1 file, 11626 bytes (12 KiB)
Creating archive: data.7z
Items to compress: 1
Enter password (will not be echoed):
Verify password (will not be echoed) :
Files read from disk: 1
Archive size: 4600 bytes (5 KiB)
Everything is Ok
The final archive is named data.7z. To decompress and decrypt it, you just extract or decompress the archive (Listing 9). It will prompt you for the archive password.
Listing 9: 7z Decompress and Decrypt
$ 7z e data.7z
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,64 CPUs AMD Ryzen Threadripper 3970X 32-Core Processor (830F10),ASM,AES-NI)
Scanning the drive for archives:
1 file, 4600 bytes (5 KiB)
Extracting archive: data.7z
Enter password (will not be echoed):
--
Path = data.7z
Type = 7z
Physical Size = 4600
Headers Size = 200
Method = LZMA2:12k 7zAES
Solid = -
Blocks = 1
Everything is Ok
Size: 11626
Compressed: 4600
Encryption with lrzip
The compression tool lrzip also can encrypt files. From the man pages, it uses a SHA-512 repetitive hashing of the password along with a salt to provide a key used by AES-128 to do block encryption.
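To make the idea of repetitive salted hashing concrete, here is a small sketch of stretching a passphrase into 128 bits of key material with SHA-512. It is emphatically not lrzip's actual construction: the function name stretch_passphrase, the round count, the way the salt is mixed in, and the truncation to 16 bytes are all assumptions made for illustration. The point is only that many hash rounds make each password guess more expensive, and the salt makes precomputed tables useless.

```python
import hashlib


def stretch_passphrase(passphrase: str, salt: bytes, rounds: int = 100_000) -> bytes:
    """Derive 16 bytes of key material by repeatedly hashing salt and passphrase."""
    digest = hashlib.sha512(salt + passphrase.encode("utf-8")).digest()
    for _ in range(rounds - 1):
        # Feed the previous digest back in so each round depends on the last.
        digest = hashlib.sha512(digest + salt).digest()
    return digest[:16]  # AES-128 needs 128 bits of key material
```

The same passphrase and salt always produce the same key, which is why the tool can decrypt later; a different salt produces an unrelated key even for the same passphrase.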
To compress and encrypt files with lrzip, use the command in Listing 10. Decompressing package-list.txt.lrz is the same as using lrunzip for decompression on a nonencrypted file (Listing 11).
Listing 10: lrzip Compress and Encrypt
$ lrzip -e package-list.txt
Enter passphrase:
Re-enter passphrase:
Output filename is: package-list.txt.lrz
package-list.txt - Compression Ratio: 2.537. Average Compression Speed: 0.000MB/s.
Total time: 00:00:05.11
Listing 11: lrunzip Decompress and Decrypt
$ lrunzip package-list.txt.lrz
Output filename is: package-list.txt
Enter passphrase:
Decompressing...
Output filename is: package-list.txt: [OK]
Total time: 00:00:03.77
Compression Tests
The obvious question that everyone has been waiting to ask is, “What is the best compression tool?” I can only answer that question with the classic answer: “It depends.” Truly there is no single “best” compression tool because people have different definitions of “best.” You should test the various utilities yourself on your data.
Also, you cannot find a single measure for determining the best compression tool. Getting the highest compression ratio, which creates the smallest file, is one aspect that users find important and can be particularly important for files that might not be decompressed for a long time.
Another important metric is compression time. How long does it take to compress a file? In some situations, the time it takes to compress a file is more important than the compression ratio.
Decompression speed can also be very important. If you are looking for something specific and must search through files, the time it takes to decompress and search through them is critical. In other situations, decompression speed is important for containers (e.g., Singularity, which uses SquashFS).
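If you want to measure these three metrics on your own data, a small harness is enough to get started. The sketch below uses the zlib, bz2, and lzma modules from Python's standard library as rough stand-ins for the gzip, bzip2, and xz formats; the names CODECS and benchmark are hypothetical, and the numbers it produces for the library bindings will not match the command-line tools exactly.

```python
import bz2
import lzma
import time
import zlib

# Each entry maps a label to a (compress, decompress) pair of callables.
CODECS = {
    "gzip-like (zlib)": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bzip2 (bz2)": (lambda d: bz2.compress(d, 9), bz2.decompress),
    "xz (lzma)": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes) -> dict[str, tuple[float, float, float]]:
    """Return (ratio, compress seconds, decompress seconds) per codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        assert restored == data  # sanity check: the round trip must be lossless
        results[name] = (len(data) / len(packed), t1 - t0, t2 - t1)
    return results
```

Calling benchmark() on the contents of a representative file from your own workload gives a first impression of the trade-offs before you commit to a tool.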
Underlying the metrics of compression ratio, speed of compression, and speed of decompression is the data type itself. Some data types (e.g., source code, text files) are very easy to compress, but files that are in binary form (e.g., audio, video, images) are more difficult to compress, often because their formats already apply compression. Other data types tend to fall between these two guideposts.
With enough googling, you will find several articles that use various metrics to compare compression codes and compression tools, and you will find tests that use different kinds of data or a distribution of data types. Some of these tests contradict each other in their general conclusions, which is usually caused by differences in the data types tested (be sure to read the fine print about the tests and the data types).
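The effect of the data type is easy to see with a toy experiment. In this sketch, repetitive text stands in for source code and random bytes stand in for media that is already compressed; both stand-ins, and the helper name ratio, are assumptions made for illustration rather than real test data.

```python
import os
import zlib


def ratio(data: bytes, level: int = 9) -> float:
    """Uncompressed size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))


# Repetitive input, loosely resembling source code or log files.
text_like = b"int main(void) { return 0; }\n" * 4000
# Random bytes, a stand-in for already-compressed audio, video, or images.
random_like = os.urandom(len(text_like))
```

Comparing ratio(text_like) with ratio(random_like) shows the repetitive input shrinking dramatically while the random input does not shrink at all.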
When I first started with *nix systems many years ago, data storage was very limited. When compiling applications, I would quickly erase any intermediate files (typically object files) and either erase or compress data files as soon as I could. After I finished working and before I went to class or went home, I compressed everything.
When I got my own computer with my own hard drive, I was still stubborn with storage space because I was paying for it, and it wasn't cheap (I was in graduate school). So, I followed the same rules: Get rid of unneeded files when you don’t need them anymore; compress or erase data files when you are finished using them; and before shutting down, compress everything you were working on.
These habits are part of me today, so any recommendations I give are reflected through this mirror. That said, here are my general recommendations:
- I use gzip -9 for most of my everyday compression needs. For me, the compression ratio is more important than the time it takes to compress or decompress, so I typically use this maximum compression.
- If files are large, then I switch to bzip2 -9 for files that are unlikely to be decompressed for months.
- For even larger files, I use a parallel compression tool with maximum compression. Classically, I've used pigz and pbzip2.
- Lately, I have become interested in zstd because it focuses on fast decompression regardless of the compression level used, which makes compression more practical for everyday I/O.
- On the other hand, I’m also interested in lrzip because of its ability to use “long distance” redundancies and perhaps compress binary files a bit more than other tools.
Summary
I’ve enjoyed writing these two articles on compression because it allowed me to revisit my old data compression habits and how they influenced what I do today. It’s nice to see that developers are not giving up on compression tools as a “solved” problem but are using newer ideas and combinations of ideas to provide extra functionality.
With so many cores available and with files getting larger, parallel compression tools were inevitable. In this article, I presented a number of parallel compression tools that use various file formats and compression algorithms.