File Compression for HPC
I won’t bore you with some sort of pontification on the amount of data in the world. Rather, I want to point out that much of this data is “cold”; that is, it hasn’t been used or read in a long time. Because storage space costs real money, not doing everything possible to minimize this cold data is a waste of money.
One way to reduce the size of data is by compressing it. From a very high-level perspective, compression looks for patterns within a file and replaces them with a much smaller representation. In some files it is difficult or impossible to find repeated patterns, so they cannot be compressed; for other files, algorithms can easily find these patterns. Fundamentally, therefore, how much a file can be compressed varies with the specific type of data.
In this article, I present some common, and perhaps some uncommon, compression tools in Linux. It won’t be comprehensive, but I hope to hit the highlights for most admins and users.
Compression
Many compression tools have unique capabilities and a specific file format for writing the compressed data. Almost all of the time, this file format is published (open sourced), but on occasion, it is not published and a little reverse engineering has to take place.
Of course, you can decompress any compressed data file, getting it back to its original state; otherwise, what’s the point? However, data files can have a very long life. I worked on a project for which I had to access 20–25-year-old compressed data. Some of the specific compressed formats were no longer used or even documented, and I could find no obvious clues as to what tool was used to compress the data or how it could be decompressed. Therefore, it is extremely important to know what tool was used to compress the data and the file format of the compressed file.
Data compression utilities typically work in one of two ways. The first is to compress files individually. If you tell the tool to compress a single file or a list of files, it compresses each file individually. For example, if you compress file.data, it will create a file named file.data.<compressed>, where <compressed> is a file extension that indicates the utility used.
The second way is to create a file archive of compressed files by passing options to the compression utility, including a list of files, which are then compressed and put into a single file (an archive). This method uses a single command, rather than the more traditional method of first archiving files with something like tar and then compressing the archive.
Data compression is not a “free” operation. It takes computational resources (CPU and memory), as well as time, to compress data. If you need that data at some point, you must decompress it, which also takes computational resources and time.
The design of a compression utility focuses on what is considered most important. For example,
- Is getting the most data compression the most important focus?
- Is quickly compressing a file the primary focus?
- Is decompressing the data the number one priority?
Some other considerations also go into the design, such as:
- Single file compression or archiving
- Flexibility to trade compression for speed
- The number of compression algorithm options
- Compression of certain file types
- Focus on compression of certain file sizes
All of these aspects feed into the design.
gzip
Gzip is probably the most common compression tool on Linux. Although its first release was in 1992, it is still in development, with the current version being 1.11 (perhaps illustrating that it is quite mature). Gzip uses the Deflate algorithm for compressing. Deflate combines Lempel-Ziv-Storer-Szymanski (LZSS) compression with Huffman coding and operates on blocks of up to 64KB.
Note that gzip should not be confused with zip; they are different compression tools.
When gzip compresses a file, it adds a .gz extension to the end of the file name. For example, file.data would become file.data.gz. If you run the decompress command gunzip, the file file.data.gz would become file.data. If you like, you can refer to this as “in-place” file compression.
Gzip has various options that allow you to compress files in the directory structure recursively. It also has an option for specifying the level of compression you want, from 1 to 9. A sample command would be,
$ gzip -9 file.data
which uses maximum compression. As the level of compression increases, the amount of time and the amount of memory needed to perform the compression increase. Compression level 6 is the default and is a reasonable trade-off between the compression level and the amount of time it takes to perform the compression.
Gzip will overwrite the original file with the compressed file. In the previous command, the file file.data is replaced with file.data.gz.
Although the gzip implementation on Linux typically compresses one file at a time, you can specify more than one file on the command line, and it will compress each file separately. Specifying the -r option recursively compresses files in a directory structure.
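For example, the following sketch (the file and directory names are hypothetical) compresses two files individually and then sweeps an entire directory tree:
$ gzip file1.txt file2.txt
$ gzip -r data_dir/
The first command creates file1.txt.gz and file2.txt.gz; the second compresses every file under data_dir/ in place.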
The -c option causes gzip to write the compressed output to standard output, which lets you concatenate compressed files. For example, start by compressing a file with the -c option and redirect the output to a separate .gz file:
$ gzip -c package-list.txt > jeff.gz
This command does not create package-list.txt.gz; instead, it writes the compressed data to jeff.gz, leaving the original file intact. Next, you can compress a second file and append it to the archive jeff.gz:
$ gzip -c report.pdf >> jeff.gz
This command doesn’t compress report.pdf to report.pdf.gz but compresses and adds the file to jeff.gz.
When you decompress the archive jeff.gz with
$ gunzip -c jeff.gz
it decompresses both members and writes their contents, concatenated, to standard output.
For many years, the tar utility and gzip were used in combination because the gzip implementation on Linux could not create a file archive (not many people use the concatenate feature). Now, tar can create and extract .tar.gz files directly without separately invoking gzip.
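For example, a minimal sketch with a hypothetical directory name, using tar’s z option to route the archive through gzip:
$ tar czf results.tar.gz results/
$ tar xzf results.tar.gz
The first command creates a gzip-compressed archive of the results/ directory; the second extracts it.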
zip
ZIP is really a compressed file format, but compression utilities that implement this format are often just called zip. The format permits the use of several compression algorithms, but the Deflate algorithm is the most common. Classically, any compression code that uses the ZIP file format creates files ending in .zip, although you can usually name the compressed file whatever you like. However, ending the file name in .zip lets others know that the file was compressed in the ZIP format.
A Linux utility, appropriately named zip, uses the ZIP file format. Zip allows you to create a compressed archive so you don’t have to use tar. To use the zip utility on Linux, compress a file with:
$ zip file.data.zip file.data
On the command line, you first specify the compressed file, which I chose to end in .zip, followed by the files you want to add to the Zip archive.
In this example, the Zip archive file.data.zip contains only one file: file.data. When you list more than one file name on the command line, you create a file archive (several files in one compressed file).
A recursive option (-r) follows a directory structure. Another recursive option (-R) allows you to specify a file pattern (e.g., *.txt).
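As a quick sketch (the directory name and pattern are hypothetical), the first command archives an entire directory tree, and the second recurses from the current directory, adding only files that match the pattern:
$ zip -r project.zip project/
$ zip -R textfiles.zip "*.txt"
Quoting the pattern keeps the shell from expanding it before zip sees it.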
Many of the Linux Zip utility options are listed in the man page.
The unzip “unarchive” or “decompress” utility decompresses a .zip file archive.
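Before decompressing, you can peek inside an archive with the -l option; a short sketch using the archive created earlier:
$ unzip -l file.data.zip
$ unzip file.data.zip
The first command only lists the contents; the second actually extracts the files.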
bzip2
The Bzip2 compression tool has become increasingly popular on Linux because it has a higher compression ratio (i.e., it creates smaller compressed files) than Gzip, particularly for text files such as code source files. However, the price of smaller compressed files is that more memory and more processing resources are used relative to something like Gzip.
To achieve the higher compression, bzip2 uses several layers of compression techniques. The key is the Burrows-Wheeler Transform (BWT) algorithm, which works particularly well on strings found in source code and text documents.
Bzip2 is not capable of creating a file archive, so it’s a single-file compression tool only. However, if you use bzip2 in combination with tar, you can create compressed archives.
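For example, a minimal sketch with a hypothetical directory name, using tar’s j option to route the archive through bzip2:
$ tar cjf results.tar.bz2 results/
$ tar xjf results.tar.bz2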
Bzip2 is simple to use and is similar to Gzip:
$ bzip2 file.data
This command compresses file.data and creates file.data.bz2. The original file is “overwritten”; that is, the original file is replaced with the compressed file and no longer exists. Notice that the utility uses the extension *.bz2, so you know it has been compressed with bzip2.
Even though you can’t create a compressed archive with bzip2, you can compress more than one file on the same command line:
$ bzip2 file1.txt file2.txt
This command compresses the two files individually, creating file1.txt.bz2 and file2.txt.bz2. If you want to preserve the original file, you can use:
$ bzip2 -c file.data > file.data.bz2
The -c option allows you to redirect the compressed file to file.data.bz2, which leaves file.data intact. You can also use the -k option to do the exact same thing, but with a simpler command line:
$ bzip2 -k file.data
To get a little more information about how bzip2 achieves compression, you can add the -v (verbose) flag:
$ bzip2 -v package-list.txt
package-list.txt: 2.801:1, 2.856 bits/byte, 64.30% saved, 11626 in, 4151 out.
As with gzip, bzip2 has compression levels ranging from 1 to 9. To use maximum compression, you put the compression level on the command line as a flag:
$ bzip2 -9 file.data
The decompression tool for bzip2 is bunzip2.
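Its use mirrors the other decompressors:
$ bunzip2 file.data.bz2
This command restores file.data and removes the .bz2 file (add -k if you want to keep it).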
7-Zip
The 7-Zip utility uses a different compression file format named 7z. The file format supports different compression techniques:
- LZMA
- LZMA2 (multithreaded support)
- Bzip2
- PPMd
- Deflate
The format also supports encryption and can be used as a file archive.
The Linux implementation of 7-Zip is named p7zip, which provides the 7z command-line tool. I prefer just to use 7z on the command line:
$ 7z a files.7z file.data
By default, 7z creates a compressed archive, even if you compress only a single file. First, you need to tell it that you want to create an archive with the a immediately after the command. After that option is the name of the archive. You can name it what you like, but it is recommended that you use a .7z extension to indicate it was created with 7z. After the archive name is the list of files you want to compress. In this case, it’s a single file.
Because 7z creates an archive, it does not overwrite the original file. Some people find this very useful, especially if you are going to move the archive to a different location.
When you give 7z a list of files to compress into a single archive, you can also give it a directory name, and it will compress all files in that directory into a single archive. 7z also has a recursive option, -r, for walking through a directory structure.
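For example (the archive and directory names are hypothetical), you can archive a whole directory directly, or use -r with a pattern to pick up matching files throughout the tree:
$ 7z a project.7z project/
$ 7z a -r notes.7z "*.txt"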
You can specify the particular compression technique and the level of compression on the command line, as well:
$ 7z a -m0=lzma -mx=9 files.7z file.data
This command specifies the lzma compression method and compression level 9 (maximum).
Because 7z has many options, it is potentially not as easy to use as gzip, zip, or bzip2; however, that flexibility gives you much more control. A good example is specifying maximum compression, as in the previous command.
The e option,
$ 7z e files.7z
extracts files from an archive.
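Two related function letters are worth knowing: l lists the contents of an archive, and x extracts while preserving the stored directory paths (e flattens everything into the current directory):
$ 7z l files.7z
$ 7z x files.7z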
xz
XZ is a set of utilities built around the LZMA compression SDK. LZMA is somewhat similar to the LZ77 algorithm, with generally better compression than bzip2 but still good decompression performance.
According to the bzip2 Wikipedia article, LZMA generally has higher compression ratios than bzip2 but is also slower. On the other hand, LZMA is faster in decompression than bzip2.
The xz utility is very common in almost all Linux distributions. By default, when compressing a file, xz adds a *.xz extension to the file, which indicates that the file was compressed with xz, because the utility has its own file format. If you choose to rename an xz file, it is recommended that you keep the file extension.
As with gzip, xz compresses the file in place so that a file named file.data becomes file.data.xz. A simple example of compressing and decompressing a file is:
$ xz file.data
$ unxz file.data.xz
XZ has many options, and I’ll cover just a few in this article. XZ is similar to Gzip in that it has various levels of compression, starting with 1 having the least amount of compression but being the fastest, to 9 having the most compression but being the slowest. You can specify this level of compression on the command line as an option:
$ xz -9 file.data
As with gzip, the default compression level is 6.
The -e (extreme) compression option will try to get even more compression, but usually at the cost of being slower. You can use it, as well, with the various levels of compression, as in the following example with maximum compression:
$ xz -9e file.data
The -9e option will try to compress file.data as much as it possibly can, but the result won’t always be the smallest, depending on the data in the file. These options will very likely take the longest amount of time to compress.
XZ supports multiple threads with the -T option, so you can take advantage of multiple cores:
$ xz -T0 file.data
The number 0 tells it to use as many threads as there are cores; otherwise, you can specify the exact number of threads you want to use with a non-zero number. Be sure you don’t specify more threads than you have cores (unless each core can accommodate multiple threads); otherwise, you oversubscribe the cores and everything runs much slower. Although xz is multithreaded, the decompressor, unxz, is not.
One last option you should probably use is -v, which produces verbose, although not overwhelming, output about the compression.
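Putting these options together, a sketch that asks for maximum compression across all cores with verbose output:
$ xz -9e -T0 -v file.data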
lz4
Some very good compression algorithms can achieve very high levels of compression, but they take an inordinate amount of time and computational resources. The lz4 algorithm focuses on a trade-off between speed and the level of compression. In a general sense, it has a compression ratio similar to the Deflate algorithm, but with compression speeds several times faster than Deflate.
An implementation of lz4 for Linux has been part of the Linux kernel since 3.11 and has been used by SquashFS since kernel version 3.19-rc1.
The lz4 utility is easy to use and fairly similar to the other compression utilities already discussed:
$ lz4 file.data
This command preserves the original file and creates a new file, file.data.lz4. Note that it adds the extension .lz4 to the compressed file, indicating the method of compression.
lz4 has 12 compression levels (1-12). For maximum compression, use:
$ lz4 -12 file.data
You can also enable ultrafast compression levels with a slightly modified compression option:
$ lz4 --fast=12 file.data
The decompression tool is unlz4.
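As with the lz4 command itself, usage is simple:
$ unlz4 file.data.lz4
Alternatively, lz4 -d does the same thing.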
zstandard
The zstandard (zstd) algorithm was developed at Facebook by the author of lz4 and uses its own file format. Zstd was specifically written to focus on two things: compressing small data and decompressing very rapidly.
It has compression levels ranging from -7 (the fastest, with the least compression) to 22 (the slowest, with the greatest compression). According to the zstd site, the compression speed varies greatly with the compression level, by about a factor of 20, but the decompression speed doesn’t vary much with the compression level.
The “normal” compression levels for zstd run from 1 to 19, with lower numbers offering less compression but faster compression times; the default compression level is 3. You can use compression levels 20 to 22 with an additional option. For even faster compression, but with lower compression ratios, you can use negative numbers: the lower the number, the less compression.
As with all the compression tools mentioned here, zstd is easy to use:
$ zstd package-list.txt
package-list.txt : 41.60% ( 11626 => 4836 bytes, package-list.txt.zst)
By default, zstd gives you one line of information about what it did and does not overwrite the original file. In this example, it creates a new file, package-list.txt.zst. Also note that it uses the file extension .zst, indicating the method of compression.
If you want to overwrite the original file, you can use the --rm option:
$ zstd --rm package-list.txt
package-list.txt : 41.60% ( 11626 => 4836 bytes, package-list.txt.zst)
To decompress a file, use the unzstd command:
$ unzstd package-list.txt.zst
To perform maximum compression on a file, use:
$ zstd -19 package-list.txt
package-list.txt : 38.02% ( 11626 => 4420 bytes, package-list.txt.zst)
You can also use the --ultra flag to unlock higher compression levels up to 22,
$ zstd --ultra -22 package-list.txt
package-list.txt : 38.02% ( 11626 => 4420 bytes, package-list.txt.zst)
(kind of like turning the amplifier up to 11).
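The negative levels at the other end of the scale are selected with the --fast option; for example, --fast=5 corresponds to level -5:
$ zstd --fast=5 package-list.txt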
lrzip
Con Kolivas is an anesthetist in Australia who is also a Linux kernel developer. He has been focused on desktop performance for many years, but he has also created a compression tool named lrzip. In contrast to zstd, which focuses on small file compression, lrzip focuses on compressing large files (>10-50MB).
lrzip is an extended version of rzip. Unlike gzip and bzip2, lrzip exploits “long distance redundancy,” which can theoretically allow for more compression by looking over larger portions of the file, if not the entire file. Typically, this means more memory use, but rzip uses some clever techniques to avoid increasing the amount of memory used during compression.
rzip uses a large window to encode large chunks of duplicated data that are long distances from each other within the file. It then uses the Burrows-Wheeler Transform, also used by bzip2, to compress the output from the large window encoding.
lrzip takes the rzip algorithm and adds LZMA, LZO, Deflate, Bzip2, and ZPAQ compression techniques to the large search window. By default, the LZMA algorithm is used. lrzip also adds encryption, so as you compress a file, you can encrypt it as well.
lrzip is a file compressor, not an archiver. Therefore, you will have to use tar in conjunction with lrzip to take many files, compress them, and put them into a single file.
The use of lrzip for simple compression is straightforward:
$ lrzip package-list.txt
Output filename is: package-list.txt.lrz
package-list.txt - Compression Ratio: 2.604. Average Compression Speed: 0.000MB/s.
Total time: 00:00:00.04
By default, lrzip does not overwrite the original file. Notice that it also gives you a bit more information than other compression tools. You can compress multiple files with a single command. If you want to compress a directory, you can use the command lrztar [directory].
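For example, with a hypothetical directory name:
$ lrztar mydir
This command should produce a single compressed archive named mydir.tar.lrz.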
By default, lrzip adds a .lrz extension to the file. To decompress it, use the lrunzip command:
$ lrunzip file.lrz
lrzip has several useful options:
- -r recursively compresses a directory
- -v or -vv creates verbose output
- -b uses bzip2 compression
- -g uses gzip compression
- -l uses lzo compression
- -L n uses compression level n (1-9, with 7 being the default)
- -p n uses n threads
In general, lrzip can use quite a bit of memory, sometimes as much as it can get, to perform the compression. You can limit the maximum amount of RAM it can use with the -m flag.
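For example, a sketch combining a few of the options above on a hypothetical file, selecting bzip2 compression at level 9 with four threads:
$ lrzip -b -L 9 -p 4 file.data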
lzip
The lzip compression utility uses LZMA as the compression algorithm. As with gzip, it has compression levels ranging from 0 to 9. Compressing a file is simple enough:
$ lzip -9 file.data
This command replaces the original file with a compressed file. In this case, it is file.data.lz. Notice that the file extension .lz indicates the file was compressed with lzip.
You can keep the original file by telling lzip to write to standard output with the -c option and redirecting the output to a file:
$ lzip -9 -c file.data > jeff.data.lz
To decompress a .lz file, use
$ lzip -d file.data.lz
with the -d option.
Summary
Compression tools are one of the unsung heroes of the HPC world. Many years ago, when storage space was in short supply, you compressed everything as much as you could so you didn’t have to erase anything. Data storage has become much more plentiful, but it still seems to be in short supply.
If you ask an HPC user if they use all their data, all of the time, the answer you get is likely to be “of course!”. However, if you were to scan their /home directory or other directories where they store data and examine the access dates of the files, I think you would find a different truth.
If you are running out of data storage, or it has gotten expensive, you might want to consider an updated data policy. It’s simple enough to sweep user directories with a script and examine the last access date of all files. If a file hasn’t been accessed in a long time (with a specific definition of “long”), then it can be compressed. Many might feel that this approach is a bit heavy handed, but it is not erasing the files, just compressing them.
As an experiment, just try collecting the last access time of a user’s data. I think you will be surprised at how long it has been since some files were accessed.
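As a sketch, GNU find can do both the scan and the sweep; the path and the one-year threshold here are hypothetical, and this assumes the filesystem actually records access times (many are mounted with relatime or noatime):
$ find /home/jeff -type f -atime +365
$ find /home/jeff -type f -atime +365 -exec gzip -9 {} \;
The first command just lists files not accessed in more than a year; the second compresses them in place.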