Read-only file compression with SquashFS
Data Crush
Using SquashFS
Using SquashFS is not difficult; the process comprises only two steps. The first step is to create a filesystem image using the SquashFS tools. You can create an image of an entire filesystem, a directory, or even a single file. This image can then be mounted directly (if it is a device) or mounted using a loopback device (if it is a file).
The tool that creates the image is called mksquashfs, which has a number of options that allow control over virtually every aspect of the image. The man page is not very long, and it's definitely worth a look at the various options. Any user can create an image of any of their data; however, mounting it requires root access (or at least sudo access).
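To give a flavor of the options, here's a sketch of a mksquashfs call using a few of the more common switches (the paths are made up, and the set of compressors available depends on how your kernel and squashfs-tools were built):

$ mksquashfs /data/project /tmp/project.sqsh -comp xz -b 262144 -e scratch -noappend

Here, -comp selects the compression algorithm, -b sets the block size in bytes, -e excludes the scratch subdirectory from the image, and -noappend overwrites an existing image instead of appending to it.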
As an example, I'll take a directory (/home/laytonjb/20170502) on my desktop where I have stored PDFs, ZIP files, and other bits of information and articles that I collect throughout the month (I'm a digital hoarder). I want to compress this directory and all its subdirectories and files. Then, I want to mount it read-only so I can access the information but still save some space.
Before compression the directory was about 358MB:
$ du -sh
358M    .
The first step is to create the image file, which can be done by the user as long as the resulting image is stored somewhere the user has permission (Listing 2). Notice that the command gives a reasonable amount of output without being too verbose.
Listing 2
Creating a SquashFS Image File
$ time mksquashfs /home/laytonjb/20170502 /home/laytonjb/squashfs/20170502.sqsh
Parallel mksquashfs: Using 4 processors
Creating 4.0 filesystem on /home/laytonjb/squashfs/20170502.sqsh, block size 131072.
[================================================-] 2904/2904 100%
Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
        compressed data, compressed metadata, compressed fragments, compressed xattrs
        duplicates are removed
Filesystem size 335196.73 Kbytes (327.34 Mbytes)
        91.53% of uncompressed filesystem size (366234.01 Kbytes)
Inode table size 8424 bytes (8.23 Kbytes)
        50.01% of uncompressed inode table size (16846 bytes)
Directory table size 2199 bytes (2.15 Kbytes)
        63.72% of uncompressed directory table size (3451 bytes)
Xattr table size 54 bytes (0.05 Kbytes)
        100.00% of uncompressed xattr table size (54 bytes)
Number of duplicate files found 1
Number of inodes 94
Number of files 93
Number of fragments 5
Number of symbolic links 0
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 1
Number of ids (unique uids + gids) 1
Number of uids 1
        laytonjb (1000)
Number of gids 1
        laytonjb (1000)
I used the command defaults, which means a block size of 128KiB (131,072 bytes) and gzip compression. In the output, SquashFS states that it was able to compress the data to 91.53% of its uncompressed size, or about 327MB (327.34MB).
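If you want to sanity-check the numbers that mksquashfs reports, you can look at the image file itself; the sizes you see should line up with the listing above (output omitted here):

$ ls -lh /home/laytonjb/squashfs/20170502.sqsh
$ du -sh /home/laytonjb/squashfs/20170502.sqsh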
Notice that I used the time command to measure how long mksquashfs took to run. The results were:

real    0m7.675s
user    0m29.074s
sys     0m1.002s
This looks to be pretty fast for compressing 358MB of data (on an SSD).
The next step is to mount the SquashFS image as you would any other filesystem. Out of the box, root needs to do this, because an unprivileged user is not allowed to mount arbitrary filesystems.
$ mount -t squashfs /home/laytonjb/squashfs/20170502.sqsh /home/laytonjb/20170502_new -o loop
$ mount
...
/home/laytonjb/squashfs/20170502.sqsh on /home/laytonjb/20170502_new type squashfs (ro,relatime,seclabel)
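If you want the image mounted automatically at boot, an /etc/fstab entry along these lines should do the trick; this is just a sketch using the paths from the example (the loop option tells mount to set up the loopback device for you):

/home/laytonjb/squashfs/20170502.sqsh  /home/laytonjb/20170502_new  squashfs  ro,loop  0 0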
Now look at /home/laytonjb/20170502_new to make sure everything is there and permissions are as expected (Listing 3). I can look at the files, and they are owned by me.
Listing 3
Viewing Mounted SquashFS Image
$ ls -lsat
...
 830 -rw-r--r--. 1 laytonjb laytonjb  848854 Jun 10 13:58 mesos.pdf
 535 -rw-r--r--. 1 laytonjb laytonjb  546505 Jun 10 13:58 Martins2003CSD.pdf
8803 -rw-r--r--. 1 laytonjb laytonjb 9013307 Jun 10 13:58 Hwang2012c.pdf
...
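When you no longer need the data online, unmount the image as usual. If you ever need a writable copy of some files back on disk, unsquashfs (part of the same tools package) will extract them; the restore directory below is just an example:

$ sudo umount /home/laytonjb/20170502_new
$ unsquashfs -d /home/laytonjb/restore /home/laytonjb/squashfs/20170502.sqsh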
Optimization Study
The two major options you are likely to use are -comp [comp] and -b [bsize]. The first option allows you to specify the compression algorithm used (from the options listed earlier). The second option allows you to control the block size (from the default of 128KiB to the maximum of 1MiB). Larger block sizes can help improve the amount of compression.
A simple command that uses LZMA compression and a 1MiB block size would be:
$ mksquashfs /home/laytonjb/20170502 /home/laytonjb/squashfs/20170502.sqsh -comp lzma -b 1048576
The directory I've used in the examples is full of PDF and ZIP files. I didn't expect it to compress too much, but I did get some compression. As an experiment, I tried all four compression techniques with the default block size, 128KiB, and the maximum block size, 1MiB (Table 1).
Table 1
Compression and Block Size
Compression Technique | Block Size | User Time | Compressed Size (% of original)
---|---|---|---
gzip | 128KiB | 00:29.074 | 91.53%
gzip | 1MiB | 00:31.050 | 91.35%
lzo | 128KiB | 01:36.262 | 92.31%
lzo | 1MiB | 01:47.967 | 92.08%
xz | 128KiB | 03:14.064 | 90.49%
xz | 1MiB | 03:47.730 | 88.71%
lzma | 128KiB | 03:10.494 | 90.48%
lzma | 1MiB | 03:44.004 | 88.78%
Pretty obviously, the fastest compression technique is gzip, with little difference in the user time for either block size (a two-second difference, or about 7%). The larger block size did give a very tiny bit of extra compression.
The xz and lzma algorithms result in the most compression and take the longest, much longer than gzip, but even with the default block size, they reduce the data by about 10%. With the largest block size, they squeeze out a little more: just over 11%.
You might scoff at 10%, but remember that the files are binary. If you have 100TB of data, 10% is 10TB. Not too bad. If you have 1PB, then 10% is 100TB, which is quite a bit of space.
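If you want to run the same kind of comparison on your own data, a small loop over the compressors and block sizes is all it takes. The sketch below assumes all four compressors are enabled in your squashfs-tools build and uses hypothetical output file names:

#!/bin/bash
# Sketch: build one image per compressor/block size combination and time each run.
SRC=/home/laytonjb/20170502
OUT=/home/laytonjb/squashfs

for comp in gzip lzo xz lzma; do
    for bsize in 131072 1048576; do
        img=$OUT/20170502-${comp}-${bsize}.sqsh
        rm -f "$img"                      # start fresh each time
        echo "== $comp, block size $bsize =="
        time mksquashfs "$SRC" "$img" -comp "$comp" -b "$bsize" -no-progress
    done
done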
Summary
Even though data storage has gotten inexpensive, data consumption grows at a faster rate than storage. I don't think I've ever heard anyone ask for less storage space. Finding ways to reduce the amount of data is a key function in the life of an HPC administrator.
One way to conserve space is to compress data that is not used very often. Although you can do this on a file-by-file basis, a better way is to collect all of the data into a single directory and create a compressed filesystem image. SquashFS is probably the best tool for the job, because it's very easy to use and comes with virtually every Linux distribution out there. Give it a try; you won't be disappointed.
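To get you started, the whole workflow boils down to a handful of commands. This is just a sketch with made-up paths; only remove the original directory after you have convinced yourself that the mounted copy is complete:

$ mksquashfs /data/old_project /data/images/old_project.sqsh -comp xz
$ sudo mkdir -p /mnt/old_project
$ sudo mount -t squashfs /data/images/old_project.sqsh /mnt/old_project -o loop
$ diff -r /data/old_project /mnt/old_project && rm -rf /data/old_project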
Infos
- SquashFS: http://squashfs.sourceforge.net/
- Gzip: http://www.gzip.org/
- LZMA: https://en.wikipedia.org/wiki/Lempel-Ziv-Markov_chain_algorithm
- LZO: https://en.wikipedia.org/wiki/Lempel-Ziv-Oberhumer
- xz: https://en.wikipedia.org/wiki/Xz
- SquashFS blocks: https://lwn.net/Articles/305083/
- SquashFS caches: https://www.mjmwired.net/kernel/Documentation/filesystems/squashfs.txt