S3QL Filesystem for HPC Storage
Many HPC sites with petabytes of data need some sort of backup solution. Among the many candidates, cloud storage is a serious contender. In this article, we look at one solution with some serious advantages: S3QL.
Cloud computing provides computing resources and storage that can be accessed by any system connected to the Internet. Generally, this is in the form of virtual machines (VMs) and storage both in the VMs (direct attached) and over the network (typically block devices formatted and exported with NFS). Typically, users configure the VMs with their operating system and applications of choice, possibly configure some sort of storage, and start running. When they are done, the data is copied back to permanent storage (perhaps local).
However, this is not the only way to utilize cloud resources. In this article, I want to focus on the storage aspect of the cloud, which can be used for backups and duplicates of data on a large or small scale.
Amazon S3 Storage
Although several cloud storage options are available, I’m going to focus on Amazon S3 because, arguably, they are the one-thousand-pound gorilla in cloud storage. The exact details of S3 are not public, but you can think of it as an object-based storage system. To begin using S3, you create one or more buckets. Each bucket holds objects, and there is no limit on the number of objects per bucket. Each object is a file and any associated metadata (e.g., ACLs). Currently, each object can be up to 5 terabytes (TB) in size, accompanied by up to 2KB of metadata. However, S3 accepts at most 5GB in a single write operation, so files larger than 5GB must be uploaded in multiple pieces (a multipart upload). You typically interact with the objects (files) with a few simple commands: write (PUT), read (GET), or delete (DELETE).
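To get a feel for what the 5GB single-write limit means in practice, here is a minimal Python sketch of how a client might carve a large file into parts for a multipart upload. The function name and part size are my own illustration; real clients such as the AWS CLI handle this for you automatically.

```python
import os

def split_into_parts(path, part_size=5 * 1024**3):
    """Yield (part_number, offset, length) tuples covering the file.

    S3 requires uploads larger than 5GB to use the multipart API,
    so a client splits the file into parts no bigger than part_size.
    """
    total = os.path.getsize(path)
    part_number = 1
    offset = 0
    while offset < total:
        length = min(part_size, total - offset)
        yield (part_number, offset, length)
        part_number += 1
        offset += length

# For example, a 12GiB file split against the 5GiB single-write limit
# would need three parts: 5GiB, 5GiB, and 2GiB.
```

Each part can then be sent as its own PUT, and S3 reassembles the object once all parts arrive.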
The usual method of interacting with S3 is via a web console. Amazon has also made a command-line interface (CLI) available that can be used to interact with S3. The basic command begins with aws s3; then, you can use the common Linux commands of cp, mv, and ls, as well as rm to remove objects (files) within a specified bucket. For example
$ aws s3 ls s3://mybucket
lists the objects in your bucket. You can also use a sync command to synchronize the objects in a bucket with a local directory. You can find S3 command documentation online.
Several open source projects, such as s3cmd, are very similar to Amazon’s CLI tools. With s3cmd, you can copy and move files and sync directories with your S3 bucket, and you can use s3sync to sync to S3 (somewhat like rsync).
If you are running Windows these Amazon S3 clients are available for backing up data to Amazon S3:
- Amazon S3 Transfer Engine is a free tool
- TntDrive has a free trial
- S3 Browser is a free tool
- CloudBerry Explorer for Amazon S3 has a free version and a Pro version
- Bucket Explorer has a free trial period, but it is a commercial product
You can also find smartphone apps for interacting with S3, and you can even interact with S3 in your Firefox browser using S3Fox Organizer.
The Linux world also has tools for treating Amazon S3 as an rsync target. The best-known tool is probably boto_rsync, a Python tool that provides rsync-like syncing from a local directory to an S3 bucket; S3sync and Duplicity are alternatives.
A number of Amazon SDK libraries can be used to build S3 applications, and S3 has an API that describes how to interact with the bucket(s) and objects in the bucket. The S3 API Quick Reference Card illustrates the functions used to interact with an object storage system.
- Bucket operations
- PUT <bucket>
- GET <bucket>
- GET <bucket location>
- DELETE <bucket>
- Object operations
- GET <object>
- PUT <object>
- COPY <object>
- HEAD <object>
- DELETE <object>
- POST <object>
Basically, you can write the object (PUT), read the object (GET), delete the object (DELETE), and retrieve information about the object (HEAD). This is far simpler than POSIX, which has a long, long laundry list of functions for interacting with files.
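The simplicity of that interface is easy to see if you sketch it in code. This toy Python class is purely illustrative (it is not any real S3 client library); it models a bucket as a dictionary and implements the four core operations:

```python
class ToyBucket:
    """Toy in-memory model of an S3-style bucket: four operations, no POSIX."""

    def __init__(self):
        self._objects = {}  # key -> (data, metadata)

    def put(self, key, data, metadata=None):
        # PUT writes (or overwrites) an object and its metadata
        self._objects[key] = (data, metadata or {})

    def get(self, key):
        # GET returns the object data
        return self._objects[key][0]

    def head(self, key):
        # HEAD returns only the metadata, not the object data
        return self._objects[key][1]

    def delete(self, key):
        # DELETE removes the object entirely
        del self._objects[key]

bucket = ToyBucket()
bucket.put("notes.txt", b"hello", {"content-type": "text/plain"})
```

Compare that handful of methods with the dozens of POSIX calls (open, read, write, seek, stat, chmod, and so on) a real filesystem must support.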
Several projects have taken the object storage paradigm and mapped it to a conventional filesystem by using FUSE, so the filesystems run in userspace. Confusingly, several of these independent projects share the name “s3fs,” but they are different from each other. Overall, it is really interesting how developers have taken a very simple storage solution with basically three commands and mapped it to a classic filesystem. This approach has wonderful potential, because you can now treat object storage as a regular filesystem, allowing you to use the regular tools you already use for backup and replication.
Object storage has been around for a while, but I argue that Amazon S3 popularized it. Subsequent object storage solutions use the same concepts of PUT, GET, and DELETE and have different implementations of them, as well as other features. While the Amazon cloud is large – five times larger than the next 14 cloud providers – cloud storage is growing rapidly with many providers. Wouldn’t it be nice to be able to use a different or several cloud storage providers from the same backup or replication tool?
One complicating factor is that you have to assume data stored in the cloud can be accessed by third parties, because it lives on someone else’s hardware. Therefore, you need to start thinking about encrypting your data at rest as well as encrypting the data transmission to your storage back end. Encryption of data should be a non-optional part of your data plans.
S3QL
One of the most interesting S3 backup/filesystem tools is s3ql. It creates a POSIX-like filesystem in user space (FUSE) using object storage or other storage as the target. Its major features are:
- Encryption: Thanks to Mr. Snowden, we now know that the US government has access to far more of our data and communications than we realized. Therefore, if I’m going to back up my data to cloud storage, I want to make sure the data is encrypted. S3QL encrypts all data using a 256-bit AES key. An additional SHA-256 HMAC checksum protects the data from manipulation.
- Compression: S3QL compresses the data before storing, using either LZMA, bzip2, or gzip. This compression takes place before the data is encrypted.
- De-duplication: The data to be stored can be de-duplicated if the files have identical content. De-duplication works across all files stored in the filesystem, and it also works if only some parts of the files are identical (i.e., not the entire file).
- Copy-on-Write/Snapshotting: S3QL can duplicate directory trees without the use of any additional storage space (target storage). If one of the copies is modified, then only the part of the data that has been modified will take up additional storage space. You can use this capability to create intelligent snapshots that preserve the state of a directory at different points in time using a minimum amount of space.
- Dynamic Size: The size of an S3QL filesystem can grow or shrink dynamically as required.
- Transparency: An S3QL filesystem behaves like a local filesystem, in that it supports hard links, symlinks, typical permissions, extended attributes (xattr) and file sizes up to 2TB.
- Immutable trees: With S3QL, directory trees can be made immutable so that their contents can’t be changed in any way whatsoever.
- Range of Back Ends: S3QL has a range of back ends (what I call targets, but which S3QL calls storage back ends), such as Google storage, Amazon S3, Amazon Reduced Redundancy Storage (RRS), OpenStack Swift, Rackspace Cloud Files, S3-compatible targets, local filesystems, and even filesystems accessed using sshfs, which are treated as local filesystems.
- Caching: You can configure a cache for S3QL to improve apparent performance, perhaps taking advantage of a local SSD for data caching.
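The first three features above fit together as a pipeline: blocks are de-duplicated by content hash, compressed, and only then encrypted. Here is a rough stdlib-only Python sketch of that ordering. The key, block size, and store layout are my own illustration, and an HMAC-SHA256 tag stands in for the integrity check; S3QL additionally encrypts with 256-bit AES, which the Python standard library does not provide.

```python
import hashlib
import hmac
import lzma

SECRET_KEY = b"example-key"  # hypothetical; S3QL derives real keys from your passphrase

store = {}  # content hash -> (compressed block, integrity tag)

def store_block(data):
    """De-duplicate by SHA-256 content hash, compress before storing,
    and attach an HMAC-SHA256 tag so manipulation can be detected."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:  # identical content is stored only once
        compressed = lzma.compress(data)  # compression happens before encryption
        tag = hmac.new(SECRET_KEY, compressed, hashlib.sha256).digest()
        store[digest] = (compressed, tag)
    return digest

key1 = store_block(b"A" * 1000)
key2 = store_block(b"A" * 1000)  # duplicate content: same key, no new block
assert key1 == key2 and len(store) == 1
```

Compressing before encrypting matters because well-encrypted data looks random and no longer compresses; reversing the order would throw the compression savings away.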
S3QL has other aspects that I really like, and I’ll mention them throughout the remainder of the article.
Building and Installing S3QL
S3QL is primarily written in Python and comes in two versions: one for Python 2.7 (an older version of s3ql) and one for Python 3.3 (the more modern version). I chose to use the Python 2.7 version because s3ql has several dependencies, and building it from scratch might not be the easiest thing to do. You need to pay careful attention to the Installation Instructions. In my case, I followed the installation instructions for CentOS on my CentOS 6.4 system. The instructions are very good and very accurate – be sure you read all of them before trying to install s3ql. Briefly, I’ll mention some highlights of the installation that I hope will help you.
To begin, turn off SELinux and reboot. You can find instructions on how to do this all over the web, but be sure the instructions are compatible with the security of your systems. After installing the dependencies, I followed the instructions for installing Python 2.7 from the PUIAS repo. In case you are wondering, PUIAS is a project of the members of Princeton University and the Institute for Advanced Study (IAS). Although these credentials don’t make the site automatically safe, I did a little checking, and I think their repo is very safe. Conveniently, Python 2.7 was installed alongside my existing Python. PUIAS even produces a distro called Springdale that might be worth checking out. (It looks to be based on CentOS 6.4.)
Next, I installed SQLite from the Atomicorp repo. I always like to check out repos that people suggest before I use them, and in this case I was satisfied enough to continue. The instructions had me install SQLite version 3.7.9: It is very important to remember this version number.
The next step was to install Python APSW, where APSW stands for “Another Python SQLite Wrapper.” You have to download the version of APSW that matches the version of SQLite you installed (see the previous step). This is critical because I tried several different versions of APSW that didn’t match the version of SQLite I installed, and the s3ql build always failed.
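Before building APSW, it’s worth confirming which SQLite library your Python installation actually sees; a quick check with the standard sqlite3 module looks like this (once APSW is installed, it also reports its own linked SQLite version via apsw.sqlitelibversion()):

```python
import sqlite3

# The sqlite3 module reports the version of the SQLite library it links
# against. The APSW build must match the SQLite you installed (3.7.9 in
# my case), so comparing version strings up front saves failed builds.
print("SQLite library version:", sqlite3.sqlite_version)
print("Parsed version tuple:", sqlite3.sqlite_version_info)
```

If the reported version does not match the APSW tarball you downloaded, grab the matching APSW release before attempting the s3ql build.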
After installing a few more Python packages, I built and installed S3QL itself. It was very easy to build, but when I ran the “test” as part of the installation (python2.7 setup.py test), I got an assertion error. I have no idea if this is expected behavior or not, but after testing S3QL for a few days, I didn’t run into any unexpected behavior. When you install S3QL, be sure to pay attention to where the various pieces are installed. They should be part of the standard path, but it’s always good to document where things are located.
Using S3QL
As an exercise, I decided to use S3QL against a local filesystem on a spare drive I had in my system (a simple ext4 filesystem on a drive mounted as /mnt/data1). You should definitely read through the User’s Guide, which explains the steps for using S3QL and has some good ideas and tips for its use. Be sure to read the section titled “Important Rules to Avoid Losing Data” before using S3QL.
The first step is to build the S3QL filesystem on the mounted filesystem:
[root@home4 ~]# mkfs.s3ql local:///mnt/data1
Before using S3QL, make sure to read the user's guide, especially
the 'Important Rules to Avoid Loosing Data' section.
Enter encryption password:
Confirm encryption password:
Generating random encryption key...
Creating metadata tables...
Dumping metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Compressing and uploading metadata...
Wrote 0.00 MiB of compressed metadata.
First, notice that I created the filesystem as root. Second, the prefix for the local filesystem is local://. You need to include both forward slashes in this prefix. The path after the prefix is /mnt/data1, where you need to include the first forward slash (absolute path). Thus, you get the local filesystem as local:///mnt/data1.
Also notice that during filesystem creation, you will be prompted to enter the passphrase for encryption. Do not lose this passphrase, or you won’t be able to decrypt your data. At the same time, don’t do anything silly like writing it down on a note next to your monitor. I’m not a security expert, but use a fairly long passphrase that is easy for you to remember, and avoid easy-to-find phrases or dictionary words.
After creating an S3QL filesystem, I then checked the mountpoint, /mnt/data1:
[root@home4 ~]# ls -lstar /mnt/data1
total 36
16 drwx------  2 root root 16384 Nov 10 10:00 lost+found
 4 drwxr-xr-x. 3 root root  4096 Nov 10 10:00 ..
 4 -rw-r--r--  1 root root   294 Nov 10 10:07 s3ql_passphrase
 4 -rw-r--r--  1 root root   243 Nov 10 10:07 s3ql_seq_no_1
 4 -rw-r--r--  1 root root   556 Nov 10 10:07 s3ql_metadata
 4 drwxr-xr-x  3 root root  4096 Nov 10 10:07 .
A few files are created as a result of the S3QL filesystem creation process. You can see these files in the listing.
Now, I’m ready to mount the filesystem using the mount.s3ql command at /mnt/s3ql:
[root@home4 ~]# mount.s3ql local:///mnt/data1 /mnt/s3ql
Using 10 upload threads.
Enter file system encryption passphrase:
Using cached metadata.
Mounting filesystem...
[root@home4 data1]# ls -lstar /mnt/s3ql
total 0
0 drwx------ 1 root root 0 Nov 10 10:07 lost+found
While mounting the filesystem, I was asked for my passphrase. Notice that I then checked the mountpoint to see what was there. Because the lost+found directory was present, I was confident that the filesystem had been mounted properly. Just to be sure, I checked using the mount command:
[root@home4 ~]$ mount
/dev/sda1 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/md0 on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdb1 on /mnt/data1 type ext4 (rw)
local:///mnt/data1 on /mnt/s3ql type fuse.s3ql (rw,nosuid,nodev)
Notice the last line in the output. The S3QL filesystem is mounted. I also checked the local mountpoint, /mnt/data1, to see if anything had changed:
[root@home4 ~]# ls -lstar /mnt/data1
total 40
16 drwx------  2 root root 16384 Nov 10 10:00 lost+found
 4 -rw-r--r--  1 root root   294 Nov 10 10:07 s3ql_passphrase
 4 -rw-r--r--  1 root root   243 Nov 10 10:07 s3ql_seq_no_1
 4 -rw-r--r--  1 root root   556 Nov 10 10:07 s3ql_metadata
 4 drwxr-xr-x. 4 root root  4096 Nov 10 10:08 ..
 4 -rw-r--r--  1 root root   262 Nov 10 10:10 s3ql_seq_no_2
 4 drwxr-xr-x  3 root root  4096 Nov 10 10:10 .
After I mount the filesystem, a new sequence number file (s3ql_seq_no_2) appears at the local mountpoint.
To better understand how I might use S3QL, I decided to simulate making a copy of a user’s subdirectory (replication). Since S3QL is mounted like a local filesystem, I could use the cp command to copy the data. The user is laytonjb (nothing like experimenting on yourself); I decided to copy my Documents subdirectory to the S3QL filesystem:
[root@home4 ~]# mkdir /mnt/s3ql/laytonjb
[root@home4 ~]# cp -r /home/laytonjb/Documents/ /mnt/s3ql/laytonjb/
[root@home4 ~]# df -h
Filesystem          Size  Used Avail Use% Mounted on
/dev/sda1           109G   17G   87G  16% /
tmpfs                16G  596K   16G   1% /dev/shm
/dev/md0            2.7T  192G  2.4T   8% /home
/dev/sdb1           111G  1.4G  104G   2% /mnt/data1
local:///mnt/data1  1.0T  1.7G 1023G   1% /mnt/s3ql
The first command creates a subdirectory for the user (remember, it’s a local filesystem, so I can use the mkdir command). Then, I did a recursive copy to the mounted filesystem. (I probably should have used the -p option to preserve the time stamp and ownership information, but this was just a test.)
The copy took a little bit of time (remember it is compressed, de-duped, and encrypted – the holy trinity of data management). The original directory contained about 2.2GB of data, and after the copy, it looked like it was using roughly 1.7GB in S3QL. When I checked the directory listing, I saw all of my original files (a portion of this listing is below):
[root@home4 ~]# ls -lstar /mnt/s3ql/laytonjb/Documents/
total 897781
5265 -rw-r--r-- 1 root root 5390660 Nov 10 10:25 PGAS Languages NCSA 2009.pdf
  65 -rw-r--r-- 1 root root   65893 Nov 10 10:25 cavp-03.pdf
   0 drwxr-xr-x 1 root root       0 Nov 10 10:28 FEATURES
4173 -rw-r--r-- 1 root root 4272584 Nov 10 10:28 KonigesA-2.pdf
   0 drwxr-xr-x 1 root root       0 Nov 10 10:28 STORAGE088_2
7984 -rw-r--r-- 1 root root 8175616 Nov 10 10:28 CFD_SC07.ppt
   0 drwxr-xr-x 1 root root       0 Nov 10 10:28 COLLECTL
1262 -rw-r--r-- 1 root root 1291267 Nov 10 10:28 00354107.pdf
 248 -rw-r--r-- 1 root root  253651 Nov 10 10:28 intro.pdf
  88 -rw------- 1 root root   89206 Nov 10 10:28 PV2009_601.pdf
1653 -rw------- 1 root root 1692410 Nov 10 10:28 aiaa.2006.0107.pdf
3466 -rw------- 1 root root 3548623 Nov 10 10:28 Kirby-14.pdf
 485 -rw-r--r-- 1 root root  495791 Nov 10 10:28 aiaa-2007-4581.pdf
Because I didn’t use the -p option, the owner and group are changed to root and the time is changed to when the data was copied. It is interesting to look at the local filesystem mountpoint, /mnt/data1:
[root@home4 ~]# ls -sltar /mnt/data1
total 104
16 drwx------    2 root root 16384 Nov 10 10:00 lost+found
 4 -rw-r--r--    1 root root   294 Nov 10 10:07 s3ql_passphrase
 4 -rw-r--r--    1 root root   243 Nov 10 10:07 s3ql_seq_no_1
 4 drwxr-xr-x.   4 root root  4096 Nov 10 10:08 ..
 4 -rw-r--r--    1 root root   262 Nov 10 10:10 s3ql_seq_no_2
 4 -rw-r--r--    1 root root   556 Nov 10 10:15 s3ql_metadata_bak_0
 4 -rw-r--r--    1 root root   602 Nov 10 10:15 s3ql_metadata
 4 -rw-r--r--    1 root root   262 Nov 10 10:23 s3ql_seq_no_3
 4 drwxr-xr-x    4 root root  4096 Nov 10 10:25 .
56 drwxr-xr-x  831 root root 57344 Nov 10 10:31 s3ql_data_
A few new files and a data directory are created as a result of the data copy, and I’ll mention a few quick things about this in a moment.
S3QL includes a number of very useful tools. One of them, s3qlstat, can provide some really cool information about what is stored in S3QL:
[root@home4 laytonjb]# s3qlstat /mnt/s3ql
Directory entries:    20135
Inodes:               20137
Data blocks:          9283
Total data size:      2171.80 MiB
After de-duplication: 1696.35 MiB (78.11% of total)
After compression:    1216.23 MiB (56.00% of total, 71.70% of de-duplicated)
Database size:        3.81 MiB (uncompressed)
(some values do not take into account not-yet-uploaded dirty blocks in cache)
The output contains all kinds of useful information; for example, there are 20,135 directory entries using 20,137 inodes and 9,283 data blocks. One piece of information I’m interested in is that the original data, before de-duplication, compression, and encryption, is 2,171.80MiB. After de-duplication, that drops to 1,696.35MiB (a space savings of about 22%) and, when compressed, to about 1,216.23MiB (56% of the original total, or 71.7% of the de-duped total). It also shows the database size. I personally love this command for learning about the status of my replicated data.
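You can verify the arithmetic in the s3qlstat output yourself; a few lines of Python reproduce the reported percentages from the sizes:

```python
# Sizes in MiB, taken from the s3qlstat output above
total = 2171.80       # original data size
deduped = 1696.35     # after de-duplication
compressed = 1216.23  # after compression

print(f"De-duplicated:  {deduped / total:.2%} of total")          # ~78.11%
print(f"De-dup savings: {1 - deduped / total:.2%}")               # ~22%
print(f"Compressed:     {compressed / total:.2%} of total")       # ~56%
print(f"Compressed:     {compressed / deduped:.2%} of de-duped")  # ~71.7%
```

The numbers line up with what s3qlstat reports, which is a nice sanity check that the tool's ratios are computed against the original and de-duplicated sizes, respectively.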
Unmounting the S3QL filesystem is easy, but you need to use the umount.s3ql command, not the OS umount command. Please note that umount.s3ql has to flush the cache, so it won’t necessarily return quickly (it depends on how much data is in the cache):
[root@home4 Documents]# umount.s3ql /mnt/s3ql/
[root@home4 Documents]# ls -lstar /mnt/data1
total 696
 16 drwx------    2 root root  16384 Nov 10 10:00 lost+found
  4 -rw-r--r--    1 root root    294 Nov 10 10:07 s3ql_passphrase
  4 -rw-r--r--    1 root root    243 Nov 10 10:07 s3ql_seq_no_1
  4 drwxr-xr-x.   4 root root   4096 Nov 10 10:08 ..
  4 -rw-r--r--    1 root root    262 Nov 10 10:10 s3ql_seq_no_2
  4 -rw-r--r--    1 root root    556 Nov 10 10:15 s3ql_metadata_bak_1
  4 -rw-r--r--    1 root root    262 Nov 10 10:23 s3ql_seq_no_3
 56 drwxr-xr-x  831 root root  57344 Nov 10 10:31 s3ql_data_
592 -rw-r--r--    1 root root 603068 Nov 10 15:10 s3ql_metadata
  4 -rw-r--r--    1 root root    602 Nov 10 15:10 s3ql_metadata_bak_0
  4 drwxr-xr-x    4 root root   4096 Nov 10 15:10 .
[root@home4 Documents]# df -h
Filesystem          Size  Used Avail Use% Mounted on
/dev/sda1           109G   17G   87G  16% /
tmpfs                16G  596K   16G   1% /dev/shm
/dev/md0            2.7T  192G  2.4T   8% /home
/dev/sdb1           111G  1.5G  104G   2% /mnt/data1
Notice that after I unmounted the S3QL filesystem, I examined the local filesystem mountpoint. I couldn’t see anything useful in the files, but I decided to look at the directory /mnt/data1/s3ql_data_ to see what was there, since it looked interesting:
[root@home4 Documents]# ls -s /mnt/data1/s3ql_data_/
total 167780
4 100   4 466   4 832    12 s3ql_data_341     4 s3ql_data_671
4 101   4 467   4 833    36 s3ql_data_342     8 s3ql_data_672
4 102   4 468   4 834    44 s3ql_data_343    12 s3ql_data_673
4 103   4 469   4 835    16 s3ql_data_344   276 s3ql_data_674
4 104   4 470   4 836    56 s3ql_data_345   472 s3ql_data_675
4 105   4 471   4 837     4 s3ql_data_346  2216 s3ql_data_676
The output was very long, so I truncated it. A bunch of directories and files were binary; using more or cat didn’t tell me anything (it was gibberish), so it looks like the encryption and compression worked.
There are a number of really interesting things about S3QL to explore, such as its caching capability (by default, a directory named ~/.s3ql in the user’s home directory). This cache can be tuned or even relocated somewhere else.
You can explore using other storage back ends. For example, setting up an Amazon S3 account isn’t too difficult, and if you only back up a little bit of data, it’s easy to afford when you are just learning. Be sure to read the section on account authentication and how you use your S3 login and password (or any storage back end requiring authentication).
Another option for a storage back end is one that doesn’t even have to be exported via a protocol such as NFS. If you can access a system associated with the storage via ssh, you can use sshfs to access the S3QL file storage. (Read the section on how to use sshfs to mount storage on your client.) Once you’ve accessed the system, you can create an S3QL filesystem and use it like any other back-end storage.
A point that may be lost on people is that S3QL is a filesystem like any other, local or network based. You can use it as a backup target for something like rsync, or with rsnapshot or RIBS, which both use rsync. S3QL just becomes the “rsync target,” then you can use one of the S3QL back ends as you want for backups or data replication (disaster recovery). You can also use your favorite file-based backup or replication tool.
Summary
Increasing amounts of data are pushing the need for backups or data replication. With petabyte storage systems becoming very common, particularly for HPC systems, it can be difficult to have enough on-site hardware for backup or replication operations. Why not use cloud storage for this? However, using cloud storage typically means using object-based storage, so how do you use the PUT, GET, DELETE, HEAD commands to make copies or backups?
S3QL, one of the tools for object-based storage for backups or rsync (replication), has huge potential. It has one of the most important features in today’s climate, encryption. Additionally, it has compression and de-dup capabilities, as well as dynamic sizing. One of the really cool aspects of S3QL is its several back-end storage options: Amazon S3, Rackspace Cloud Files, OpenStack Swift, and Google Storage, as well as S3-compatible targets and local filesystems.
Don’t forget that S3QL behaves just like a filesystem, so you can use the classic tools against it, including backup or replication tools such as rsync. Give S3QL a whirl – it has some really cool features.