S3QL filesystem for cloud backups

Cloud Storage

Using S3QL

You should definitely read through the User's Guide [25], which explains the steps for using S3QL and offers some good tips for its use. Be sure to read the section titled "Important Rules to Avoid Losing Data" before using S3QL. As an exercise, I decided to use S3QL against a local filesystem on a spare drive in my system (a simple ext4 filesystem on a drive mounted as /mnt/data1).

The first step is to build the S3QL filesystem on the mounted filesystem (Listing 1). First, notice that I created the filesystem as root. Second, the prefix for a local filesystem is local://; you need to include both forward slashes. The path after the prefix is /mnt/data1, which must include the leading forward slash (an absolute path), so the full storage URL is local:///mnt/data1. Also notice that during filesystem creation, you are prompted for an encryption passphrase. Do not lose this passphrase, or you won't be able to decrypt your data; at the same time, don't do anything silly like writing it down. I'm not a security expert, but you should use a fairly long passphrase that you can remember, while avoiding easy-to-find phrases or dictionary words.

Listing 1

Build S3QL Filesystem

01 [root@home4 ~]# mkfs.s3ql local:///mnt/data1
02 Before using S3QL, make sure to read the user's guide, especially
03 the 'Important Rules to Avoid Loosing Data' section.
04 Enter encryption password:
05 Confirm encryption password:
06 Generating random encryption key...
07 Creating metadata tables...
08 Dumping metadata...
09 ..objects..
10 ..blocks..
11 ..inodes..
12 ..inode_blocks..
13 ..symlink_targets..
14 ..names..
15 ..contents..
16 ..ext_attributes..
17 Compressing and uploading metadata...
18 Wrote 0.00 MiB of compressed metadata.
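If you would rather not type the passphrase interactively at every mount, S3QL can read it from an authinfo2 file (see the authentication documentation [26]). The sketch below is a hypothetical example: the section name and passphrase are made up, and a temporary file stands in for ~/.s3ql/authinfo2 so the sketch runs anywhere; check the documentation for the exact format your S3QL version expects.

```shell
# Hypothetical sketch of an authinfo2 credentials file for S3QL.
# A temp file stands in for ~/.s3ql/authinfo2; the section name and
# passphrase below are invented for illustration.
AUTHINFO="$(mktemp)"
cat > "$AUTHINFO" <<'EOF'
[local]
storage-url: local:///mnt/data1
fs-passphrase: a-long-passphrase-you-can-remember
EOF
chmod 600 "$AUTHINFO"                   # credentials must not be world readable
grep -c 'fs-passphrase' "$AUTHINFO"     # sanity check: one passphrase entry
```

With the real file in place under ~/.s3ql/, mount.s3ql can pick up the passphrase without prompting.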

After creating the S3QL filesystem, I checked the mountpoint, /mnt/data1 (Listing 2). A few files are created as a result of the S3QL filesystem creation process; you can see them in the listing.

Listing 2

Check Mountpoint

01 [root@home4 ~]# ls -lstar /mnt/data1
02 total 36
03 16 drwx------  2 root root 16384 Nov 10 10:00 lost+found
04  4 drwxr-xr-x. 3 root root  4096 Nov 10 10:00 ..
05  4 -rw-r--r--  1 root root   294 Nov 10 10:07 s3ql_passphrase
06  4 -rw-r--r--  1 root root   243 Nov 10 10:07 s3ql_seq_no_1
07  4 -rw-r--r--  1 root root   556 Nov 10 10:07 s3ql_metadata
08  4 drwxr-xr-x  3 root root  4096 Nov 10 10:07 .

Now, I'm ready to mount the filesystem at /mnt/s3ql with the mount.s3ql command (Listing 3). While mounting the filesystem, I was asked for my passphrase. Notice that I checked the mountpoint to see what was there; because lost+found was present, I was confident that the filesystem had mounted properly. Just to be sure, I checked with the mount command (Listing 4). Notice the last line in the output: the S3QL filesystem is mounted. I also checked the local mountpoint, /mnt/data1, to see if anything had changed (Listing 5).

Listing 3

Mount S3QL Filesystem

01 [root@home4 ~]# mount.s3ql local:///mnt/data1 /mnt/s3ql
02 Using 10 upload threads.
03 Enter file system encryption passphrase:
04 Using cached metadata.
05 Mounting filesystem...
06 [root@home4 data1]# ls -lstar /mnt/s3ql
07 total 0
08 0 drwx------ 1 root root 0 Nov 10 10:07 lost+found

Listing 4

Check Mounted Filesystem

01 [root@home4 ~]$ mount
02 /dev/sda1 on / type ext4 (rw)
03 proc on /proc type proc (rw)
04 sysfs on /sys type sysfs (rw)
05 devpts on /dev/pts type devpts (rw,gid=5,mode=620)
06 tmpfs on /dev/shm type tmpfs (rw)
07 /dev/md0 on /home type ext4 (rw)
08 none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
09 sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10 /dev/sdb1 on /mnt/data1 type ext4 (rw)
11 local:///mnt/data1 on /mnt/s3ql type fuse.s3ql (rw,nosuid,nodev)

Listing 5

Recheck Mountpoint

01 [root@home4 ~]# ls -lstar /mnt/data1
02 total 40
03 16 drwx------  2 root root 16384 Nov 10 10:00 lost+found
04  4 -rw-r--r--  1 root root   294 Nov 10 10:07 s3ql_passphrase
05  4 -rw-r--r--  1 root root   243 Nov 10 10:07 s3ql_seq_no_1
06  4 -rw-r--r--  1 root root   556 Nov 10 10:07 s3ql_metadata
07  4 drwxr-xr-x. 4 root root  4096 Nov 10 10:08 ..
08  4 -rw-r--r--  1 root root   262 Nov 10 10:10 s3ql_seq_no_2
09  4 drwxr-xr-x  3 root root  4096 Nov 10 10:10 .

After mounting the filesystem, a new sequence file appears in /mnt/data1. To better understand how I might use S3QL, I decided to simulate making a copy of a user's subdirectory (replication). Because S3QL is mounted like a local filesystem, I could use the cp command to copy the data. The user is laytonjb; I decided to copy my Documents subdirectory to the S3QL filesystem (Listing 6).

Listing 6

Replication

01 [root@home4 ~]# mkdir /mnt/s3ql/laytonjb
02 [root@home4 ~]# cp -r /home/laytonjb/Documents/ /mnt/s3ql/laytonjb/
03 [root@home4 ~]# df -h
04 Filesystem            Size  Used Avail Use% Mounted on
05 /dev/sda1             109G   17G   87G  16% /
06 tmpfs                  16G  596K   16G   1% /dev/shm
07 /dev/md0              2.7T  192G  2.4T   8% /home
08 /dev/sdb1             111G  1.4G  104G   2% /mnt/data1
09 local:///mnt/data1    1.0T  1.7G 1023G   1% /mnt/s3ql

The first command creates a subdirectory for the user (remember, it behaves like a local filesystem, so I can use the mkdir command). Then, I did a recursive copy to the mounted filesystem. (I probably should have used the -p option to preserve timestamps and ownership information, but this was just a test.)
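The difference the -p option makes can be sketched with ordinary files; temporary directories stand in for the article's paths here, so the sketch runs anywhere.

```shell
# Sketch of what cp -p preserves: a plain cp resets the modification time,
# while cp -p carries it over. Temp dirs stand in for the real paths.
SRC_DIR="$(mktemp -d)"; DST_DIR="$(mktemp -d)"
echo "hello" > "$SRC_DIR/file"
touch -t 202001010000 "$SRC_DIR/file"       # backdate the source file
cp    "$SRC_DIR/file" "$DST_DIR/plain"      # plain copy: mtime becomes "now"
cp -p "$SRC_DIR/file" "$DST_DIR/kept"       # -p: mtime is preserved
[ "$DST_DIR/plain" -nt "$SRC_DIR/file" ] && echo "plain copy got a new mtime"
[ "$DST_DIR/kept"  -nt "$SRC_DIR/file" ] || echo "cp -p preserved the mtime"
```

For a full replication run, cp -a (archive mode) additionally preserves ownership and copies recursively.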

The copy took a little bit of time (remember it is compressed, de-duped, and encrypted – the holy trinity of data management). The original directory contained about 2.2GB of data, and after the copy, it looked like it was using roughly 1.7GB in S3QL. When I checked the directory listing, I saw all of my original files (a portion of this output is shown in Listing 7). Because I didn't use the -p option, the owner and group are changed to root, and the time is changed to when the data was copied.

Listing 7

Check Replication

01 [root@home4 ~]# ls -lstar /mnt/s3ql/laytonjb/Documents/
02 total 897781
03   5265 -rw-r--r-- 1 root root   5390660 Nov 10 10:25 PGAS Languages NCSA 2009.pdf
04     65 -rw-r--r-- 1 root root     65893 Nov 10 10:25 cavp-03.pdf
05      0 drwxr-xr-x 1 root root         0 Nov 10 10:28 FEATURES
06   4173 -rw-r--r-- 1 root root   4272584 Nov 10 10:28 KonigesA-2.pdf
07      0 drwxr-xr-x 1 root root         0 Nov 10 10:28 STORAGE088_2
08   7984 -rw-r--r-- 1 root root   8175616 Nov 10 10:28 CFD_SC07.ppt
09      0 drwxr-xr-x 1 root root         0 Nov 10 10:28 COLLECTL
10   1262 -rw-r--r-- 1 root root   1291267 Nov 10 10:28 00354107.pdf
11    248 -rw-r--r-- 1 root root    253651 Nov 10 10:28 intro.pdf
12     88 -rw------- 1 root root     89206 Nov 10 10:28 PV2009_601.pdf
13   1653 -rw------- 1 root root   1692410 Nov 10 10:28 aiaa.2006.0107.pdf
14   3466 -rw------- 1 root root   3548623 Nov 10 10:28 Kirby-14.pdf
15    485 -rw-r--r-- 1 root root    495791 Nov 10 10:28 aiaa-2007-4581.pdf
16 ...

It is interesting to look at the local filesystem mountpoint, /mnt/data1 (Listing 8). A few new files and the s3ql_data_ directory are created as a result of the data copy, and I'll mention a few quick things about this in a moment.

Listing 8

Local Mountpoint

01 [root@home4 ~]# ls -sltar /mnt/data1
02 total 104
03 16 drwx------    2 root root 16384 Nov 10 10:00 lost+found
04  4 -rw-r--r--    1 root root   294 Nov 10 10:07 s3ql_passphrase
05  4 -rw-r--r--    1 root root   243 Nov 10 10:07 s3ql_seq_no_1
06  4 drwxr-xr-x.   4 root root  4096 Nov 10 10:08 ..
07  4 -rw-r--r--    1 root root   262 Nov 10 10:10 s3ql_seq_no_2
08  4 -rw-r--r--    1 root root   556 Nov 10 10:15 s3ql_metadata_bak_0
09  4 -rw-r--r--    1 root root   602 Nov 10 10:15 s3ql_metadata
10  4 -rw-r--r--    1 root root   262 Nov 10 10:23 s3ql_seq_no_3
11  4 drwxr-xr-x    4 root root  4096 Nov 10 10:25 .
12 56 drwxr-xr-x  831 root root 57344 Nov 10 10:31 s3ql_data_

S3QL includes a number of very useful tools. One of them, s3qlstat, provides some really useful information about what is stored in S3QL (Listing 9); for example, there are 20,135 directory entries (I assume files) that use 20,137 inodes and 9,283 data blocks.

Listing 9

s3qlstat

01 [root@home4 laytonjb]# s3qlstat /mnt/s3ql
02 Directory entries:    20135
03 Inodes:               20137
04 Data blocks:          9283
05 Total data size:      2171.80 MiB
06 After de-duplication: 1696.35 MiB (78.11% of total)
07 After compression:    1216.23 MiB (56.00% of total, 71.70% of de-duplicated)
08 Database size:        3.81 MiB (uncompressed)
09 (some values do not take into account not-yet-uploaded dirty blocks in cache)

One piece of information I'm particularly interested in: the original data, before de-duplication, compression, and encryption, is 2,171.80MiB. After de-duplication, that drops to 1,696.35MiB (a space savings of about 22%) and, after compression, to 1,216.23MiB (56% of the original total, or 71.7% of the de-duped total). The output also shows the database size. I personally love this command for checking the status of my replicated data.
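The percentages s3qlstat reports can be checked with a little arithmetic; the sizes below are copied straight from Listing 9.

```shell
# Recompute the s3qlstat ratios from the sizes reported in Listing 9.
awk 'BEGIN {
    total = 2171.80                     # Total data size (MiB)
    dedup = 1696.35                     # After de-duplication (MiB)
    comp  = 1216.23                     # After compression (MiB)
    printf "de-dup saves       %.2f%%\n", (1 - dedup / total) * 100
    printf "comp vs total      %.2f%%\n", comp / total * 100
    printf "comp vs de-duped   %.2f%%\n", comp / dedup * 100
}'
```

This reproduces the 78.11%, 56.00%, and 71.70% figures in the listing (the de-dup savings of about 22% is just 100% minus 78.11%).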

Unmounting the S3QL filesystem is easy, but you need to use the umount.s3ql command (Listing 10), not the OS umount command. Please note that umount.s3ql has to flush the cache, so it won't necessarily return quickly (how long it takes depends on how much data is in the cache).

Listing 10

Unmounting S3QL

01 [root@home4 Documents]# umount.s3ql /mnt/s3ql/
02 [root@home4 Documents]# ls -lstar /mnt/data1
03 total 696
04  16 drwx------    2 root root  16384 Nov 10 10:00 lost+found
05   4 -rw-r--r--    1 root root    294 Nov 10 10:07 s3ql_passphrase
06   4 -rw-r--r--    1 root root    243 Nov 10 10:07 s3ql_seq_no_1
07   4 drwxr-xr-x.   4 root root   4096 Nov 10 10:08 ..
08   4 -rw-r--r--    1 root root    262 Nov 10 10:10 s3ql_seq_no_2
09   4 -rw-r--r--    1 root root    556 Nov 10 10:15 s3ql_metadata_bak_1
10   4 -rw-r--r--    1 root root    262 Nov 10 10:23 s3ql_seq_no_3
11  56 drwxr-xr-x  831 root root  57344 Nov 10 10:31 s3ql_data_
12 592 -rw-r--r--    1 root root 603068 Nov 10 15:10 s3ql_metadata
13   4 -rw-r--r--    1 root root    602 Nov 10 15:10 s3ql_metadata_bak_0
14   4 drwxr-xr-x    4 root root   4096 Nov 10 15:10 .
15 [root@home4 Documents]# df -h
16 Filesystem            Size  Used Avail Use% Mounted on
17 /dev/sda1             109G   17G   87G  16% /
18 tmpfs                  16G  596K   16G   1% /dev/shm
19 /dev/md0              2.7T  192G  2.4T   8% /home
20 /dev/sdb1             111G  1.5G  104G   2% /mnt/data1

After I unmounted the S3QL filesystem, I examined the local filesystem mountpoint. I couldn't see anything useful in the files, but the directory /mnt/data1/s3ql_data_ looked interesting, so I took a look (Listing 11; the output was very long, so I truncated it). It contains a bunch of directories and binary files; using more or cat on them showed only gibberish, so it appears the encryption and compression worked.

Listing 11

/mnt/data1/s3ql_data_

01 [root@home4 Documents]# ls -s /mnt/data1/s3ql_data_/
02 total 167780
03     4 100      4 466      4 832               12 s3ql_data_341      4 s3ql_data_671
04     4 101      4 467      4 833               36 s3ql_data_342      8 s3ql_data_672
05     4 102      4 468      4 834               44 s3ql_data_343     12 s3ql_data_673
06     4 103      4 469      4 835               16 s3ql_data_344    276 s3ql_data_674
07     4 104      4 470      4 836               56 s3ql_data_345    472 s3ql_data_675
08     4 105      4 471      4 837                4 s3ql_data_346   2216 s3ql_data_676
09 ...

You can explore some very interesting aspects of S3QL, such as its caching capability (by default, the cache lives in a directory named ~/.s3ql in the user's home directory). The cache can be tuned or even relocated somewhere else.

You can also explore other storage back ends. For example, setting up an Amazon S3 account isn't too difficult, and if you only back up a little bit of data, it's easy to afford when you are just learning. Be sure to read the section in the documentation on account authentication [26] and how you use your S3 login and password (or any storage back end requiring authentication).

Another option is a storage back end that doesn't have to be exported via a protocol such as NFS. If you can reach the system hosting the storage via ssh, you can use sshfs to mount that storage on your client and run S3QL on top of it. (Read the section [27] on how to use sshfs to mount storage on your client.)

Once you've accessed the system, you can create an S3QL filesystem and use it like any other back-end storage. A point that may be lost on people is that S3QL is a filesystem like any other, local or network based.

You can use it as a backup target for something like rsync, or with rsnapshot [28] or RIBS [29], which both use rsync. S3QL just becomes the "rsync target," then you can use one of the S3QL back ends as you want for backups or data replication (disaster recovery). You can also use your favorite file-based backup or replication tool.
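A minimal sketch of using a mounted S3QL filesystem as an rsync target follows. A temporary directory stands in for the article's /mnt/s3ql/laytonjb mountpoint so the sketch runs anywhere; with a real S3QL mount, the commands are the same.

```shell
# Sketch: replicate a directory to an S3QL mountpoint with rsync.
# A temp dir stands in for /mnt/s3ql/laytonjb here.
SRC="$(mktemp -d)"; TARGET="$(mktemp -d)"
echo "important data" > "$SRC/doc.txt"
# -a = archive mode (recursive, preserves permissions and timestamps);
# --delete mirrors removals so the target tracks the source exactly.
rsync -a --delete "$SRC"/ "$TARGET"/
ls "$TARGET"
```

Run from cron, this gives you simple scheduled replication; rsnapshot [28] and RIBS [29] layer snapshot rotation on top of the same mechanism.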

Summary

Increasing amounts of data are pushing the need for backups and data replication. With petabyte storage systems becoming common, particularly in high-performance computing, it can be difficult to field enough on-site hardware for backup or replication operations. Why not use cloud storage for this? However, cloud storage typically means object-based storage, so how do you use PUT, GET, DELETE, and HEAD commands to make copies or backups?

S3QL, a tool that lets you use object-based storage for backups or rsync-style replication, has huge potential. It has one of the most important features in today's climate: encryption.

Additionally, it has compression and de-dupe capabilities, as well as dynamic sizing. A really interesting aspect of S3QL is that it offers several back-end storage options: Amazon S3, Rackspace Cloud Files, OpenStack Swift, and Google Storage, as well as S3-compatible targets and local filesystems.

Don't forget that S3QL behaves just like a filesystem, so you can use the classic tools against it, including backup or replication tools such as rsync. Give S3QL a whirl – it has some really cool features.

Infos

  1. Amazon S3: http://aws.amazon.com/s3/
  2. Amazon S3 web console: http://docs.aws.amazon.com/AmazonS3/latest/gsg/PuttingAnObjectInABucket.html
  3. AWS CLI: http://aws.amazon.com/cli/
  4. S3 command docs: http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
  5. s3cmd: http://s3tools.org/s3cmd
  6. s3sync: http://s3sync.net/wiki
  7. S3 Transfer Engine: http://www.bbconsult.co.uk/Resources/AmazonS3TransferEngine.aspx
  8. TntDrive: http://tntdrive.com/
  9. S3 Browser: http://s3browser.com/
  10. CloudBerry Explorer: http://www.cloudberrylab.com/free-amazon-s3-explorer-cloudfront-IAM.aspx
  11. Bucket Explorer: http://www.bucketexplorer.com/
  12. S3Fox Organizer: http://www.s3fox.net/
  13. boto_rsync: https://github.com/seedifferently/boto_rsync
  14. Duplicity: http://duplicity.nongnu.org/index.html
  15. AWS SDKs: http://aws.amazon.com/code
  16. S3 API: http://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html
  17. S3 Quick Reference: http://awsdocs.s3.amazonaws.com/S3/latest/s3-qrc.pdf
  18. AWS size: http://tinyurl.com/l978djn
  19. S3QL: http://code.google.com/p/s3ql/
  20. sshfs: http://fuse.sourceforge.net/sshfs.html
  21. S3QL installation instructions: http://code.google.com/p/s3ql/wiki/Installation
  22. PUIAS repo: https://puias.math.ias.edu/
  23. Atomicorp repository: http://www.atomicorp.com/
  24. APSW: http://code.google.com/p/apsw/
  25. S3QL User's Guide: http://www.rath.org/s3ql-docs/index.html
  26. S3QL authentication: http://www.rath.org/s3ql-docs/authinfo.html
  27. SSH back end: http://www.rath.org/s3ql-docs/tips.html#ssh-backend
  28. rsnapshot: http://www.rsnapshot.org/
  29. RIBS (rsync incremental backup script): http://sourceforge.net/projects/ribs/
