Lead Image © stylephotographs, 123RF.com

S3QL filesystem for cloud backups

Cloud Storage

Article from ADMIN 18/2013
By
Many HPC sites with petabytes of data need some sort of backup solution. Among the many candidates, cloud storage is a serious contender. In this article, we look at one solution with some serious advantages: S3QL.

Cloud computing provides computing resources and storage that can be accessed by any system connected to the Internet. Generally, this takes the form of virtual machines (VMs) and storage both in the VMs (direct attached) and over the network (usually block devices formatted and exported with NFS). Typically, users configure the VMs with their operating system and applications of choice, perhaps configure some storage, and start running. After that, the data is copied back to permanent storage (perhaps local). However, this is not the only way to utilize cloud resources. In this article, I'll focus on the storage aspect of the cloud, which can be used for backups and duplicates of data on a large or small scale.

Amazon S3 Storage

Although several cloud storage options are available, I'll focus on Amazon S3 [1] because, arguably, Amazon is the thousand-pound gorilla in cloud storage. The exact details of S3 are not public, but you can think of it as an object-based storage system. To begin using S3, you create one or more buckets. Each bucket contains objects, and you have no limits on the number of objects per bucket. Each object is a file and any associated metadata (e.g., ACLs).

Currently, each object can be up to 5 terabytes (TB) in size, accompanied by up to 2KB of metadata. However, a single write operation can upload at most 5GB, so files larger than 5GB have to be uploaded to S3 in multiple pieces. You typically interact with the objects (files) with a few simple commands: write (PUT), read (GET), or delete (DELETE). The usual method of interacting with S3 is via a web console [2]. Amazon also offers a command-line interface (CLI) [3] that can be used to interact with S3. The basic command begins with aws s3; then, you can use the common Linux commands cp, mv, ls, and rm to manipulate objects (files) in a specified bucket. For example,

$ aws s3 ls s3://mybucket

lists the objects in your bucket. You can also use a sync command to synchronize the objects in a bucket with a local directory. You can find S3 command documentation online [4].
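To make the 5GB single-write limit mentioned above concrete, the following minimal Python sketch plans how a large file could be cut into pieces before upload. It only does the arithmetic and never contacts S3; the function name plan_parts is hypothetical, and treating 1GB as 2**30 bytes is an assumption of the example.

```python
GIB = 1024 ** 3                 # assumption: 1GB is treated as 2**30 bytes
MAX_SINGLE_WRITE = 5 * GIB      # largest piece a single write can carry


def plan_parts(size, part_size=MAX_SINGLE_WRITE):
    """Return (offset, length) pairs that cover a file of `size` bytes.

    Hypothetical helper for illustration; it performs no upload.
    """
    if size < 0:
        raise ValueError("size must not be negative")
    if not 0 < part_size <= MAX_SINGLE_WRITE:
        raise ValueError("part_size must be between 1 byte and 5GB")
    parts = []
    offset = 0
    while offset < size:
        length = min(part_size, size - offset)   # last piece may be shorter
        parts.append((offset, length))
        offset += length
    return parts


if __name__ == "__main__":
    for offset, length in plan_parts(12 * GIB):
        print(f"offset={offset:>12}  length={length}")
```

With these assumptions, a 12GB file is split into two 5GB pieces and one 2GB piece, and an object at the 5TB maximum needs 1,024 pieces of 5GB each.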

Several open source projects, such as s3cmd [5], are very similar to Amazon's CLI tools. With s3cmd, you can copy and move files and sync directories with your S3 bucket, and you can use s3sync [6] to sync to S3 (somewhat like rsync). If you are running Windows, these Amazon S3 clients are available for backing up data to Amazon S3:

  • Amazon S3 Transfer Engine [7] is a free tool.
  • TntDrive [8] offers a free trial.
  • S3 Browser [9] is a free tool.
  • CloudBerry Explorer for Amazon S3 [10] has a free version and a Pro version.
  • Bucket Explorer [11] offers a free trial period, but it is a commercial product.

You can also find smartphone apps for interacting with S3, and you can even interact with S3 in your Firefox browser using S3Fox Organizer [12]. The Linux world also has tools for treating Amazon S3 as an rsync target. The best-known tool is probably boto_rsync [13], a Python tool that mimics rsync to sync from a local directory to an S3 bucket; s3sync and Duplicity [14] are other options.

A number of Amazon SDK libraries [15] can be used to build S3 applications, and S3 has an API [16] that describes how to interact with the bucket(s) and objects in the bucket (see the box "Operations"). The S3 API Quick Reference Card [17] illustrates the functions used to interact with an object storage system.

Operations

Bucket operations

  • PUT <bucket>
  • GET <bucket>
  • GET <bucket location>
  • DELETE <bucket>

Object operations

  • GET <object>
  • PUT <object>
  • COPY <object>
  • HEAD <object>
  • DELETE <object>
  • POST <object>

Basically, you can write the object (PUT), read the object (GET), delete the object (DELETE), and retrieve information about the object (HEAD). This interface is far simpler than POSIX, which has a long laundry list of functions for interacting with files. Several projects have taken the object storage paradigm and mapped it to a conventional filesystem by using FUSE, so the filesystems run in userspace:

  • s3backer
  • s3fs (FuseOverAmazon)
  • s3-simple-fuse
  • S3QL
  • yas3fs
  • s3fs-c
  • s3fs (Fedora hosted)
  • s3fs-fuse

Notice I listed a number of "s3fs" projects, but they are different from each other. Overall, it is really interesting how developers have taken a very simple storage solution with basically three commands and mapped it to a classic filesystem. This approach has wonderful potential, because you can now treat object storage as a regular filesystem, allowing you to use the regular tools you use for backup and replication.
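The idea of mapping a handful of object commands onto a filesystem can be sketched in a few lines of Python. The example below is a toy under stated assumptions: ToyBucket is an in-memory stand-in for a bucket (not a real S3 client), ToyFilesystem simply uses the path as the object key, and none of the projects listed above is claimed to work this way internally.

```python
class ToyBucket:
    """In-memory stand-in for a bucket; hypothetical, not a real S3 client."""

    def __init__(self):
        self._objects = {}                     # key -> (data, metadata)

    def put(self, key, data, metadata=None):   # write an object
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):                        # read an object
        return self._objects[key][0]

    def head(self, key):                       # information about an object
        data, metadata = self._objects[key]
        return {"size": len(data), **metadata}

    def delete(self, key):                     # delete an object
        self._objects.pop(key, None)

    def list(self, prefix=""):                 # keys that start with prefix
        return sorted(k for k in self._objects if k.startswith(prefix))


class ToyFilesystem:
    """Maps POSIX-style paths onto object keys in a ToyBucket."""

    def __init__(self, bucket):
        self.bucket = bucket

    @staticmethod
    def _key(path):
        return path.strip("/")                 # "/home/a.txt" -> "home/a.txt"

    def write_file(self, path, data):
        self.bucket.put(self._key(path), data)

    def read_file(self, path):
        return self.bucket.get(self._key(path))

    def unlink(self, path):
        self.bucket.delete(self._key(path))

    def listdir(self, path):
        # Directories do not exist as objects; they are inferred from keys.
        prefix = self._key(path)
        prefix = prefix + "/" if prefix else ""
        names = set()
        for key in self.bucket.list(prefix):
            names.add(key[len(prefix):].split("/", 1)[0])
        return sorted(names)


if __name__ == "__main__":
    fs = ToyFilesystem(ToyBucket())
    fs.write_file("/home/a.txt", b"hello")
    fs.write_file("/home/sub/b.txt", b"world")
    print(fs.listdir("/home"))
```

Even this toy shows where the hard parts lie: directories, renames, permissions, and partial writes have no direct counterpart in PUT, GET, and DELETE, so a real FUSE filesystem has to keep additional metadata somewhere.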

Object storage has been around for a while, but I argue that Amazon S3 popularized it. Subsequent object storage solutions use the same concepts of PUT, GET, and DELETE and have different implementations of them, as well as other features. Although the Amazon cloud is large [18] – five times larger than the next 14 cloud providers combined – cloud storage is growing rapidly with many providers.

Wouldn't it be nice to be able to use a different cloud storage provider, or several of them, from the same backup or replication tool? One complicating factor is that you have to assume data stored in the cloud is accessible to others. Therefore, you need to start thinking about encrypting the data itself as well as the data transmission to your storage back end. Encryption should be a non-optional part of your data plans.

S3QL

One of the most interesting S3 backup/filesystem tools is s3ql [19]. It creates a POSIX-like filesystem in user space (FUSE) using object storage or other storage as the target. Its major features are:

  • Encryption: Thanks to Mr. Snowden, we now know that the US government has access to far more of our data and communications than we realized. Therefore, if I'm going to back up my data to cloud storage, I want to make sure the data is encrypted. S3QL encrypts all data using a 256-bit AES key. An additional SHA-256 HMAC checksum protects the data from manipulation.
  • Compression: S3QL compresses the data before storing, using either LZMA, bzip2, or gzip. This compression takes place before the data is encrypted.
  • De-duplication: The data to be stored can be de-duplicated if the files have identical content. It works across all files stored in the filesystem, but it can also work if only some parts of the files are identical (i.e., not the entire file).
  • Copy-on-Write/Snapshotting: S3QL can duplicate directory trees without the use of any additional storage space (target storage). If one of the copies is modified, then only the part of the data that has been modified will take up additional storage space. You can use this capability to create intelligent snapshots that preserve the state of a directory at different points in time using a minimum amount of space.
  • Dynamic Size: The size of an S3QL filesystem can grow or shrink dynamically as required.
  • Transparency: An S3QL filesystem behaves like a local filesystem, in that it supports hard links, symlinks, typical permissions, extended attributes (xattr) and file sizes up to 2TB.
  • Immutable Trees: With S3QL, directory trees can be made immutable so that their contents can't be changed in any way whatsoever.
  • Range of Back Ends: S3QL has a range of back ends (what I call targets, but which S3QL calls storage back ends), such as Google Storage, Amazon S3, Amazon Reduced Redundancy Storage (RRS), OpenStack Swift, Rackspace Cloud Files, S3-compatible targets, local filesystems, and even filesystems accessed using sshfs [20], which are treated as local filesystems.
  • Caching: You can configure a cache for S3QL to improve apparent performance, perhaps taking advantage of a local SSD for data caching.
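The de-duplication and copy-on-write features in the list above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not S3QL's implementation: the class DedupStore, its method names, and the tiny 4-byte block size are all hypothetical and chosen only to keep the example readable.

```python
import hashlib

BLOCK_SIZE = 4   # deliberately tiny so the example is easy to follow


class DedupStore:
    """Toy block store: identical blocks are kept once, files hold references."""

    def __init__(self):
        self.blocks = {}   # SHA-256 hex digest -> block contents
        self.files = {}    # path -> list of block digests

    def write(self, path, data):
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # store each block only once
            digests.append(digest)
        self.files[path] = digests

    def read(self, path):
        return b"".join(self.blocks[d] for d in self.files[path])

    def copy_tree(self, src, dst):
        """Snapshot a directory tree by copying references, not data."""
        src = src.rstrip("/") + "/"
        dst = dst.rstrip("/") + "/"
        for path in list(self.files):
            if path.startswith(src):
                self.files[dst + path[len(src):]] = list(self.files[path])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = DedupStore()
    store.write("/data/a", b"AAAABBBBCCCC")
    store.write("/data/b", b"AAAABBBBDDDD")      # shares two blocks with a
    print("after two files:", store.stored_bytes())
    store.copy_tree("/data", "/snapshot")        # costs no block storage
    print("after snapshot: ", store.stored_bytes())
    store.write("/data/a", b"AAAAXXXXCCCC")      # only the new block is added
    print("after change:   ", store.stored_bytes())
```

In this model, the two files share their first two blocks, the snapshot adds no block data at all, and modifying one file afterward adds only the single changed block while the snapshot keeps the old contents. The sketch never frees blocks that are no longer referenced, which a real filesystem would have to handle.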

S3QL has other aspects that I really like, and I'll mention them throughout the remainder of the article.
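The order of operations in the Compression and Encryption features matters: encrypted data looks random and no longer compresses well, so compression has to come first. The Python sketch below illustrates the compress-then-protect sequence using only the standard library. It is an assumption-laden illustration rather than S3QL's actual format: the functions seal and unseal are hypothetical, and the AES encryption step is left out because the standard library has no AES implementation.

```python
import bz2
import gzip
import hashlib
import hmac
import lzma

# The three compression algorithms named in the feature list.
COMPRESSORS = {
    "lzma": (lzma.compress, lzma.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "gzip": (gzip.compress, gzip.decompress),
}


def seal(data, key, algorithm="lzma"):
    """Compress data, then compute an HMAC-SHA-256 tag over the result."""
    compress, _ = COMPRESSORS[algorithm]
    payload = compress(data)
    # A real tool would encrypt `payload` at this point (omitted here).
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload, tag


def unseal(payload, tag, key, algorithm="lzma"):
    """Verify the tag first; decompress only if the payload is untouched."""
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: data was modified or the key is wrong")
    _, decompress = COMPRESSORS[algorithm]
    return decompress(payload)


if __name__ == "__main__":
    key = b"k" * 32                      # hypothetical 256-bit key
    data = b"backup me " * 1000
    payload, tag = seal(data, key)
    print(len(data), "bytes compressed to", len(payload))
    assert unseal(payload, tag, key) == data
```

Checking the tag before decompressing means a manipulated object is rejected before any of its contents are processed.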

Building and Installing S3QL

S3QL is primarily written in Python and comes in two versions: Python 2.7 (an older version of s3ql) and Python 3.3 (the more modern version). I chose to use the Python 2.7 version because s3ql has several dependencies and building it from scratch might not be the easiest thing to do. You do need to pay careful attention to the installation instructions [21]. In my case, I followed the instructions provided for CentOS on my CentOS 6.4 system. The instructions are very good and very accurate – be sure you read them all before installing S3QL. Briefly, I will mention some highlights of the installation process that I hope will help you.

To begin, turn off SELinux and reboot. You can find instructions for how to do this all over the web, but be sure the instructions are compatible with the security of your systems. After installing the dependencies, I followed the instructions for installing Python 2.7 from the PUIAS repo [22]. In case you are wondering, PUIAS is a project of the members of Princeton University and the Institute for Advanced Study (IAS). Although these credentials don't make the site automatically safe, I did a little checking, and I think their repo is very safe.

Conveniently, Python 2.7 was installed alongside my existing Python. PUIAS even produces a distro called Springdale that might be worth checking out. (It looks to be based on CentOS 6.4.) Next, I installed SQLite from the Atomicorp repo [23]. I always like to check out repos that people suggest before I use them, and in this case I was satisfied enough to continue. The instructions had me install SQLite version 3.7.9: It is very important to remember this version number. The next step was to install Python APSW [24], where APSW stands for "Another Python SQLite Wrapper." You have to download the version of APSW that matches the version of SQLite you installed (see the previous step). This is critical; I tried several different versions of APSW that didn't match the version of SQLite I installed, and the S3QL build always failed.
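Because a mismatch between APSW and SQLite is so easy to overlook, a quick check before building S3QL can save time. The following Python sketch compares the two version strings. It assumes APSW version strings of the form 3.7.9-r1 (the SQLite version followed by a release suffix); the helper names are hypothetical, and the script is written for a current Python 3 interpreter rather than the Python 2.7 used in this article.

```python
def apsw_matches_sqlite(apsw_version, sqlite_version):
    """True if an APSW version such as '3.7.9-r1' targets the given SQLite.

    Assumes the '<sqlite version>-r<release>' naming convention.
    """
    return apsw_version.split("-", 1)[0] == sqlite_version


def check_installed_apsw():
    """Compare the installed APSW with the SQLite library it is using."""
    try:
        import apsw
    except ImportError:
        print("APSW is not installed for this interpreter")
        return None
    wrapper = apsw.apswversion()        # version of the APSW wrapper
    library = apsw.sqlitelibversion()   # version of the SQLite library in use
    ok = apsw_matches_sqlite(wrapper, library)
    print(f"APSW {wrapper}, SQLite {library}: {'match' if ok else 'MISMATCH'}")
    return ok


if __name__ == "__main__":
    check_installed_apsw()
```

For the versions mentioned above, apsw_matches_sqlite("3.7.9-r1", "3.7.9") is true, whereas an APSW built for a different SQLite release is reported as a mismatch.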

After installing a few more Python packages, I built and installed S3QL itself. It was very easy to build, but when I ran the "test" as part of the installation (python2.7 setup.py test), I got an assertion error. I have no idea if this is expected behavior or not, but after testing S3QL for a few days, I didn't run into any unexpected behavior. When you install S3QL, be sure to pay attention to where the various pieces are installed. They should be part of the standard path, but it's always good to document where things are located.
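One simple way to document where the pieces ended up is to ask the shell's search path. The short Python sketch below does this with the standard library. The command names in the list come from the S3QL documentation rather than from the text above, so treat the list as an assumption and adjust it to your installation; the function name locate is hypothetical.

```python
import shutil

# Assumed names of some S3QL command-line tools; adjust as needed.
S3QL_COMMANDS = [
    "mkfs.s3ql",
    "mount.s3ql",
    "umount.s3ql",
    "fsck.s3ql",
    "s3qlcp",
    "s3qlstat",
]


def locate(commands=S3QL_COMMANDS):
    """Map each command name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in locate().items():
        print(f"{name:12} {path or 'not found on PATH'}")
```

Saving this output alongside your installation notes makes it easier to track down the right binaries later, particularly when more than one Python version is installed on the same system.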
