Distributed storage with Sheepdog
Data Tender
Kazutaka Morita launched the Sheepdog project [1] in 2009. The primary driver was the lack of an open source implementation of a cloud-enabled storage solution – as a kind of counterpart to Amazon's S3. Scalability, (high) availability, and manageability were and still are the essential characteristics that Sheepdog seeks to provide. For Morita, setting up Ceph, which already existed at that time, was just too complicated. GlusterFS in turn was largely unknown. What these tools had in common was that they then tended to address the NAS market.
Sheepdog was intended from the beginning as a storage infrastructure for KVM/QEMU [2] (see also the box "Virtual Images with Sheep"). Because Morita was working at NTT (Nippon Telegraph and Telephone Corporation), there are corresponding copyright notices in the source code, but the code itself is released under the GPLv2.
Virtual Images with Sheep
Sheepdog started life with the objective of being the storage system for QEMU. In 2009, this meant a data store for the hard disk images of the virtual servers. This paradigm is still anchored very deeply in the project. The management of the stored data is based on virtual disk images (VDIs).
The command-line tool uses the vdi keyword to access the storage objects. Even for access via the POSIX layer, sheepfs relies on VDIs behind the scenes. "Normal" files cannot be stored without a few tricks. The user first needs to create a VDI, provide it with a filesystem, and finally mount the VDI.
Sheepdog manages four types of information for each storage object: the actual data, the metadata (name, size, timestamp, ID), the status of the associated virtual server, and the VDI attributes. In principle, you can also squeeze "normal" files into this schema, if necessary, simply by ignoring the last two pieces of information. This may change as the Swift interface is extended.
A Pack of Hounds
Very roughly, Sheepdog consists of three components: clusterware, storage servers, and the network. There is not much to say on the subject of networking. Sheepdog is IP-based, which for most environments implies Ethernet as the transport medium. In principle, InfiniBand is feasible as long as the stack presents an IP address [3] to Sheepdog. The clusterware provides the basic framework for the servers that provide the data storage. It does not ship with the project, so the admin must install and configure it before turning to Sheepdog.
The clusterware has three functions: It tracks which machines are currently part of the cluster, distributes messages to the nodes involved, and supports locking of storage objects. In the standard configuration, Sheepdog is set up for Corosync [4]. Alternatively, you can also use Zookeeper [5]. In that case, you need to configure the software with the --enable-zookeeper option at build time. Strictly speaking, however, Zookeeper is not actually clusterware. The software provides the functions listed above generically for distributed applications.
The project recommends the use of Corosync for up to 16 storage servers. The clusterware is integrated into the Sheepdog cluster. Each node runs the Corosync daemon and Sheepdog – there is a 1:1 relationship. For larger networks, Zookeeper is more appropriate. There is no 1:1 relation in this case. According to statements by the project, a Zookeeper cluster with three or five nodes is fine as a framework for hundreds of Sheepdog computers. If you need to be very economical, you can set up a pseudo-cluster on a single machine. The normal IPC (Inter Process Communication) takes on the task of distributing messages in this case. The remaining tasks of the clusterware are irrelevant if only one computer is involved.
The storage servers are the machines running Sheepdog that provide local disk space to the cluster (Figure 1). Normally, only one daemon process by the name of sheep runs on each computer. It processes the I/O requests and takes care of storing the data. The participating storage servers are simply called sheep in the project's jargon and are referenced as such in the documentation.
Admins use command-line options to manage the properties of the Sheepdog storage (Table 1). The most important option specifies the directory where Sheepdog manages the data. This can be a normal directory; however, the underlying filesystem must support extended attributes. For practical reasons, therefore, the use of XFS, ext3, or ext4 is recommended. For the latter two, you need to make sure they are mounted with the user_xattr option.
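Whether a directory really accepts extended attributes is easy to probe before you hand it to the sheep daemon. The following is a minimal sketch, assuming a Linux system with the setxattr(2) family from <sys/xattr.h>. The function name supports_user_xattr and the attribute name user.sheep_probe are made up for this illustration and are not part of Sheepdog.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Returns 1 if a user.* extended attribute can be written to `path`
 * and read back unchanged, 0 otherwise.
 * "user.sheep_probe" is a hypothetical probe name, not one that
 * Sheepdog itself uses.
 */
static int supports_user_xattr(const char *path)
{
	const char *name = "user.sheep_probe";
	const char value[] = "ok";
	char buf[sizeof(value)];

	assert(path != NULL);

	/* Fails with ENOTSUP/EOPNOTSUPP if the filesystem lacks xattr support. */
	if (setxattr(path, name, value, sizeof(value), 0) != 0)
		return 0;

	ssize_t n = getxattr(path, name, buf, sizeof(buf));
	int ok = n == (ssize_t) sizeof(value) &&
		 memcmp(buf, value, sizeof(value)) == 0;

	removexattr(path, name);	/* clean up the probe attribute */
	return ok;
}
```

Calling supports_user_xattr("/var/lib/sheepdog") as the user that will run the daemon returns 0 if the filesystem or its mount options rule out user attributes, or if the path is not writable for that user.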
Table 1
Important Sheepdog Options
Option | Meaning | Examples |
---|---|---|
-b | IP address for communication | 192.168.200.34, 0.0.0.0 |
-c | Clusterware to use | corosync, zookeeper:<Zookeeper information> |
-D | Use direct I/O on the back end | n/a |
-g | Work as a gateway (server without back-end storage) | n/a |
-i | IP address for I/O requests | 192.168.100.34 |
-j | Use journal | dir=/sd-journal,size=8G |
-p | Use this TCP port | 7000 |
-w | Enable caching for the storage objects | size=200G,dir=/sd-objcache |
At this time, Sheepdog does not make extensive use of the extended attributes. They are currently used for caching checksums of storage objects, storing the size of the managed disks, and managing the filesystem structure of the POSIX layer (sheepfs; more on that later). In the default configuration, the sheep daemon listens on TCP port 7000, which is the port that clients use to talk to the sheep. Make sure that port 7000 is not already occupied, especially if AFS servers are in use (see Listing 1).
Listing 1
Sheepdog Server in Action
# ps -ef|egrep '([c]orosyn|[s]heep)'
root       491     1  0 13:04 ?        00:00:30 corosync
root       581     1  0 13:13 ?        00:00:03 sheep -p 7000 /var/lib/sheepdog
root       582   581  0 13:13 ?        00:00:00 sheep -p 7000 /var/lib/sheepdog
# grep sheep /proc/mounts
/dev/sdb1 /var/lib/sheepdog ext4 rw,relatime,data=ordered 0 0
# grep sheep /etc/fstab
/dev/sdb1 /var/lib/sheepdog ext4 defaults,user_xattr 1 2
#
For serious use of Sheepdog, you should spend some time thinking about the network topology. For example, it's a good idea to separate the cluster traffic from the actual data. The documentation refers to an I/O NIC and a non-I/O NIC. Because both the clusterware and Sheepdog are IP based, the admin must configure the TCP/IP stack on both network cards and tell the sheep daemon the I/O IP address. If this card fails, Sheepdog automatically switches to the remaining non-I/O NIC. Unfortunately, this does not work the other way round.
If you have looked into distributed storage systems previously, you are likely to ask the legitimate question of how the metadata is managed. Like GlusterFS, Sheepdog does not have dedicated instances for this purpose. It uses consistent hashing to place and retrieve data (for further information, see the "Metadata Server? No, Thanks!" box).
Metadata Server? No, Thanks!
If there is no metadata server in distributed storage, the question arises as to how the computers involved know where to store or pick up the data. In this context, the use of hash functions is a popular approach. In general, a hash function maps data of arbitrary length to a value of fixed length. Well-known examples are checksums with MD5 [6] or SHA-1 [7].
In the case of distributed data storage systems, the hash function assigns an object to a target server. This mapping is either stored in a table or the software components in question do the computations on the fly. The size of the table is determined by the number of participating storage servers. If the size of the table changes, the previous assignments become invalid. It is then necessary to recompute the mappings.
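The cost of a changing table size is easy to demonstrate. The following sketch is an illustration only and not Sheepdog code: It places hypothetical objects named obj-0, obj-1, and so on with a naive "hash modulo number of servers" rule and counts how many of them end up on a different server when the cluster grows. The helper names naive_server and count_moved are assumptions made for this example; the hash is the 64-bit FNV-1a function that Sheepdog uses (Listing 2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 64-bit FNV-1a over a byte buffer, same scheme as in Listing 2. */
static uint64_t fnv1a_64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;	/* offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

/* Naive placement: hash of the object name modulo the server count. */
static unsigned naive_server(const char *name, size_t len, unsigned nr_servers)
{
	assert(nr_servers > 0);
	return (unsigned) (fnv1a_64(name, len) % nr_servers);
}

/*
 * Counts how many of `nr_objects` hypothetical objects change their
 * server when the cluster goes from `before` to `after` servers.
 */
static unsigned count_moved(unsigned nr_objects, unsigned before, unsigned after)
{
	unsigned moved = 0;
	char name[32];

	for (unsigned i = 0; i < nr_objects; i++) {
		int len = snprintf(name, sizeof(name), "obj-%u", i);

		if (naive_server(name, (size_t) len, before) !=
		    naive_server(name, (size_t) len, after))
			moved++;
	}
	return moved;
}
```

For an ideally uniform hash, growing from four to five servers leaves an object in place only if its hash modulo 20 is below 4, so about 80 percent of all objects would have to move. A call such as count_moved(10000, 4, 5) lets you check how close FNV-1a comes to that figure.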
In distributed storage, the number of storage servers is subject to change by design. This means that consistent hashing [8] is often used. Sheepdog is no exception. The hash function used is called FNV-1a (Fowler-Noll-Vo-1a) after its inventors Glenn Fowler, Landon Curt Noll, and Phong Vo [9]. Its two parameters are the initial value FNV_offset_basis and the multiplier FNV_prime. Both are defined as constants in the sheepdog_proto.h file, which fixes them at compile time (Listing 2).
Listing 2
Sheepdog Uses the FNV-1a Algorithm
#define FNV1A_64_INIT ((uint64_t) 0xcbf29ce484222325ULL)
#define FNV_64_PRIME ((uint64_t) 0x100000001b3ULL)

/* 64 bit Fowler/Noll/Vo FNV-1a hash code */
static inline uint64_t fnv_64a_buf(const void *buf, size_t len, uint64_t hval)
{
	const unsigned char *p = (const unsigned char *) buf;

	for (int i = 0; i < len; i++) {
		hval ^= (uint64_t) p[i];
		hval *= FNV_64_PRIME;
	}

	return hval;
}
The result of the hash function is 64 bits in length. The name of the data object provides one input value, the offset another. Sheepdog stores the data in chunks of 4MB. The hash of the name is also used internally by the software to set the object ID. By the way, the project used a different algorithm, SHA-1 [10], in its early months.
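How name and offset could come together is sketched below. The chunk arithmetic follows directly from the 4MB chunk size. The placement_key function, in contrast, is a hypothetical construction for illustration: It simply continues the FNV-1a computation over the chunk index and does not claim to reproduce Sheepdog's actual object ID layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE	(4ULL * 1024 * 1024)	/* 4MB chunks */
#define FNV_INIT	0xcbf29ce484222325ULL	/* offset basis from Listing 2 */
#define FNV_PRIME	0x100000001b3ULL	/* prime from Listing 2 */

/* 64-bit FNV-1a; `hval` allows chaining several buffers. */
static uint64_t fnv1a_64(const void *buf, size_t len, uint64_t hval)
{
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++) {
		hval ^= p[i];
		hval *= FNV_PRIME;
	}
	return hval;
}

/* Index of the 4MB chunk that contains byte `offset` of an image. */
static uint64_t chunk_index(uint64_t offset)
{
	return offset / CHUNK_SIZE;
}

/* Position of byte `offset` inside its chunk. */
static uint64_t chunk_offset(uint64_t offset)
{
	return offset % CHUNK_SIZE;
}

/*
 * Hypothetical placement key: hash the image name, then feed the chunk
 * index into the same FNV-1a stream. Illustrates "name plus offset in,
 * 64 bits out"; Sheepdog's real object IDs may be built differently.
 */
static uint64_t placement_key(const char *name, uint64_t offset)
{
	assert(name != NULL);

	uint64_t idx = chunk_index(offset);
	uint64_t h = fnv1a_64(name, strlen(name), FNV_INIT);

	return fnv1a_64(&idx, sizeof(idx), h);
}
```

All bytes of one chunk share a key, whereas the first byte of the next chunk yields a different one. A 10MB offset, for example, falls into chunk 2 at position 2MB.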
The sheep daemon can work in gateway mode. In this mode, the computer does not participate in actual data management (Figure 2). Instead, it serves as an interface between the clients and the "storage sheep." With the gateway, you can hide the topology of the Sheepdog cluster from the computers that access it. This applies both to the structure of the network and to the flock of sheep.
A Peek Behind the Drapes
Admins can apply some tweaks to the Sheepdog cluster. As already mentioned, the software breaks down the data entrusted to it into chunks of 4MB. Sheepdog then distributes the chunks to the available sheep. There is no hierarchy here. All data items are located in the obj subdirectory on the back-end storage. On request, Sheepdog can protect write access through journaling (see also Table 1).
With journaling enabled, the sheep daemon first writes the data to its journal. This step improves data consistency if a sheep "drops dead." As with a normal journaling filesystem, use of a separate disk is recommended for the journal. If you use an SSD here, you can even improve the performance. Speaking of SSDs: Sheepdog has a Discard/Trim function. However, this is not aimed at the back-end storage but at the hard disk images of virtual servers. When the user releases space on the virtual machine, the VM can inform the underlying storage system, in this case Sheepdog. This function is disabled in the default configuration, and it only works with a relatively new software stack, including QEMU version 1.5, a Linux 3.4 kernel, and the latest version of Sheepdog.
A sheep process can manage multiple disks, which then form a local RAID 0 with the known advantages and disadvantages. The number of disks managed by Sheepdog is transparent to the members of the flock. Whether and how many disks a computer manages has no influence on the distribution of the data. If one drive fails, the other "sheep" just keep on grazing. The logfile of the sheep daemon and information about the Sheepdog cluster are always stored on the first disk.
Safety First!
A good distributed storage system has to meet high-availability requirements. Replication has established itself as the basic mechanism here. In the standard configuration, Sheepdog sets a copy factor of 3; the maximum value is 31. This value cannot be changed later, at least not without backing up all your data and rebuilding the storage cluster. Once configured, Sheepdog uses this replication factor for all data to be stored. For individual objects, however, you can set a different number of copies (Listing 3).
Listing 3
Defining Replication Factors
# dog vdi list
  Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag
  one.img      0  4.0 MB  0.0 MB  0.0 MB 2014-03-01 10:15   36467d       1
  three.img    0  4.0 MB  0.0 MB  0.0 MB 2014-03-01 10:16   4e5a1c       3
  two.img      0  4.0 MB  0.0 MB  0.0 MB 2014-03-01 10:15   a27d79       2
  quark.img    0  4.0 MB  0.0 MB  0.0 MB 2014-03-01 10:16   dfc1b0      31
#
The software, however, does not perform any sanity checks. In theory, you can therefore specify a replication factor that is greater than the number of available storage computers; in practice, the software stops at its physical limits. Despite replication, Sheepdog still insists on distributing the data chunks across the available sheep. This means that the copies are as wildly distributed as the data itself. If a server fails, Sheepdog reorganizes the chunks to keep the replication factor. The same thing happens when the failed server resumes its service. If this behavior is not desired, you need to disable automatic recovery with:
dog cluster recover disable
Data availability via replication of course costs disk space. In the standard configuration, the Sheepdog cluster must provide three times the nominal capacity. You can save some space with erasure coding [11], which Sheepdog also supports. The configuration options are similar to the copy factor. You define the default setting for all data when setting up the Sheepdog cluster. You can assign different values for individual objects to change the number of data or parity stripes. The maximum for the latter is 15. For data stripes, the powers of two between 2 and 16 are configurable.
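How much space erasure coding saves follows from simple arithmetic: With d data stripes and p parity stripes, the cluster stores d + p stripes for every d stripes of user data. The sketch below computes this ratio and checks the limits named above. The function names are made up for this example, and the lower bound of one parity stripe is an assumption, because the text only states the maximum.

```c
#include <assert.h>
#include <stdbool.h>

/* Raw-to-usable capacity ratio for plain replication. */
static double replication_overhead(unsigned copies)
{
	assert(copies >= 1);
	return (double) copies;
}

/*
 * Limits as stated in the text: 2, 4, 8, or 16 data stripes and at
 * most 15 parity stripes. The minimum of 1 parity stripe is assumed.
 */
static bool valid_erasure_scheme(unsigned data, unsigned parity)
{
	bool data_ok = data == 2 || data == 4 || data == 8 || data == 16;

	return data_ok && parity >= 1 && parity <= 15;
}

/* Raw-to-usable capacity ratio for data plus parity stripes. */
static double erasure_overhead(unsigned data, unsigned parity)
{
	assert(valid_erasure_scheme(data, parity));
	return (double) (data + parity) / data;
}
```

With four data and two parity stripes, erasure_overhead(4, 2) yields 1.5, compared with 3.0 for the default copy factor. Assuming the erasure code can rebuild the data after the loss of any p stripes, both variants survive two failed sheep, but the erasure-coded one needs only half the raw capacity.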