Comparing Ceph and GlusterFS
Shared storage systems GlusterFS and Ceph compared
Replication
An essential aspect of high data availability in shared storage solutions is producing copies. This is called replication in GlusterFS-speak, and yes, there is a translator for it. It is left to the admin's paranoia level to determine how many copies are automatically produced.
The basic rule is that the number of bricks used must be an integer multiple of the replication factor. Incidentally, this decision also has consequences for the procedure when growing or shrinking the GlusterFS cluster – but more on that later. Replication occurs automatically and basically transparently to the user. The number of copies wanted is set per volume. Thus, completely different replication factors may be present in a GlusterFS network. In principle, GlusterFS can use either TCP or RDMA as a transport protocol. The latter is preferable when it comes to low latency.
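As a minimal sketch – server names and brick paths are hypothetical – a two-way replicated volume could be created as follows; note that the four bricks are an integer multiple of the replica count of 2:

gluster volume create vol01 replica 2 transport tcp \
  srv1:/bricks/b1 srv2:/bricks/b1 \
  srv1:/bricks/b2 srv2:/bricks/b2
gluster volume start vol01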
Realistically speaking, RDMA plays a rather minor role. Where the replication traffic flows depends on the client's access method. If the native GlusterFS filesystem driver is used, the sending computer ensures that the data reaches all the necessary bricks. It does not matter which GlusterFS server the client addresses when mounting. In the background, the software establishes the connections to the actual bricks.
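With hypothetical server and volume names, a native mount looks like this; the server named in the command only serves as the initial contact point:

mount -t glusterfs srv1:/vol01 /mnt/gluster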
GlusterFS handles access via NFS internally. If it's suitably configured, the client-server channel does not need to process the copy data stream. Geo-replication is a special treat; it involves asynchronous replication to another data center (Figure 4). The setup is simple and is essentially based on rsync via SSH. It does not matter whether there is a GlusterFS cluster or just a normal directory on the opposite side.
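A rough sketch of such a setup – volume name and target are hypothetical, and the exact syntax differs between GlusterFS releases – could look like this:

gluster volume geo-replication vol01 backuphost:/data/vol01-copy start
gluster volume geo-replication vol01 backuphost:/data/vol01-copy status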
At present, geo-replication is a serial process and suffers from rsync's known scaling problems. Also, in the default configuration, the process runs with root privileges on the target site and thus represents a security risk. Version 3.5 is said to have completely revised this situation: replication can then be parallelized, and identifying which changes have been made – and are thus part of the copy process – is also said to be considerably faster.
The failure of a brick is transparent to the user with a replicating volume – except where it affects the last remaining brick. The user can continue working while the repair procedures take place. Once the repairs are complete, GlusterFS can reintegrate the brick and then automatically start data synchronization. If the brick is irreparable, however, some TLC is needed. The admin must throw it out of the GlusterFS cluster and integrate a new brick. Finally, the data is synchronized so that the desired number of copies is again present. For this reason, providing a certain level of fail-safety for the individual bricks outside of GlusterFS is a good idea. Possible measures would be the use of RAID controllers and/or redundant power supplies and interface cards.
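Assuming hypothetical server and volume names, replacing a dead brick and restoring the desired number of copies might look like this:

gluster volume replace-brick vol01 srv2:/bricks/b1 srv3:/bricks/b1 commit force
gluster volume heal vol01 full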
Users and applications accessing GlusterFS via NFS require further measures. The client machine only has one connection to an NFS server; if this fails, access to the data is temporarily gone. The recommended approach is to set up a virtual IP that, in the event of an error, points to a working NFS server. In its commercial product, Red Hat uses the cluster trivial database (CTDB) [9], which is well known from the Samba world.
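A minimal sketch of the idea, with hypothetical addresses: CTDB keeps a floating IP on one of the working NFS servers, and clients only ever mount that address.

# /etc/ctdb/nodes -- internal addresses of the participating NFS servers
10.0.0.11
10.0.0.12
# /etc/ctdb/public_addresses -- the virtual IP clients use for mounting
192.168.100.50/24 eth0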
As already mentioned, the servers involved in the GlusterFS composite enter a trust relationship. For all distributed network-based services, failure of switches, routers, cables, or interface cards is annoying. If the GlusterFS cluster is divided by such an incident, the question arises of who can write and who can't. Without further measures, unwanted partitioning in the network layer creates a split-brain scenario with the accompanying data chaos. GlusterFS introduced quorum mechanisms in version 3.3. In the event of a communication failure, the GlusterFS server can decide whether or not the brick can still write. The setup is quite simple. You can turn off the quorum completely, set it to automatic, or set the number of bricks that are to be considered a majority. If done automatically, a quorum applies if more than half of the participating "bricks" are still working.
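As a sketch with a hypothetical volume name, the three modes correspond roughly to the following volume options (the lines show alternatives, not a script):

gluster volume set vol01 cluster.quorum-type none    # quorum switched off
gluster volume set vol01 cluster.quorum-type auto    # more than half of the replica bricks must be up
gluster volume set vol01 cluster.quorum-type fixed   # together with...
gluster volume set vol01 cluster.quorum-count 2      # ...an explicit number of bricks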
Replication with Ceph
Although conventional storage systems often use various tricks to ensure redundancy through replication, the subject of replication (in combination with high data availability) is almost inherently part of Ceph's design. Weil has put a lot of effort into, on one hand, making reliable replication possible and, on the other hand, making sure the user notices as little of it as possible. The basis for replication is the communication that runs in the background between the individual OSDs of a cluster: Once a user uploads a binary object to an OSD, the OSD notices this and starts to replicate. It determines for itself – on the basis of the OSD and MON servers existing in the cluster and using the CRUSH algorithm – to which OSDs it needs to copy the new object and then does so accordingly.
Using a separate value for each pool (which is the name for the logical organizational units into which a Ceph cluster can be divided), the administrator determines how many replicas of each object should be present in that pool. Because Ceph basically works synchronously, a user only receives confirmation of a write at the moment at which the corresponding number of copies of the uploaded objects exists cluster-wide.
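For a hypothetical pool named images, the replica count can be set and checked like this:

ceph osd pool set images size 3       # keep three copies of every object
ceph osd pool set images min_size 2   # serve I/O as long as at least two copies are available
ceph osd pool get images size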
Ceph can also heal itself if something goes wrong: It acknowledges the failure of a hard drive after a set time (the default setting is five minutes) and then copies all missing objects and their replicas to other OSDs. This way, Ceph ensures that the admin's replication requirements are consistently met – except for the wait immediately after the failure of a disk, as mentioned earlier. The user will not notice any of this, by the way; the unified storage principle hides failures in the cluster from clients.
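The grace period before Ceph marks a failed OSD as out and starts re-replicating is configurable; a sketch for ceph.conf (the value shown is an assumption, not necessarily the shipped default):

[mon]
mon osd down out interval = 300    # seconds to wait before re-replication starts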
Front Ends – A Supplement
At first glance, it may seem odd to ask about programming interfaces for a storage solution. After all, most admins are familiar with SAN-type central storage and know that it offers a uniform interface – nothing more. This is usually because SAN storage works internally according to a fixed pattern that does not easily allow programmatic access – and what benefit would programming-level access actually offer in a SAN context?
With Ceph, however, things are a bit different. Because of its design, Ceph offers flexibility that block storage does not have. Because it manages its data internally, Ceph can basically publish any popular interface to the outside world as long as the appropriate code is available. With Ceph, you can certainly put programming libraries to good use that keep access to objects standardized. Several libraries actually offer this functionality. Librados is a C library that enables direct access to objects stored in Ceph. Additionally, for the two front ends, RBD and CephFS, separate libraries called librbd and libcephfs support userspace access to their functionality.
On top of that, several bindings for various scripting languages exist for librados so that userspace access to Ceph is possible here, too – for example, using Python and PHP.
The advantages of these programming interfaces can be illustrated by two examples: On one hand, the RBD back end in QEMU was realized on the basis of librbd and librados to allow QEMU to access VMs directly in Ceph storage without going through the rbd.ko kernel module. On the other, similar functionality is also implemented in TGT so that the tgt daemon can offer an RBD image as an iSCSI export without messing around with the operating system kernel.
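A rough sketch with hypothetical pool and image names – the image is created with the rbd tool and handed to QEMU directly, with no rbd.ko involved:

rbd create --size 10240 rbd/vm-disk1        # 10GB RBD image in the rbd pool
qemu-system-x86_64 -m 1024 -drive file=rbd:rbd/vm-disk1,format=raw,if=virtio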
A useful use case also exists for the PHP binding: Image hosters work with large quantities of images, which in turn are binary files. These usually reside on a POSIX-compatible filesystem [10] and are exported via NFS, for example – yet POSIX functionality is almost always unnecessary in such cases. A web application can be built on the PHP binding of librados that stores images directly in Ceph in the background, skipping the POSIX detour. This approach is efficient and saves resources from which other services can benefit.
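The rados command-line tool demonstrates the same object operations that the librados bindings expose to such a web application (pool and object names are hypothetical):

rados -p images put photo-42.jpg /tmp/upload.jpg   # store the uploaded file as an object
rados -p images get photo-42.jpg /tmp/out.jpg      # fetch it again for delivery
rados -p images ls                                 # list the objects in the pool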