Ceph object store innovations
Renovation
When Red Hat took over Inktank Storage in 2014, thus gobbling up the Ceph [1] object store, a murmur was heard throughout the community: on the one hand, because Red Hat already had a direct competitor to Ceph in its portfolio in the form of GlusterFS [2], and on the other because Inktank was a such a young company – for which Red Hat had laid out a large sum of cash. Nearly two years later, it is clear that GlusterFS plays only a minor role at Red Hat and that the company instead is relying increasingly on Ceph.
In the meantime, Red Hat has further developed Ceph, and many of the earlier teething problems have been resolved. In addition to bug fixing, developers and admins alike were looking for new features: Many admins find it difficult to make friends with Calamari, the Ceph GUI promoted by Red Hat; they need alternatives. The CephFS filesystem, which is the nucleus of Ceph, has been hovering in beta for two years and needs finally to be approved for production use. Moreover, Ceph repeatedly earned criticism for its performance, failing to compete with established storage solutions in terms of latency.
Red Hat thus had more than enough room for improvement, and much has happened in recent months. A good reason to take a closer look at Ceph is to determine whether the new functions offer genuine benefits for admins in everyday life and see what the new GUI has to offer.
CephFS: A Touchy Topic
The situation seems paradoxical: Ceph started life more than 10 years ago as a network filesystem. Sage Weil, the inventor of the solution, was looking into distributed filesystems at the time in the scope of his PhD thesis. It was his goal to create a better Lustre filesystem that got rid of Luster's problems. However, because Weil made the Ceph project his main occupation and was able to create a marketable product, the original objective fell by the wayside: Much more money was up for grabs by surfing the cloud computing wave than by sticking with block storage for virtual machines.
Consequently, much time was put into the development of the librbd library, which now offers a native Ceph back end for Qemu. Also, storage services in the same vein as Dropbox appear commercially promising: Ceph's RADOS gateway provides this function in combination with a classic web server. If you were waiting for CephFS, you mainly had to settle for promises. Several times Weil promised that CephFS would be ready – soon. However, for a long time, little happened in this respect.
Besides Weil's change in focus, the complexity of the task is to blame. On one hand, a filesystem such as CephFS must be POSIX compatible because it could not otherwise be meaningfully deployed. On the other hand, the Ceph developers have strict requirements for their solution: Each component of a Ceph installation has to scale seamlessly.
A Ceph cluster includes at least two types of services: a demon that handles the object storage device (OSD) and the monitor servers (MONs). The OSD ensures that the individual disks can be used in the cluster, and the MONs are the guardians of the cluster, ensuring data integrity. If you want to use CephFS, you need yet another service: the metadata server (MDS) (Figure 1).
Under the hood, access to Ceph storage via CephFS works almost identically to access via block device emulation. Clients upload their data to a single OSD, that automatically handles the replication work in the background. For CephFS to serve up the metadata in a POSIX-compatible way, the MDSs act as standalone services.
Incidentally, the MDSs themselves do not store the metadata belonging to the individual objects. The data is instead stored in the user-extended attributes of each object on the OSD. Basically, the MDS instances in a Ceph cluster only act as caches for the metadata. If they did not exist, accessing a file in CephFS would take a long time.
Performance and Scalability
For the metadata system to work in large clusters with high levels of concurrent access, MDS instances need to scale arbitrarily horizontally, which is technically complex. Each object must always have a MDS, which is ultimately responsible for the metadata of precisely that object. It must thus be possible to assign this responsibility dynamically for all objects in the cluster. When you add another MDS to an existing cluster, Ceph needs to take care of assigning objects to it automatically.
The solution the Ceph developers devised for this problem is smart: You simply divide the entire CephFS tree into subtrees and dynamically assign responsibility for each individual tree to MDS instances. When a new metadata server joins the cluster, it is automatically given authority over a subtree. This needs to work at any level of the POSIX tree: Once all the trees at the top level are assigned, the Ceph cluster needs to partition the next lower level.
This principle is known as dynamic subtree partitioning (DSP) and has already cost Weil and his team some sleepless nights. The task of controlling the assignment of the POSIX metadata trees dynamically for individual MDSs proved to be highly complicated, which is one of the main reasons CephFS has not yet been released for production.
The good news: In the newest version of Ceph (Jewel v10.2.x), the developers were looking to stabilize CephFS. Although, that version was not available at the time of this review, the work on the DSPs was largely completed. Recently, the Jewel version of CephFS was announced to be stable – "with caveats" [3].
New Functions for CephFS
In addition to a working DSP implementation, the Jewel release of CephFS offers more improvements. So far, one of the biggest criticisms was the lack of a filesystem checking tool for CephFS. This criticism regularly raises its head in the history of virtually all filesystems. You might recall the invective to which fsck
in ReiserFS was exposed.
If you have a network filesystem, the subject of fsck is significantly more complex than local filesystems that exist only on a single disk. On the one hand, a Ceph fsck needs to check whether the stored metadata are correct; on the other hand, it needs to be able to check the integrity of all objects belonging to a file on request by the admin. This is the only way ultimately to ensure that the admin can download every single file from the cluster. If recovery is no longer possible, the cluster at least needs to notify the admin to get the backups out of the drawer.
Jewel will include a couple of retrofits in terms of a filesystem check: Greg Farnum, who worked on Ceph back in the Inktank era, has taken up the gauntlet. Together with John Spray, he partly consolidated and revised the existing approaches for the Jewel version of Ceph. The result is not a monolithic fsck tool, but a series of programs that are suitable for different purposes: cephfs-data-scan
can rebuild the pool of metadata in CephFS, even if it is completely lost. The concept of the damage table is also new: The idea is for the CephFS repair tools to use the table to "remember" where they discovered errors in the filesystem. The Ceph developers are looking to avoid the situation in which clients keeping on trying to read the same defective data. The table also acts as the starting point for subsequent repair attempts. Finally, the damage table also helps improve the stability of the system: If the Ceph tools find a consistency problem in a CephFS-specific subtree, they mark the individual subtree as defective in the future. Previously, it was only possible to circumnavigate larger areas of the filesystem.
One thing is clear: The CephFS changes in Jewel make this product significantly more mature and create a real alternative to GlusterFS. Although the first stable version of CephFS will still contain some bugs, The Ceph developers in recent months will definitely have made sure that any existing bugs will not cause loss of data. If you want to use CephFS in the scope of a Red Hat or Inktank support package, Jewel is likely to give you the opportunity to do just that, because Jewel will be a long-term support release.