Management improvements, memory scaling, and EOL for FileStore
Refreshed
In the early days, Ceph was considered new, hip, and innovative because it promised scalable storage without ties to established storage solutions. If you were fed up with SANs, Fibre Channel, and complicated management tools, Ceph offered a refreshingly different alternative that did not need special and expensive hardware. Since then, Ceph has become a commodity (that is, an established solution for specific areas of application) with a market for training as well as for specialists with Ceph knowledge. It's hardly surprising, then, that new Ceph releases no longer arrive with masses of new features, as they did in the past, and instead take things a little easier.
The newly released Ceph version 17.2 [1] bears witness to that trend: Instead of making sweeping changes, the developers have focused on tweaks, with few changes for existing clusters. Only if you operate a very old Ceph cluster and still rely on the old on-disk FileStore format (Figure 1) will you be prompted to take immediate action. In this article, I explain why this is the case and what else you can and should expect from Ceph 17.2.
Goodbye FileStore
One of the most important changes in Ceph 17.2 is undoubtedly that the old on-disk format FileStore from the early Ceph days is now marked as deprecated, although this does not yet have any practical consequences for day-to-day operation. Ceph's owner, Red Hat, leaves no doubt that FileStore will be removed from Ceph sooner or later and urges admins of existing systems to upgrade their Ceph disks to the latest technology.
Like any storage solution, Ceph relies on block storage in the background to store its data. As an object store, however, its central function is to provide an abstraction layer between the physical storage devices on the one hand and the clients accessing them on the other. The clients do not need to worry about the physical architecture of the cluster, and you can use almost any number of physical devices as storage in the background. The vast majority of Ceph clusters today continue to rely on slow hard drives, at least for mass storage, because fast flash-based storage has not yet reached the break-even point in terms of cost per gigabyte.
For the Ceph object store RADOS to store its data on block devices, it needs a structure. Earlier Ceph versions relied on a classic POSIX-compliant filesystem, which had to be created on the respective storage device first so that the RADOS object store could save binary objects on it. For many years, developers recommended XFS (Figure 2). In the background, however, people were eyeing Btrfs, which promised considerable speed advantages in combination with RADOS. The combination of a POSIX filesystem and the Ceph metadata on object storage devices (OSDs) was subsequently named FileStore.
How Red Hat's Btrfs adventure ended is well known: In Red Hat Enterprise Linux (RHEL) 8, the filesystem was dropped from the distribution. Even in RHEL 7, Btrfs had only made it to Technical Preview status, so from Red Hat's point of view, the filesystem was never fit for production. This situation quickly became a problem for the Ceph developers because the stopgap, XFS, was increasingly proving to be a performance inhibitor.
On closer inspection, RADOS is designed to create new directories for new data on its OSDs and assign new object IDs instead of recycling old ones. Even if a user overwrites an existing file through one of the three standard interfaces in RADOS (CephFS, Ceph Block Device, aka RADOS block device or RBD, and Ceph Object Gateway), RADOS does not replace existing objects in the background. Instead, it writes the new data as a new object and marks the old data as obsolete so that it can be overwritten shortly afterward.
The Problem with POSIX
The crux of the matter is that POSIX-compatible filesystems offer many integrity checks and consistency guarantees for cases such as overwriting existing data, and implementing those guarantees eats into performance. Ceph and its legacy format FileStore were in a lose-lose situation. Because effectively only XFS was available as an on-disk filesystem for OSDs, Ceph had to live with its consistency and integrity guarantees, even though they made no sense at all in the Ceph context.
A few years ago, the Ceph developers finally ran out of patience and started working on a FileStore successor. Instead of a bloated POSIX filesystem, they determined it would be enough to maintain some sort of database in an OSD's metadata that records the physical location of each object on the disk. A key-value store would be good enough, flanked by a rudimentary filesystem that allows data to be stored on the physical device.
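The basic idea can be sketched in a few lines of Python. The following toy model is not BlueStore code: The class names ToyObjectIndex and Extent, the append-only allocation, and the bytearray standing in for a raw block device are all assumptions made purely for illustration. It only shows how a key-value index that maps object names to on-disk locations can replace a full POSIX filesystem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Location of one object on the raw device (hypothetical structure)."""
    offset: int  # byte offset on the device
    length: int  # number of bytes stored


class ToyObjectIndex:
    """In-memory stand-in for the key-value store that tracks object locations."""

    def __init__(self, device_size: int) -> None:
        self.device_size = device_size
        self._index = {}      # object name -> Extent
        self._next_free = 0   # naive append-only allocator

    def put(self, name: str, data: bytes, device: bytearray) -> Extent:
        """Write data to fresh space and point the index entry at it."""
        if self._next_free + len(data) > self.device_size:
            raise ValueError("toy device is full")
        extent = Extent(self._next_free, len(data))
        device[extent.offset:extent.offset + extent.length] = data
        self._index[name] = extent  # an older extent simply becomes unreferenced
        self._next_free += len(data)
        return extent

    def get(self, name: str, device: bytearray) -> bytes:
        """Look up the extent in the index and read the bytes from the device."""
        extent = self._index[name]
        return bytes(device[extent.offset:extent.offset + extent.length])


# Usage: a 64-byte bytearray plays the role of the block device.
device = bytearray(64)
index = ToyObjectIndex(device_size=64)
first = index.put("obj-a", b"hello", device)
second = index.put("obj-a", b"HELLO", device)  # rewrite lands in new space
```

Note that rewriting obj-a does not touch the old bytes; the index entry simply points to a new extent. A real implementation additionally needs persistence, space reclamation, checksums, and crash safety, none of which this sketch attempts.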
It took a while to settle on the right combination of tools. LevelDB saw some use as a database, but it was not fast enough for the developers and soon fell out of favor. In the end, the race was won by RocksDB, an extremely fast key-value store by Facebook that became the core of the new on-disk format. To clearly distinguish the new format from the old, it was named BlueStore.
BlueStore soon became popular with Ceph users. Hacks such as moving the FileStore journals to fast SSDs were no longer necessary thanks to the new solution. For classic workloads, speed increases of 50 percent and more could be achieved with BlueStore in individual cases, even without additional hardware [2]. The developers virtually stopped developing FileStore because they considered it a dead end; for years, little more than the bare minimum of effort has gone into keeping FileStore functional in Ceph.
More Efficiency
BlueStore has been the default in Ceph for several releases, and the Ceph developers have publicly discussed giving FileStore the coup de grâce on the project's mailing list.
What makes sense from the user's point of view is only logical from the developer's point of view, too. Only old setups that have not been migrated for years are likely to still use FileStore, and it is hardly worthwhile to continue actively maintaining large amounts of code in Ceph for them. This is all the more true because BlueStore now lacks virtually none of the features that were once available in FileStore. If admins are still using FileStore today, it is probably because they have not taken the time to update.
Ceph developers are now homing in on precisely these admins with Ceph 17.2. The decision to flag FileStore officially as deprecated is a clear warning signal and an unequivocal indication that active Ceph clusters with FileStore will be excluded from future updates. If you are still running a FileStore-based cluster, ideally you will want to start converting your OSDs to BlueStore very soon.
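Before planning a conversion, it helps to know which OSDs are affected. The following Python sketch is one possible way to find out, assuming that you have saved the JSON output of the ceph osd metadata command and that each entry carries the id and osd_objectstore fields; the function name filestore_osds and the sample data are made up for illustration, so check the output of your own cluster before relying on it.

```python
import json
from typing import List


def filestore_osds(metadata_json: str) -> List[int]:
    """Return the IDs of all OSDs whose metadata reports the FileStore back end.

    Assumes a JSON list of per-OSD dictionaries with the keys "id" and
    "osd_objectstore", as printed by `ceph osd metadata`.
    """
    entries = json.loads(metadata_json)
    return sorted(
        entry["id"]
        for entry in entries
        if entry.get("osd_objectstore") == "filestore"
    )


# Made-up sample data; a real cluster reports many more fields per OSD.
sample = json.dumps([
    {"id": 2, "osd_objectstore": "filestore"},
    {"id": 0, "osd_objectstore": "bluestore"},
    {"id": 1, "osd_objectstore": "filestore"},
])
legacy = filestore_osds(sample)
```

In this example, legacy contains the IDs 1 and 2, that is, the OSDs that would have to be redeployed with BlueStore.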