Comparing Ceph and GlusterFS
Shared storage systems GlusterFS and Ceph compared
Big Data is one of today's major IT buzzwords. Cynical observers sometimes remark that, although everyone talks about the subject, no one really knows what it is. US-based Inktank and Linux veteran Red Hat, on the other hand, have been making concrete contributions to Big Data for some time.
Specifically, those contributions are the Ceph [1] object store and the GlusterFS [2] filesystem, which provide the underpinnings for Big Data projects. The term refers not only to storing data but also to organizing it systematically and searching efficiently through large data sets. For this process to work, the data first has to reside somewhere. This is exactly where Inktank and Red Hat see a niche for their products, and both vendors are doing their best to fill it.
Endless Expanses
Both companies make the same basic promise: Storage built with GlusterFS or Ceph is supposed to be almost endlessly expandable, so admins never run out of space again. This promise is, however, almost the only similarity between the two projects; underneath, they work completely differently and reach their goals in different ways. Anyone who has not yet dealt with either solution in detail can hardly be expected to grasp the basic workings of Ceph and GlusterFS right away, so comparing the two projects is not easy. In this article, we draw as complete a picture of the two solutions as possible and directly compare the functions of Ceph and GlusterFS. What is Ceph best suited for, and where do GlusterFS's strengths lie? Are there use cases in which neither is a good fit?
Ceph – The Basics
Newcomers to Ceph and GlusterFS may have difficulty conceptualizing these projects. How can storage scale horizontally without interruption? How do concrete solutions overcome physical limitations, such as the boundaries of individual hard drives?
Those who dare to take their first steps in this field with Ceph are immediately faced with a complex collection of tools that together deliver that seemingly endless storage. Ceph belongs to the object store camp, a category defined by the lowest common denominator: Object stores are so named because they store data in the form of binary objects. Besides Ceph, OpenStack Swift is another representative of this category on the free and open source software market. On the commercial side, Amazon's S3 probably works in a similar way.
Binary Objects
Object stores rely on binary objects because they can easily be split into many small parts – if you put the individual parts back together in the original order later, you have exactly the same file as before. Up to the point where the objects are reassembled, the individual parts of the binary object can be stored in a distributed way. For example, they can be shared across multiple hard drives, which can be located on different servers. In this way, object stores bypass the biggest disadvantage that classic storage solutions have to contend with: rigid division into blocks.
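The split-and-reassemble idea can be illustrated with a minimal Python sketch. It is not how Ceph or any other object store is implemented; the chunk size, the function names, and the round-robin distribution across in-memory "disks" are assumptions chosen purely for illustration.

```python
from typing import Dict, List

CHUNK_SIZE = 4  # bytes; deliberately tiny and purely illustrative


def split_object(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Cut a binary object into consecutive parts of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def distribute(chunks: List[bytes], num_disks: int) -> List[Dict[int, bytes]]:
    """Spread the parts round-robin across hypothetical disks.

    Each disk is modeled as a dict that maps a part's position to its bytes.
    """
    disks: List[Dict[int, bytes]] = [dict() for _ in range(num_disks)]
    for index, chunk in enumerate(chunks):
        disks[index % num_disks][index] = chunk
    return disks


def reassemble(disks: List[Dict[int, bytes]]) -> bytes:
    """Collect the parts from all disks and join them in their original order."""
    indexed: Dict[int, bytes] = {}
    for disk in disks:
        indexed.update(disk)
    return b"".join(indexed[i] for i in sorted(indexed))


if __name__ == "__main__":
    original = b"a binary object spread over several disks"
    disks = distribute(split_object(original), num_disks=3)
    print(reassemble(disks) == original)
```

Because every part carries its position, the parts can live on any disk on any server, and the object is still recoverable as long as all parts can be found again.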
As a basic rule, any data storage device (excluding magnetic tapes) that the average consumer can buy works in a block-based manner. Division into blocks is neither positive nor negative per se, but it does have a nasty side effect: A raw block device cannot be used effectively on its own. You could store data on it, but without a record of what was written where, you could not read the data back in a coordinated way without first scanning the entire disk for the stored information. Filesystems solve this problem. They provide the basic structure that makes a data storage device usable. The drawback, however, is that a filesystem is very closely tied to its data storage device: It cannot easily be cut into strips and spread across other disks.
Ceph as an object store bypasses this restriction by adding an administrative layer on top of the block devices it uses. Ceph still stores its data on block devices, but the individual hard drives and their filesystems are only a means to an end. Internally, Ceph manages data solely on the basis of its own algorithm and binary objects; the limits of the participating data storage devices no longer matter.
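To give a feel for what algorithm-driven placement means, the following Python sketch computes where an object should live purely from its name and the list of available devices, without any lookup table. It uses rendezvous (highest random weight) hashing as a stand-in technique; Ceph's actual placement algorithm (CRUSH) is considerably more elaborate and is not reproduced here. The function name, the device names, and the replica count are hypothetical.

```python
import hashlib
from typing import List


def place(object_name: str, devices: List[str], replicas: int = 2) -> List[str]:
    """Pick `replicas` devices for an object using rendezvous hashing.

    Every device gets a pseudo-random score derived from the device name and
    the object name; the highest-scoring devices hold the object. This is an
    illustrative stand-in, not Ceph's CRUSH algorithm.
    """
    def score(device: str) -> int:
        digest = hashlib.sha256(f"{device}:{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(devices, key=score, reverse=True)[:replicas]


if __name__ == "__main__":
    # Hypothetical device names; any unique strings would do.
    devices = [f"disk-{n}" for n in range(5)]
    for name in ("photo.jpg", "backup.tar", "log.txt"):
        print(name, "->", place(name, devices))
```

Any client that knows the device list arrives at the same answer independently, and removing a device that does not hold a given object leaves that object's placement unchanged. This is the sense in which the boundaries of individual disks stop mattering: The placement logic, not a filesystem bound to one disk, decides where data lives.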