Troubleshooting and maintenance in Ceph
First Aid
About the Quorum
This is how a quorum works: All monitoring servers talk to one another continuously during normal cluster operation, so each of them knows how many other monitoring servers are still available. If a group of monitoring servers comprises more than half the number of MONs existing in the cluster, it is considered quorate.
Normally, a Ceph cluster is a single cluster partition.
As long as that's the case, everything should be fine – it only becomes a problem the moment the monitoring servers lose touch with one another but do not actually fail themselves. This situation can happen, for example, if the network devices connecting the nodes fail. In this case, the cluster nodes no longer see one another. If a monitoring server can only see half of the total available number of MONs or fewer, it acknowledges that it is inoperable because it is no longer quorate.
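To make the "more than half" rule concrete: in a cluster with five MONs (the count here is just an assumption for the example), any partition that still contains at least three of them remains quorate, whereas a partition with only two MONs shuts itself out. On a healthy cluster, the standard monitor commands show the current quorum:

$ ceph mon stat
$ ceph quorum_status

ceph mon stat prints a one-line summary that includes the MONs currently in the quorum; ceph quorum_status returns the same information in more detail.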
The clients accessing the cluster notice this immediately: A client in Ceph must first retrieve the status of the cluster from a MON server before it communicates directly with the OSDs. If it asks a non-quorate MON, the MON sends it away. This process explains a peculiar effect that Ceph admins often observe without initially being able to figure out what is happening: If ceph health or ceph -w, or even direct access by a client, does not work and no error is returned (i.e., nothing happens), it is typically because the client is still looking for an active MON.
If you observe something like this in a Ceph cluster, you can troubleshoot the current network situation, and often enough, you can fix the problem in good time.
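If ceph health or ceph -w simply hangs, one way to find out what an individual monitor thinks of itself is to query it locally through its admin socket, which works even without a quorum. A minimal sketch, assuming the monitor is called mon.node01 and uses the default socket path (adjust both to your setup):

$ ceph daemon mon.node01 mon_status
$ ceph --admin-daemon /var/run/ceph/ceph-mon.node01.asok mon_status

Both variants are equivalent; the output reports the monitor's own state (e.g., leader, peon, probing, or electing) and which peers it can still reach, which usually points straight at the network problem.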
Down and Out
Even Ceph clusters are not spared from regular maintenance tasks: Security updates, operating system updates, and Ceph updates are required on a regular basis. Because the Ceph architecture has no real single point of failure (SPOF), this is not a problem: Every single component in Ceph is replaceable. For short tasks, the Ceph developers have built in a useful function to help with maintenance: If one OSD fails, the cluster does not write it off immediately.
Ceph OSDs have two pairs of states that can be combined: Up and Down only tell you whether the OSD is actively involved in the cluster. OSD states are also expressed in terms of cluster replication: In and Out. Only when a Ceph OSD is tagged as Out does the self-healing process kick in; Ceph then checks whether it needs to create new copies of replicas in the cluster, because replicas might have been lost through the failure of an OSD. You have about five minutes after Ceph tags an OSD Down before Ceph also counts it Out. If that's not enough time, you can customize this value in the configuration file: The parameter is mon osd down out interval, and the value is specified in seconds.
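A minimal sketch of the corresponding ceph.conf entry could look like this; the 900-second value is just an example, not a recommendation:

[mon]
  # wait 15 minutes before an OSD that is Down is also marked Out
  mon osd down out interval = 900

For the change to take effect, restart the monitors after editing the file; depending on your Ceph version, you may also be able to inject the value into the running MONs with ceph tell mon.* injectargs.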
Dying Disks, Slow Requests
One mantra of the Ceph developers is that even SATA disks are suitable for use with Ceph, instead of the more expensive SAS drives. So far, so good. However, anyone who relies on a Ceph cluster with SATA desktop disks should know one thing: Normal desktop disks are usually designed to correct every error automatically, if this is somehow possible. The moment the disk notices an error, it starts trying to iron it out internally.
With SATA desktop disks, this process can take a while. If a client accesses a PG on an OSD whose disk is currently in recovery mode, the client will notice, because the request can take a very long time to complete. ceph -w lists these requests as slow requests. A similar situation occurs with a disk that is dying but not yet so broken that the filesystem generates error messages. So, what methods exist to prevent these annoying slow requests, especially at the hardware level?
Enterprise SATA drives are characterized by a higher mean time between failures (MTBF) and significantly faster error correction; however, they are also more expensive than their desktop counterparts. If you really want to work with desktop disks, you have a couple of remedies at hand for slow requests. The ceph osd out <ID> command, for example, forces the removal of an OSD from the cluster; the client then sees an error message when it tries to write and can repeat the action later on. The MON redirects it to another OSD that hopefully works (Figure 4).
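In practice, dealing with a suspect disk could look like the following sketch; the OSD ID 12 is an assumption for the example:

$ ceph health detail   # lists slow requests and the OSDs they are blocked on
$ ceph osd tree        # shows which host carries the suspect OSD
$ ceph osd out 12      # remove the OSD from the data distribution

Once the broken disk has been replaced, ceph osd in 12 returns the OSD to service, and Ceph redistributes the data accordingly.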