Troubleshooting and maintenance in Ceph

First Aid

About the Quorum

This is how a quorum works: All monitoring servers talk to one another continuously during normal cluster operation, so each of them knows how many other monitoring servers are still available. If a group of monitoring servers that can still reach one another comprises more than half of the MONs existing in the cluster, it is considered quorate.

Normally, a Ceph cluster consists of a single partition.

As long as that's the case, everything should be fine – it only becomes a problem the moment the monitoring servers lose touch with one another but do not actually fail themselves. This situation can happen, for example, if the network devices connecting the nodes fail. In this case, the cluster nodes no longer see each other. If a monitoring server can only see half of the total number of MONs or less, it considers itself inoperable because it is no longer quorate.

The clients accessing the cluster notice this immediately: A client in Ceph must first retrieve the status of the cluster from a MON server before it communicates directly with the OSDs. If it asks a non-quorate MON, the MON sends it away. This process explains a peculiar effect that Ceph admins often observe without initially being able to figure out what is happening: If ceph health or ceph -w or even direct access by a client does not work and no error is returned directly (i.e., nothing happens), it is typically because the client is still looking for an active MON.
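Because ceph health and similar commands themselves need to reach a quorate MON, it can help to ask a monitor directly about its own view of the world through its local admin socket, which works even without a quorum. A minimal sketch, assuming a monitor named mon.a runs on the node you are logged in to and its admin socket sits in the default location:

ceph daemon mon.a mon_status

The output includes the monitor's state (e.g., leader, peon, electing, or probing) and the list of MONs it currently sees in the quorum, which quickly reveals whether a network partition is to blame.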

If you observe something like this in a Ceph cluster, troubleshoot the current network situation; often enough, you can fix the problem in good time.

Down and Out

Even Ceph clusters are not spared regular maintenance tasks: Security updates, operating system updates, and Ceph updates are required on a regular basis. Because the Ceph architecture has no real single point of failure (SPOF), this is not a problem: Every single component in Ceph is replaceable. For short maintenance tasks, the Ceph developers have built in a useful function: If one OSD fails, the cluster does not write it off immediately.

Ceph OSDs have two kinds of state that can be combined: Up and Down only tell you whether the OSD is actively involved in the cluster, whereas In and Out describe the OSD in terms of cluster replication. Only when a Ceph OSD is tagged Out does the self-healing process kick in: Ceph then checks whether it needs to create new copies of replicas in the cluster, because replicas might have been lost through the failure of the OSD. You have about five minutes after Ceph tags an OSD Down before it also counts it Out. If that's not enough time, you can customize this value in the configuration file: The parameter is mon osd down out interval, and the value is specified in seconds.
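If you want a longer grace period, a sketch of the corresponding ceph.conf entry could look like this (the 900 seconds are only an example value, not a recommendation):

[mon]
mon osd down out interval = 900

For planned maintenance, you can also tell the cluster not to mark any OSD Out at all with ceph osd set noout and revert this with ceph osd unset noout once the work is done.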

Dying Disks, Slow Requests

One mantra of the Ceph developers is that even SATA disks are suitable for use with Ceph, instead of the more expensive SAS drives. So far, so good. However, anyone who relies on a Ceph cluster with SATA desktop disks should know one thing: Normal desktop disks are usually designed to correct every error automatically if at all possible. The moment the disk notices an error, it starts trying to iron it out internally.

With SATA desktop disks, this process can take a while. If a client accesses a PG on an OSD whose disk is currently busy with this kind of internal recovery, the client notices it, because the request can take a very long time to complete. ceph -w lists these requests as Slow Requests. A similar situation occurs with a disk that is dying but not so broken that the filesystem generates error messages. So, what methods exist to prevent these annoying slow requests, especially at the hardware level?
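To see whether slow requests are piling up and which OSD is affected, you can combine Ceph's own reporting with a SMART check of the suspect disk. A rough sketch, in which osd.7 and /dev/sdc are placeholders for your own OSD ID and device:

ceph health detail
ceph daemon osd.7 dump_ops_in_flight
smartctl -a /dev/sdc

The ceph daemon call must run on the host where the OSD lives, because it talks to the local admin socket; smartctl is part of the smartmontools package.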

Enterprise SATA drives are characterized by a higher mean time between failures (MTBF) and significantly faster error correction; however, they are also more expensive than their desktop counterparts. If you really want to work with desktop disks, you have a couple of remedies at hand for slow requests. The

ceph osd out <ID>

command, for example, forces the removal of an OSD from the cluster; the client would then see an error message when it tried to write and could repeat the action later on. It would be redirected by the MON to another OSD that hopefully works (Figure 4).

Figure 4: A monitoring server that does not see enough MONs redirects clients that send requests to it.
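A minimal sketch of that workflow, in which the OSD ID 12 is only a placeholder:

ceph osd out 12
ceph -w
ceph osd in 12

ceph -w lets you watch the cluster shuffle data after the OSD has been marked Out; once the disk has been replaced or has recovered, ceph osd in returns the OSD to the replication.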
