Fixing Ceph performance problems
First Aid Kit
Network Problems
One huge error domain in Ceph is, of course, problems with the network between the client and RADOS and between the drives in RADOS. Anyone who has dealt with the subject of networks in more detail knows that modern switches and network cards are highly complex and have a large number of tweaks and features that influence the achievable performance. At the same time, Ceph itself has almost no influence on this, because there's not much that can be tuned on its network. It uses the TCP/IP stack available in the Linux kernel and relies on the existing network infrastructure to function correctly.
For the admin, this means that you have to monitor your hardware meticulously in terms of networking, especially as it applies to typical performance parameters. However, this is easier said than done. Almost all modern devices expose the relevant information through SNMP as the standard interface, but practically every manufacturer does its own thing when it comes to collecting the relevant data.
Mellanox, for example, incorporates in its application-specific integrated circuits (ASICs) a function known as What Just Happened, which helps the admin find dropped frames or other problems. Other manufacturers, such as Cisco and Juniper, also ship monitoring tools for their own hardware. However, no all-round package exists, and depending on the local setup, you often end up with a complete monitoring solution from the vendor itself. If you are careful to use switches with Linux firmware, though, you are not totally dependent on the specific solutions invented by the vendors.
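On switches with Linux firmware, and on the Ceph nodes themselves, the standard Linux tools are usually enough for a first look at the interface counters. The following is only a sketch that assumes eth0 stands in for your actual interface; both commands merely read statistics and are safe on a production system:

# Per-interface packet, error, and drop counters
ip -s link show dev eth0
# Driver-level NIC statistics, filtered for drops, errors, and discards
ethtool -S eth0 | grep -Ei "drop|err|disc"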
If you experience performance problems in Ceph, the first step is to find out where they are happening. If they occur between the clients and the cluster, different network hardware might be involved than for the communication between the OSDs. In case of doubt, it is helpful to use tools such as iPerf to work through the cluster's network links one by one and measure the results manually.
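Such a manual measurement might look like the following sketch with iperf3 (classic iperf uses slightly different options); the hostnames osd-host01 and osd-host02 are placeholders for machines in your own setup:

# Start the server end of the test on one of the OSD hosts
iperf3 -s
# From a client, measure throughput across the public network for 30 seconds
iperf3 -c osd-host01 -t 30
# Run the same test from one OSD host to another to check the cluster (replication) network
iperf3 -c osd-host02 -t 30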
Problems with Ceph
Assuming the surrounding infrastructure is working, a performance problem in Ceph very likely has its roots in one of the software components, which could affect RADOS itself; however, the kernel in the systems running Ceph might also be the culprit.
Fortunately, you have many monitoring options, especially at the Ceph level. The Ceph developers know about the complexity of performance monitoring in their solution. In case of an incident, you first have to identify the primary and secondary OSDs for each object and then search for the problem.
If you have thousands of OSDs in your system, you are facing a powerful enemy, which is why the Ceph OSDs meticulously record metrics for their own performance. Therefore, if an OSD notices that writing a binary object is taking too long, it sends a corresponding slow-write message to all OSDs in the cluster, thus decentrally storing the information at many locations.
Additionally, Ceph offers several interfaces to help you access this information. The simplest method is to run the ceph -w command on a node with an active MON. The output provides an overview of all problems currently noted in the system, including slow writes. However, this method cannot be automated sensibly.
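If you want more detail on a single OSD, its admin socket is one way to get it. The following sketch assumes osd.17 stands in for the OSD you want to examine and that you run the daemon commands on the node hosting it; like ceph -w, this is handy for manual digging but does not scale to automated monitoring:

# Summarize all current warnings, including slow or blocked requests
ceph health detail
# Show the slowest recent operations recorded by this OSD
ceph daemon osd.17 dump_historic_ops
# List the requests the OSD is processing right now
ceph daemon osd.17 dump_ops_in_flight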
Some time ago, the Ceph developers therefore added another service to their software, mgr (for manager), which collects the metric data recorded by the OSDs and makes it possible to use the data in monitoring tools such as Prometheus.
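Enabling the Prometheus exporter is quick; the sketch below assumes an otherwise default configuration, in which the module listens on port 9283 of the active mgr node:

# Activate the manager's built-in Prometheus exporter
ceph mgr module enable prometheus
# Show the endpoints the active manager currently exposes
ceph mgr services
# Fetch the metrics once by hand to verify that scraping works
curl http://<active-mgr-host>:9283/metrics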
Dashboard or Grafana
A quick overview option is provided by the Ceph developers themselves: The mgr component made its way into the object store along with the Ceph Dashboard [3], which harks back to the openAttic project and visualizes the state of the Ceph cluster in an appealing way. The Dashboard also highlights problems that Ceph itself has detected.
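Getting the Dashboard up typically takes only a few commands; the exact syntax for creating users differs between Ceph releases, so treat the following as a sketch rather than a copy-and-paste recipe:

# Enable the dashboard module on the active manager
ceph mgr module enable dashboard
# Generate a self-signed certificate so the dashboard serves HTTPS
ceph dashboard create-self-signed-cert
# Create an administrative user (recent releases read the password from a file)
ceph dashboard ac-user-create admin -i /path/to/password.txt administrator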
However, this solution is not ideal for cases in which a functioning monitoring system (e.g., Prometheus) is already running, because the alert system usually has a direct connection to it. The Ceph Dashboard usually cannot be integrated directly, but the mgr component provides a native interface to Prometheus, to which metric data from Ceph can be forwarded. Because Telegraf also supports the Prometheus API, the data can find its way into InfluxDB, if desired.
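A minimal Telegraf configuration for this detour might look like the following sketch; the hostnames, the database name, and the file path are assumptions for illustration:

# Drop a snippet for Telegraf's prometheus input and influxdb output
cat > /etc/telegraf/telegraf.d/ceph.conf <<'EOF'
[[inputs.prometheus]]
  urls = ["http://<active-mgr-host>:9283/metrics"]

[[outputs.influxdb]]
  urls = ["http://influxdb-host:8086"]
  database = "ceph"
EOF
# Dry run: scrape once and print the metrics instead of writing them to InfluxDB
telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d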
Grafana is also on board: The Grafana store offers several ready-made dashboards for the metric data that Prometheus reads from Ceph; they visualize the most important metrics for a quick overview.