Grafana and time series databases
More than a Thousand Words
Admins are rumored to feel more at ease working with text-based terminals than with graphical tools, and the command line often is better suited to classic admin tasks than a GUI because it allows scripting and direct input. For other admin tasks, though, this assessment is typically inverted: When it comes to processing measurement data and statistics, visual tools are clearly superior. In particular, data for monitoring, alerting, and trending (MAT) that comes from several sources requires visualization for meaningful analysis.
Virtually every large environment, whether a container platform or a public cloud, depends heavily on MAT. Only MAT provides reliable clues about the health and usage of your systems and tells you when you need to add new hardware because the platform is fully utilized. Grafana [1] targets admins who need MAT analysis.
Match Winner
Grafana cooperates with various back ends and can pull the data you want to display from many sources. Its developers refer to this configuration as data-driven architecture. In place of classic event monitoring, Grafana uses a time-series-based principle, in which monitoring is a spin-off of the need to collect various metrics continuously. For example, if you operate a multinode cluster for MySQL based on Galera, you will want the load on all database back ends to be equally high. If system load on one of the back ends suddenly drops off, it is a strong indicator that something is wrong.
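The Galera example can be sketched in a few lines of Python. This is only an illustration of the idea, not something Grafana itself runs: the function name find_lagging_nodes, the node names, and the tolerance value are all hypothetical, and the sketch assumes you already have one recent load sample per back end.

```python
from statistics import median


def find_lagging_nodes(loads, tolerance=0.5):
    """Return the names of nodes whose load is far below the cluster median.

    loads:     mapping of node name -> most recent load sample
    tolerance: assumed threshold; a node is flagged when its load falls
               below (1 - tolerance) * median of all nodes
    """
    if not loads:
        return []
    floor = (1 - tolerance) * median(loads.values())
    return sorted(name for name, load in loads.items() if load < floor)


# Hypothetical samples from a three-node cluster: db3 has dropped off.
samples = {"db1": 4.1, "db2": 3.9, "db3": 0.3}
print(find_lagging_nodes(samples))  # ['db3']
```

Comparing against the median rather than the mean keeps a single collapsed node from dragging the reference value down with it.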
Unlike typical incident monitoring in the style of Nagios, Grafana bases its conclusions on performance data. Although Grafana is not primarily about monitoring, it does help you prepare the corresponding time series from monitoring systems in an easily interpretable way. In this article, I first highlight the key features of Grafana and then present the most important back ends from which it can draw its information.
New Monitoring Paradigm
To understand the motivation behind Grafana, you need to take a small excursion into the world of monitoring. A paradigm change has taken place in the past few years that, in turn, is closely linked to cloud computing. Monitoring a cloud is different from monitoring conventional IT platforms, which to a certain degree are static and, after the environment is set up, change only in the details. The standard tools for monitoring are well known to experienced admins: Nagios, Icinga, Check_MK, and various other solutions of the same design.
Monitoring in conventional environments relies on events. If a service stops running on a server, the monitoring system notices and raises an alert. Trending plays a minor role in this classic scenario, because the workload of such a setup will tend to grow evenly, giving you sufficient time to purchase new hardware. That said, even conventional monitoring solutions cannot completely do without trending. For example, PNP4Nagios [2] uses checks to collect performance data and then displays the data in a graphical format directly in the Nagios web interface.
This arrangement no longer works for a public cloud, because it is not predictable when the platform will need to scale horizontally by adding new servers. For example, a new customer with a huge workload could easily set up an account in a typical public cloud and start generating a massive load.
Trending Becomes More Important
For newer types of platforms, trending therefore plays a bigger role than for their conventional predecessors. PNP4Nagios or comparable solutions look more like stop-gaps in these cases. They normally store their measurement data in the background in a relational database, usually MySQL, which is perfectly suited for classic, event-based monitoring, in which incidents become separate entries in a table. Because you are interested in the individual events in this scenario and the data is stored exactly that way in MySQL, you will have no problem retrieving or processing the data.
Trending changes the rules of the game because it does not focus on events, but on the evolution of performance data over time. In the case of trending, you no longer want to know whether or not a specific service was working at a given time; instead, you are interested in the central health values of the systems (e.g., CPU load and RAM usage). If their values remain high over a period of time, new hardware will be required for load balancing.
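To make the difference from event checks concrete, the following Python sketch tests whether a metric has stayed high over a period of time instead of looking at a single moment. It is a minimal illustration under stated assumptions: the function name sustained_high, the 90 percent threshold, and the three-minute window are hypothetical, and the samples are assumed to be (timestamp in seconds, value) pairs sorted by time.

```python
def sustained_high(samples, threshold, min_duration):
    """Return True if the value stays at or above threshold for at least
    min_duration seconds without interruption.

    samples: list of (timestamp_seconds, value) pairs, sorted by timestamp
    """
    start = None  # timestamp at which the current high run began
    for ts, value in samples:
        if value >= threshold:
            if start is None:
                start = ts
            if ts - start >= min_duration:
                return True
        else:
            start = None  # the run was interrupted; start over
    return False


# Hypothetical CPU utilization in percent, one sample per minute.
cpu = [(0, 50), (60, 92), (120, 95), (180, 91), (240, 93)]
print(sustained_high(cpu, threshold=90, min_duration=180))  # True
```

A short spike resets nothing but itself: a single sample above the threshold followed by a normal one does not satisfy the duration requirement, which is exactly the distinction between an event and a trend.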
However, generating this information from individual events stored in a MySQL database would require many individual database queries and thus cause a correspondingly high load. Also, it takes quite a while for MySQL or a similar database to answer these queries.