Monitoring, alerting, and trending with the TICK Stack
Cloud Radar
Anyone who needs to monitor large IT setups (e.g., a cloud) faces a challenge: Nodes come and go, and not every departure is a failure that needs to trigger an alert. In addition to monitoring and alerting, trending is also necessary; in many cases, it is the only way you can know when to add hardware to compensate for an increased base load.
Soon it becomes clear that typical monitoring solutions such as Nagios or Zabbix will not do. If you look into the subject in more detail, you end up with time series databases. In this article, I introduce the four components of the TICK Stack [1] (Telegraf, InfluxDB, Chronograf, and Kapacitor) and explain their respective strengths.
Deficits and Alternatives
The most prominent representative of this genre is probably Prometheus [2]. Launched as an internal tool by SoundCloud, the program and the additional components attached to it are now popular, but power users complain: In many respects Prometheus is missing functions, and design decisions were made that are not a good match for many setups.
An example is Prometheus Node Exporter, which is designed to collect metrics from the systems in the environment but often does not do so in the way the administrator wants (see the "Prometheus Add-ons" article in the previous issue [3]). Moreover, a Prometheus server cannot store metrics redundantly or in a distributed storage system.
If your setup becomes too large for a single Prometheus instance, you have to split it, thus possibly canceling out one of the biggest advantages of a monitoring, alerting, and trending (MAT) system – namely, the single point of administration.
Additionally, Prometheus slows down as the volume of data increases. The program is fine for short and mid-term trending, but if you want to keep trending data safe for years, you will reach the limits of Prometheus as it loses much of its speed.
This situation is surprising, considering that the software only stores measured values as numbers. Other solutions that store log messages as strings in the database can handle far larger amounts of data. To cut a long story short: You have good reasons not to build your platform for MAT on the current king of the hill, Prometheus.
If you are looking for an alternative to Prometheus, you might find InfluxDB, a component of the TICK Stack, useful. At the end of the article, I show that Prometheus and the TICK Stack are not necessarily mutually exclusive.
InfluxDB
The heart of the TICK Stack is the InfluxDB time series database [4]. Unlike classical databases, time series databases do not use tables as the central element for data management; instead, they align all stored data in a time stream.
Time series databases are especially practical for the study of trends, with a focus on the value of a parameter over time, so you can guess how this value might develop in the future. For example, if you know how RAM usage has increased over the last three months, you can recognize when you might have to buy additional hardware to avoid resource bottlenecks.
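As a rough illustration of this kind of trend estimate, the following Python sketch fits a straight line to RAM usage samples and extrapolates when a chosen limit would be reached. The function names, the sample values, and the 90 percent limit are assumptions made for this example; real usage rarely grows perfectly linearly, so treat the result as a hint rather than a forecast.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (day offset, RAM usage in percent)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def day_limit_reached(samples: Sequence[Sample], limit: float) -> Optional[float]:
    """Day offset at which the fitted line crosses limit, or None if usage is not growing."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical data: usage sampled every 30 days over three months.
ram_usage = [(0, 40.0), (30, 50.0), (60, 60.0), (90, 70.0)]
print(day_limit_reached(ram_usage, limit=90.0))
```

With these sample values, the fitted line crosses 90 percent at about day 150, that is, roughly two months after the last measurement.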
A query of this kind is problematic for traditional databases: They need to browse through tables and rows and compare them with a time you specify as a parameter. If an entry falls within the specified period, it is part of the result. A graph can only be drawn at the end. The process is time-consuming, because the format of the result is different from the format used internally by the database, and the transposition of the data required for this process is resource intensive.
Time series databases, on the other hand, store the data exactly as you would like to view it later, thus saving a great deal of overhead. Because many system parameters can be displayed with reference to measured data, monitoring is a kind of by-product of trending in MAT environments, at least if you add components to the time series database that can trigger alarms according to specific metrics.
InfluxDB is the time series database in the TICK Stack (Figure 1). The tool differs noticeably from Prometheus in many areas. One big difference is that InfluxDB can use not only numbers as measured values, but also strings, which is especially practical if you want to log events. With Prometheus, you would need to take a roundabout route: If a message containing the word ERROR appears in the kernel log, Prometheus can display "Log messages with ERROR"; however, it cannot display the details of the problem, such as the error message itself. InfluxDB, on the other hand, can.
Fundamental Differences
InfluxDB differs from Prometheus in other ways, too. Starting with the basics, Prometheus collects metrics directly from hosts (the pull principle). InfluxDB, on the other hand, expects a separate process to deliver values (the push principle). Both principles have their fans, and depending on the context, one or the other might be more suitable. InfluxDB cannot claim a fundamental advantage; nevertheless, you need to keep in mind how you collect data when planning your future setup.
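The contrast between the two collection models can be sketched in a few lines of Python. The first function renders values the way a monitored host might expose them for a server to fetch (pull); the second builds, without sending it, the kind of HTTP request a collector process could use to deliver values to the database (push). The URL, the database name metrics, and the use of the /write?db=... endpoint from the InfluxDB 1.x HTTP API are assumptions here; other InfluxDB versions may use a different write path.

```python
import urllib.parse
import urllib.request
from typing import Dict, Iterable


def render_for_scrape(metrics: Dict[str, float]) -> str:
    """Pull: the host only exposes current values; the server fetches this text."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def build_push_request(
    base_url: str, database: str, lines: Iterable[str]
) -> urllib.request.Request:
    """Push: a collector on the host prepares a write request for the database."""
    query = urllib.parse.urlencode({"db": database})
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/write?{query}", data=body, method="POST"
    )


# Pull side: what a scraper would read from the host.
print(render_for_scrape({"node_load1": 0.42}), end="")

# Push side: what a collector would send; the request is built but not sent.
request = build_push_request(
    "http://localhost:8086", "metrics", ["load,host=node1 load1=0.42"]
)
print(request.get_method(), request.full_url)
```

In practice, the planning question is where the schedule lives: With pull, the server decides when to fetch; with push, every host needs a running collector that knows where to send its data.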
The two solutions also differ in the way you query data from the database (e.g., to interpret it graphically). Prometheus uses its own query language (PromQL), which is specially designed for use in MAT solutions and is not based on SQL, whereas InfluxDB uses an SQL-like language (InfluxQL) that will feel familiar if you already know SQL. Although PromQL is not very difficult to learn, it would be wrong to say you could use it without any preparation.
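For comparison, the following Python sketch assembles an InfluxQL query for a RAM trend over the last three months and places a PromQL expression with a similar intent next to it. The measurement mem, the field used_percent, and the Prometheus metric name are assumptions for illustration, and the helper does no escaping of identifiers.

```python
def build_trend_query(
    measurement: str, field: str, lookback: str = "90d", bucket: str = "1d"
) -> str:
    """Build an InfluxQL query returning the mean of a field per time bucket."""
    return (
        f'SELECT mean("{field}") FROM "{measurement}" '
        f"WHERE time > now() - {lookback} GROUP BY time({bucket})"
    )


INFLUXQL_EXAMPLE = build_trend_query("mem", "used_percent")

# A PromQL expression with a similar intent (average of a gauge over one day);
# the metric name is hypothetical.
PROMQL_EXAMPLE = "avg_over_time(node_memory_used_percent[1d])"

print(INFLUXQL_EXAMPLE)
print(PROMQL_EXAMPLE)
```

The InfluxQL string reads like a conventional SELECT statement with a time-based GROUP BY, whereas the PromQL expression applies a function to a metric selected over a time range.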