Time-series-based monitoring with Prometheus

Lighting Up the Dark

Prometheus Metrics

Prometheus essentially supports counters and gauges as metrics; they are stored as entries in the time series database. A counter metric,

http_requests_total{status="200",method="GET"}@1434317560938 = 94355

comprises an intuitive name, a 64-bit timestamp, and a float64 value. A metric can also include labels, stored as key-value pairs, which provide a simple way to create multidimensional metrics. The example above adds the status and method properties to the HTTP request count. In this way, you can specifically look for clusters of HTTP 500 errors.
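
To illustrate, a counter with labels might look roughly like this in the text-based output of a monitored target (a sketch of the exposition format; the names and numbers are merely examples):

# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{status="200",method="GET"} 94355
http_requests_total{status="500",method="GET"} 128
http_requests_total{status="200",method="POST"} 1034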

Take care: Each combination of label values generates its own time series, which can dramatically increase the space requirements of the time series database. The Prometheus authors therefore advise against using labels to store properties with high cardinality. As a rule of thumb, the number of possible values should not exceed 10. For high-cardinality data, such as customer numbers, the developers recommend analyzing the logfiles instead, or you could operate several specialized Prometheus instances.

At this point, note that a Prometheus server with a high number of metrics calls for a bare-metal server with plenty of memory and fast hard drives. For smaller projects, the monitoring software also runs on less powerful hardware: Prometheus monitors my VDSL connection at home on a Raspberry Pi 2 [19].

Prometheus does not store metrics indefinitely. The storage.local.retention command-line option specifies how long Prometheus keeps the acquired data before deleting it. The default is two weeks. If you want to store data for a longer period of time, you can forward individual data series via federation to a Prometheus server with a longer retention period.
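
As a rough sketch of such a setup (the 90-day retention value, the host name prometheus-short, and the job selector are assumptions), the long-term server is started with a longer retention period and pulls selected series from the short-term instance via its /federate endpoint:

prometheus -config.file prometheus.yml -storage.local.retention 2160h0m0s

# Excerpt from prometheus.yml on the long-term server
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
    static_configs:
      - targets:
        - 'prometheus-short:9090'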

The options for storing data in other time series databases, such as InfluxDB or OpenTSDB, are in a state of flux. The developers are currently replacing the existing write-only support with a generic interface that will also enable read access. For InfluxDB, for example, a utility with read and write functionality has already been announced.

The official best practices published by the Prometheus authors [20] are very useful reading and will help you avoid some problems up front. If you provide Prometheus metrics yourself, make sure you specify time values in seconds and data throughput in bytes.
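
For example, a backup job would export its runtime in seconds and the volume of data in bytes, leaving any conversion to minutes or gigabytes to the query (the metric names here are purely illustrative):

backup_duration_seconds 42.7
backup_size_bytes 1.073741824e+09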

Data Retrieval with PromQL

The PromQL query language retrieves data from the Prometheus time series database. The easiest way to experiment with PromQL is in the web interface (Figure 2). In the Graph tab you can test PromQL queries and look at the results. A simple query for a metric provided by node_exporter, such as

node_network_receive_bytes

has several labels, including, among other things, the names of the network interfaces. If you only want the values for eth0, you can narrow the query down to:

node_network_receive_bytes{device='eth0'}

The following query returns all acquired values for a metric for the last five minutes:

node_network_receive_bytes{device='eth0'}[5m]

At first glance, the query

rate(node_network_receive_bytes{device='eth0'}[5m])

looks similar, but far more is going on here: Prometheus applies the rate() function to the data from the last five minutes and calculates the average data rate per second for that period.

This kind of data aggregation is one of Prometheus' main tasks. The monitoring software assumes that instrumented applications and exporters only return counters or measured values. Prometheus then handles all aggregations.

The following query determines the data rates of all network interfaces, broken down by individual interface:

sum(rate(node_network_receive_bytes[5m])) by (device)

These examples only scratch the surface of PromQL's possibilities [21]. You can apply a wide variety of functions to metrics, including statistical functions such as predict_linear() and holt_winters().
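
For example, predict_linear() can warn about a filesystem that is about to fill up: the following query (a sketch based on the node_exporter metrics of the time) extrapolates the trend of the free disk space over the last hour and is true if it drops below zero within the next four hours:

predict_linear(node_filesystem_free{mountpoint="/"}[1h], 4 * 3600) < 0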

Sound the Alarm

Prometheus uses alerting rules to determine alert states. Such a rule contains a PromQL expression, an optional duration, and labels and annotations, which the Alertmanager picks up for downstream processing. Listing 1 shows an alerting rule that triggers if the PromQL expression up == 0 is true for a monitoring target for more than five minutes. As the example shows, you can also use template variables in the annotations.

Listing 1

Alerting Rule

# Monitoring target down for more than five minutes
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

If an alerting rule is true, Prometheus triggers an alert (the alert is firing). If you have specified an Alertmanager, Prometheus forwards the details of the alert to it via HTTP. It does this for as long as the alert condition persists, repeating the transmission every time the rule is evaluated; thus, letting Prometheus notify you directly would not be a good idea.
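
To give you an idea of the wiring, the following sketch assumes that the rule from Listing 1 is stored in a file called alert.rules, referenced from prometheus.yml, and that the Alertmanager is listening on localhost:9093 (both the file name and the address are assumptions):

# Excerpt from prometheus.yml
rule_files:
  - 'alert.rules'

prometheus -config.file prometheus.yml -alertmanager.url http://localhost:9093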
