Time-series-based monitoring with Prometheus
Lighting Up the Dark
Prometheus Metrics
Prometheus basically supports counters and gauges as metrics; they are stored as entries in the time series database. A counter metric, such as
http_requests_total{status="200",method="GET"}@1434317560938 = 94355
comprises an intuitive name, a 64-bit time stamp, and a float64 value. A metric can also include labels, stored as key-value pairs, which offer an easy way to create a multidimensional metric. The example above extends the HTTP request counter with the status and method properties. In this way, you can specifically look for clusters of HTTP 500 errors.
Take care: Each combination of label values generates its own time series, which dramatically increases the time series database space requirements. The Prometheus authors therefore advise against using labels to store properties with high cardinality. As a rule of thumb, the number of possible values should not exceed 10. For data with high cardinality, such as customer numbers, the developers recommend analyzing the logfiles instead, or you could operate several dedicated Prometheus instances.
At this point, note that for a Prometheus server with a large number of metrics, you should plan for a bare-metal server with plenty of memory and fast hard drives. For smaller projects, the monitoring software can also run on less powerful hardware: Prometheus monitors my VDSL connection at home on a Raspberry Pi 2 [19].
Prometheus only stores metrics for a limited time. The storage.local.retention command-line option states how long it waits before deleting the acquired data; the default is two weeks. If you want to store data for a longer period of time, you can forward individual data series via federation to a Prometheus server with a longer retention period.
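As a rough sketch of what such a federation setup could look like, the long-term Prometheus server scrapes the /federate endpoint of the short-term server; the job name, the match[] selector, and the target address below are only placeholders:
# prometheus.yml on the long-term server (placeholder names)
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
    static_configs:
      - targets:
        - 'short-term-prometheus:9090'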
The options for storing data in other time series databases, such as InfluxDB or OpenTSDB, are in a state of flux. The developers are also currently replacing the existing write functions with a generic approach that will also enable read access. For example, a utility with read and write functionality has been announced for InfluxDB.
The official best practices document by the Prometheus authors [20] is very useful reading that will help you avoid some problems up front. If you provide Prometheus metrics yourself, make sure you specify time values in seconds and data throughput in bytes.
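For example, a latency metric would follow this convention as something like http_request_duration_seconds rather than a milliseconds value, and throughput as a _bytes metric such as the node_network_receive_bytes metric used later in this article (the first name is only illustrative).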
Data Retrieval with PromQL
The PromQL query language retrieves data from the Prometheus time series database. The easiest way to experiment with PromQL is in the web interface (Figure 2). In the Graph tab, you can test PromQL queries and look at the results. A simple query for a metric provided by node_exporter, such as
node_network_receive_bytes
returns results with several properties, including, among other things, the names of the network interfaces. If you only want the values for eth0, you can change the query to:
node_network_receive_bytes{device='eth0'}
The following query returns all acquired values for a metric for the last five minutes:
node_network_receive_bytes{device='eth0'}[5m]
At first glance, the query
rate(node_network_receive_bytes{device='eth0'}[5m])
looks similar, but far more is going on here: Prometheus applies the rate() function to the data for the last five minutes, which determines the data rate per second for the period.
This kind of data aggregation is one of Prometheus' main tasks. The monitoring software assumes that instrumented applications and exporters only return counters or measured values. Prometheus then handles all aggregations.
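A related function, increase(), makes the difference concrete: applied to the same metric, it returns the total number of bytes received during the five-minute window, whereas rate() returns the per-second average over that window:
increase(node_network_receive_bytes{device='eth0'}[5m])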
The per-second data rates of all network interfaces, broken down by individual interface, are found with the query:
sum(rate(node_network_receive_bytes[5m])) by (device)
These examples only scratch the surface of PromQL's possibilities [21]. You can apply a wide variety of functions to metrics, including statistical functions such as predict_linear() and holt_winters().
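For example, predict_linear() can extrapolate a trend: assuming a node_exporter filesystem metric named node_filesystem_free (the exact metric name depends on your node_exporter version), the following expression uses a linear extrapolation of the last hour to check whether the root filesystem would run out of space within the next four hours:
predict_linear(node_filesystem_free{mountpoint="/"}[1h], 4 * 3600) < 0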
Sound the Alarm
Prometheus uses alerting rules to determine alert states. Such a rule contains a PromQL expression and an optional duration, as well as labels and annotations, which the alert manager picks up for downstream processing. Listing 1 shows an alert rule that returns true if the PromQL expression up == 0 is true for a monitoring target for more than five minutes. As the example shows, you can also use variables in the annotations.
Listing 1
Alerting Rule
# Monitoring target down for more than five minutes
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
If an alert rule is true, Prometheus triggers an alarm (the alert is firing). If you have specified an alert manager, Prometheus forwards the details of the alert to it via HTTP. This happens for as long as the alert condition exists and is repeated every time the rule is evaluated; thus, letting Prometheus notify you directly would not be a good idea.
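With the Prometheus 1.x command-line flag style used elsewhere in this article, the alert manager is typically specified at startup with a flag along the lines of -alertmanager.url (the URL below is a placeholder):
-alertmanager.url=http://localhost:9093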