Getting started with Prometheus
Watching the Shop
Whether you manage several machines spread across multiple clouds or just a home server for hobby purposes, one thing matters: Stuff has to work. If it doesn't, you need a way to find out promptly. Checking everything manually is daunting and borderline senseless.
Observability is all about having a complete overview of, and an ever-watchful eye on, your infrastructure. The legacy approach of installing agents here and there has shown its flaws, especially when it comes to observing infrastructure data that the server application isn't designed to monitor.
A Metrics-Oriented Approach
In Greek mythology, Prometheus is the Titan who gave fire to humanity, but in this article, Prometheus is an open source application that, since its 1.0 release in 2016, has offered a new way to monitor and observe infrastructure: It collects metrics from its monitoring targets on a schedule and stores them in a time series database (TSDB). (See the "What Is a Metric?" box.) Once the metrics are stored, Prometheus lets you query and evaluate them in depth, and in a timely manner, through its own query language, PromQL [1]. Such queries can be used to create alert rules that notify you whenever a target's behavior drifts from what is expected.
What Is a Metric?
A metric is a value produced by a system at a specific instant. Typically, the same measurement is collected repeatedly over a period of time. A classic example would be noting the temperature and the time whenever you read your home thermometer. At the end of the day, you could plot that data with time on the x-axis and temperature on the y-axis and have yourself a nice graph. This approach is also tremendously useful for observing IT infrastructure because it offers insight into how systems behave and highlights emerging problems.
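To make the thermometer example concrete, here is a minimal Python sketch of a toy in-memory time series store (not how the Prometheus TSDB is actually built). Each series is identified by a metric name plus a set of labels, and each reading appends a (timestamp, value) pair. The metric name home_temperature_celsius, the room label, and the readings are made up for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, FrozenSet, List, Tuple

# A series is identified by its metric name plus its full label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]
Sample = Tuple[float, float]  # (Unix timestamp, value)


class TinyTSDB:
    """Toy in-memory time series store; illustration only."""

    def __init__(self) -> None:
        self._series: DefaultDict[SeriesKey, List[Sample]] = defaultdict(list)

    @staticmethod
    def _key(name: str, labels: Dict[str, str]) -> SeriesKey:
        return (name, frozenset(labels.items()))

    def append(self, name: str, labels: Dict[str, str], ts: float, value: float) -> None:
        # Each reading becomes one more (timestamp, value) point in its series.
        self._series[self._key(name, labels)].append((ts, value))

    def samples(self, name: str, labels: Dict[str, str]) -> List[Sample]:
        # .get() does not create an empty series for unknown keys.
        return list(self._series.get(self._key(name, labels), []))


if __name__ == "__main__":
    db = TinyTSDB()
    room = {"room": "living_room"}  # hypothetical label
    # Three readings taken an hour apart (made-up values).
    for ts, temp in [(0, 19.5), (3600, 20.0), (7200, 21.5)]:
        db.append("home_temperature_celsius", room, ts, temp)
    for ts, temp in db.samples("home_temperature_celsius", room):
        print(ts, temp)  # x-axis: time, y-axis: temperature
```

Keying on the full label set means two thermometers in different rooms produce two separate series under the same metric name.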
As the name suggests, PromQL is a query language that can be used to extract information from the Prometheus TSDB. A typical query combines a metric name (what is measured) with one or more labels (which target or dimension the value belongs to). The following example extracts the free disk space of target_server:
node_filesystem_free_bytes{instance="target_server:9100"}
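The label selection in the query above can be pictured as a filter over series. The following Python sketch models only PromQL's equality matcher (=), including the PromQL rule that a label a series lacks counts as the empty string. The function name select, the host other_server, and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# (metric name, labels, value)
Sample = Tuple[str, Dict[str, str], float]


def select(samples: List[Sample], name: str, matchers: Dict[str, str]) -> List[Sample]:
    """Keep samples with the given metric name whose labels equal every matcher.

    Only the equality matcher (=) is modeled. As in PromQL, a label that a
    series does not have is treated as having the empty string as its value.
    """
    return [
        sample for sample in samples
        if sample[0] == name
        and all(sample[1].get(label, "") == value for label, value in matchers.items())
    ]


samples: List[Sample] = [  # made-up values
    ("node_filesystem_free_bytes", {"instance": "target_server:9100", "mountpoint": "/"}, 8.0e11),
    ("node_filesystem_free_bytes", {"instance": "other_server:9100", "mountpoint": "/"}, 2.0e11),
    ("node_cpu_seconds_total", {"instance": "target_server:9100", "cpu": "0", "mode": "idle"}, 71039.6),
]

# Rough equivalent of node_filesystem_free_bytes{instance="target_server:9100"}
for sample in select(samples, "node_filesystem_free_bytes", {"instance": "target_server:9100"}):
    print(sample)
```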
Metrics can be combined with specific functions or arithmetic operators. For example, you could divide node_filesystem_free_bytes by node_filesystem_size_bytes and multiply the result by 100 to get the percentage of free disk space.
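In PromQL, such a ratio could be written as 100 * node_filesystem_free_bytes / node_filesystem_size_bytes. When both operands are instant vectors, PromQL pairs up series whose label sets are identical (ignoring the metric name) and drops any series without a partner. The Python sketch below mimics that one-to-one matching; the helper names and values are hypothetical, and a zero denominator is simply skipped rather than producing Inf or NaN as PromQL would.

```python
from typing import Dict, FrozenSet, Tuple

# A label set with the metric name already dropped, as PromQL does for matching.
LabelSet = FrozenSet[Tuple[str, str]]
Vector = Dict[LabelSet, float]


def label_set(labels: Dict[str, str]) -> LabelSet:
    return frozenset(labels.items())


def divide(lhs: Vector, rhs: Vector) -> Vector:
    """One-to-one vector division: pair elements with identical label sets."""
    result: Vector = {}
    for labels, numerator in lhs.items():
        denominator = rhs.get(labels)
        if denominator is None:
            continue  # no matching series on the right: dropped, as in PromQL
        if denominator == 0:
            continue  # PromQL would return +Inf or NaN; this sketch skips it
        result[labels] = numerator / denominator
    return result


def scale(vector: Vector, factor: float) -> Vector:
    """Multiplying by a scalar applies to every element of the vector."""
    return {labels: value * factor for labels, value in vector.items()}


root = label_set({"instance": "target_server:9100", "mountpoint": "/"})
tmp = label_set({"instance": "target_server:9100", "mountpoint": "/tmp"})
free_bytes = {root: 25.0, tmp: 10.0}  # made-up values
size_bytes = {root: 100.0}            # no /tmp series, so /tmp has no partner
percent_free = scale(divide(free_bytes, size_bytes), 100)
print(percent_free)  # only the root mount point remains, with 25.0
```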
A detailed introduction to PromQL is available on the official Prometheus website [1].
Monitoring
In a realistic scenario, your goal will be to monitor a target machine hosting a web application that is connected to a local relational database management system (MariaDB, in this example). The requirement is to visualize information and be alerted if:
- Disk space of the host machine is running out.
- HTTP requests to the application are not being fulfilled.
- Database service becomes unreachable.
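In Prometheus, requirements like these become alerting rules: a PromQL condition plus an optional for duration during which the condition must keep holding. While it waits, the alert is pending; once the duration has elapsed, it is firing. The sketch below models that state machine for a single series, such as a free-disk-space percentage. The class name, the 10 percent threshold, and the five-minute duration are assumptions for illustration, not values from this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Single-series sketch of an alert whose condition is value < threshold."""

    threshold: float      # e.g. 10 (% of disk space left); hypothetical
    for_seconds: float    # how long the condition must hold before firing
    state: str = "inactive"
    active_since: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        if value >= self.threshold:
            # Condition is false: go back to inactive and forget the start time.
            self.state, self.active_since = "inactive", None
        else:
            if self.active_since is None:
                self.active_since = now  # condition has just become true
            held = now - self.active_since
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


demo = ThresholdAlert(threshold=10.0, for_seconds=300)
# (evaluation time in seconds, free-space percentage): made-up readings
for now, pct in [(0, 50.0), (60, 8.0), (200, 7.0), (360, 6.0), (420, 20.0)]:
    print(now, pct, demo.evaluate(pct, now))
```

The pending phase keeps a single short dip from triggering a notification; only a condition that persists for the whole duration fires.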
Prometheus collects metrics by sending HTTP GET requests to its configured targets at a regular interval (every 15 seconds in the sample configuration shipped with Prometheus; the built-in default is one minute). It expects the response in a specific text format (Listing 1), exposed either natively by the application or by a companion service, which Prometheus calls an exporter. (See the "Our Name Is Exporters, for We Are Many" box.)
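The pull model can be sketched in a few lines of Python. This is a simplified, sequential loop, not Prometheus's actual scraper (which scrapes many targets concurrently, enforces timeouts, and keeps intervals aligned). The fetch function is injectable so the loop can be tried without a live target, and a failed request is recorded with an up flag of 0, mirroring the up metric Prometheus generates for every target. The names scrape and http_get are hypothetical.

```python
import time
import urllib.request
from typing import Callable, List, Tuple

ScrapeResult = Tuple[float, int, str]  # (timestamp, up flag, response body)


def http_get(url: str, timeout: float = 10.0) -> str:
    """Plain HTTP GET returning the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def scrape(url: str, interval: float = 15.0, rounds: int = 3,
           fetch: Callable[[str], str] = http_get) -> List[ScrapeResult]:
    """Pull the target `rounds` times, sleeping `interval` seconds in between."""
    results: List[ScrapeResult] = []
    for i in range(rounds):
        ts = time.time()
        try:
            results.append((ts, 1, fetch(url)))
        except OSError:  # urllib's URLError and socket timeouts subclass OSError
            results.append((ts, 0, ""))  # unreachable: like up == 0
        if i < rounds - 1:
            time.sleep(interval)
    return results


# Against a real target (not run here):
# for ts, up, body in scrape("http://target_server:9100/metrics", rounds=2):
#     print(ts, up, len(body))
```

Because the sleep follows the request, this loop drifts by however long each fetch takes, which is one reason a real scraper schedules intervals differently.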
Listing 1
Metrics
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/nvme0n1p1",fstype="vfat",mountpoint="/"} 7.7317074944e+11
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/tmp"} 1.6456810496e+10
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 71039.6
node_cpu_seconds_total{cpu="0",mode="iowait"} 54.79
node_cpu_seconds_total{cpu="0",mode="irq"} 865.53
node_cpu_seconds_total{cpu="0",mode="nice"} 187.19
node_cpu_seconds_total{cpu="0",mode="softirq"} 1029.12
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 2991.27
node_cpu_seconds_total{cpu="0",mode="user"} 4890.72
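Each non-comment line in Listing 1 has the shape name{label="value",...} value, optionally followed by a timestamp, while # HELP and # TYPE lines carry documentation and the metric type. The following Python parser is a minimal sketch of that text format under simplifying assumptions: label values must not contain a closing brace, and escape sequences inside label values are left unprocessed. The function name parse_exposition is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> Tuple[Dict[str, str], List[Sample]]:
    """Return ({metric: type}, [(name, labels, value), ...])."""
    types: Dict[str, str] = {}
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)
            # "# TYPE <metric> <type>" records the type; HELP and other comments are skipped.
            if len(parts) == 4 and parts[1] == "TYPE":
                types[parts[2]] = parts[3]
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            raise ValueError(f"unparsable line: {line!r}")
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        # float() also accepts the format's special values such as +Inf and NaN.
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return types, samples


types, samples = parse_exposition('# TYPE up gauge\nup{job="node"} 1\n')
print(types, samples)  # {'up': 'gauge'} [('up', {'job': 'node'}, 1.0)]
```

For anything beyond experimentation, the official prometheus_client package ships a complete parser (prometheus_client.parser.text_string_to_metric_families).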
Our Name Is Exporters, for We Are Many
Prometheus expects metrics to be exposed by an HTTP endpoint and in a specific text format. Many applications are natively configurable to expose Prometheus-compliant metrics (e.g., HAProxy), but many others are not.
Luckily, a huge number of official and community-driven exporters are available: Each one connects to the service you would like to monitor and exposes its information in the correct format. A comprehensive list of these exporters can be found on the official Prometheus website [2].
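To see how little an exporter needs to do, here is a minimal Python sketch of one that exposes a single gauge, the free space on a filesystem, in the text format Prometheus expects. The metric name demo_disk_free_bytes, the path label, and port 8000 are arbitrary choices for this example; a production exporter would more likely use an official client library such as prometheus_client.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer

METRIC = "demo_disk_free_bytes"  # hypothetical metric name


def render_metrics(path: str = "/") -> str:
    """Build a response body in the Prometheus text exposition format."""
    free = shutil.disk_usage(path).free
    return (
        f"# HELP {METRIC} Free disk space in bytes (demo exporter).\n"
        f"# TYPE {METRIC} gauge\n"
        f'{METRIC}{{path="{path}"}} {free}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        # Content type of the text-based exposition format, version 0.0.4.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering GET /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the server; a request to http://localhost:8000/metrics should then return three lines shaped like those in Listing 1. The metric values are computed at request time, which is the usual exporter pattern: Prometheus decides when to look, and the exporter simply answers.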
Node Exporter for Machine Metrics
In the example in this article (Figure 1), I don't start the deployment with the Prometheus service itself but instead start Prometheus once all of the monitoring targets are ready.
The first step is to install an exporter. Node Exporter (no relation to Node.js; like most official exporters, it is written in Go) is meant to be installed on a Linux machine, where it runs as a local process and gathers system information such as free memory, network usage, and disk space.
To begin, install it on the web app hosting machine:
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz -O node_exporter.tar.gz
tar -zxf node_exporter.tar.gz
cd node_exporter-1.2.2.linux-amd64
./node_exporter &
TCP port 9100 will be opened, which you can verify by cURLing it:
curl http://target_server:9100/metrics
The output will be a (rather long) list of metric names, labels, and values that describe the state of target_server at this instant.
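If you would rather narrow that output down from a script than scroll through it, a minimal Python equivalent of piping curl into grep could look like the following. fetch_metrics and filter_metrics are hypothetical helper names, and target_server is the placeholder host used throughout this article.

```python
import urllib.request
from typing import List


def fetch_metrics(url: str, timeout: float = 10.0) -> str:
    """Download the /metrics page as text, like curl does."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def filter_metrics(text: str, prefix: str) -> List[str]:
    """Keep sample lines whose metric name starts with `prefix`.

    # HELP and # TYPE lines begin with '#', so a metric-name prefix never
    matches them. Note that this is a prefix match, like grep '^prefix'.
    """
    return [line for line in text.splitlines() if line.startswith(prefix)]


# Against the running Node Exporter (not run here):
# for line in filter_metrics(fetch_metrics("http://target_server:9100/metrics"),
#                            "node_filesystem_avail_bytes"):
#     print(line)
```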