Storage monitoring with Grafana
Painting by Numbers
Performance values as plain old numbers do not present a visually appealing overview of system performance, but graphical dashboards can help you visualize what would otherwise be boring metrics. A number of free applications visualize metrics in almost any form desired, and one of the most popular open source tools in this family is Grafana [1]. Without much programming knowledge, you can build dashboards to present Internet of Things (IoT) values, stock prices, or the performance data of monitored systems. In this article, I show you how to use Grafana in a convenient GUI to display storage performance values, as well as how to retrieve the desired Simple Network Management Protocol (SNMP) data from InfluxDB.
SNMP, the source for the performance information in this example, is supported by all common operating systems and networked devices. This example uses Synology network-attached storage (NAS) as the data source. Because the queries only use entries from the Management Information Base (MIB) v2 standard, the example will also work with other Linux-based NAS and storage area network (SAN) devices or commercial storage systems. Before Grafana can visually evaluate performance data, however, other tools are needed to collect the data and store the results in a usable way.
InfluxDB Database
Performance data is best stored in a time series database (TSDB), which can reduce the volume of stored data by lowering the resolution of older measurements according to configurable rules. For example, a metrics database might retain per-minute readings for several days, then only hourly readings for older data, and after a further period of time, only one value per day. Popular TSDBs include RRDtool, Prometheus, and InfluxDB.
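To make the idea of reduced resolution concrete, the following sketch averages fine-grained samples into coarser time buckets. It only illustrates the principle and is not how any particular TSDB implements it; the function name downsample and the input format (one "epoch-seconds value" pair per line) are assumptions made for this example.

```shell
# Illustrative only: average "timestamp value" samples into buckets
# of the given width in seconds (e.g., 3600 for hourly values).
downsample() {
  local bucket="$1"
  awk -v b="$bucket" '
    # Map each timestamp to the start of its bucket and accumulate.
    { k = int($1 / b) * b; sum[k] += $2; n[k]++ }
    # Emit one averaged value per bucket.
    END { for (k in sum) printf "%d %.2f\n", k, sum[k] / n[k] }
  ' | sort -n   # awk does not guarantee iteration order, so sort by time
}
```

Feeding three samples through downsample 3600, two of which fall into the first hour, yields one averaged line for that hour and one line for the next.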
InfluxDB [2] is an open source TSDB with a simple HTTP API on port 8086 and an SQL-flavored query language. Anyone who has worked with SQL databases will find InfluxDB easy to use, and the API demands little programming knowledge. To store stock market prices in InfluxDB, for example, all you need is a simple Bash script that first retrieves the price data from a website and then saves it in the database:
curl -i -XPOST 'http://localhost:8086/write?db=stockprices' --data-binary "Stock,Symbol=$symbol value=$price"
This command adds a share price value for Symbol to the Stock table in the stockprices database. InfluxDB itself writes the current timestamp to the entry. In principle, even a simple Bash script would be sufficient to query performance values with snmpget, filter them with awk, and write them to the database with a similar curl statement.
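Such a script could look roughly like the following sketch. The function names, the snmpdata database, and the RAMFree field name are assumptions chosen for illustration; the awk filter assumes the default snmpget output format (for example, "UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 123456 kB").

```shell
# Extract the numeric value from one line of default snmpget output.
snmp_value() {
  awk -F': ' '{ split($2, parts, " "); print parts[1] }'
}

# Build one InfluxDB line-protocol record: table,tag=value field=value
line_protocol() {
  local table="$1" host="$2" field="$3" value="$4"
  printf '%s,hostname=%s %s=%s\n' "$table" "$host" "$field" "$value"
}

# Query the free RAM of one host and write it to InfluxDB.
# Assumes the (hypothetical) snmpdata database already exists.
collect_ram_free() {
  local host="$1" value
  value=$(snmpget -v 2c -c public "$host" 1.3.6.1.4.1.2021.4.6.0 | snmp_value)
  curl -s -XPOST 'http://localhost:8086/write?db=snmpdata' \
    --data-binary "$(line_protocol snmp "$host" RAMFree "$value")"
}
```

Calling collect_ram_free 192.168.2.6 from a cron job or a loop would then record one value per run. Keeping the parsing and formatting in separate functions makes them easy to check without a live SNMP agent.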
However, InfluxDB has some helpful tools to make this operation easier. In addition to its own visualization front end Chronograf, the Telegraf data importer collects information with various import plugins and sends the data to InfluxDB. Of course, the import plugins also include an SNMP grabber.
Setting Up the Grafana Host
In this example, I used a virtual machine with CentOS 7 as the Grafana host. Strictly speaking, the test setup does not even rely on a full-fledged virtual machine, but uses an LXC container with CentOS 7, which is perfectly adequate. The vendors InfluxData and Grafana Labs both provide RPM repositories for their tools. After a minimal CentOS 7 installation (and the yum update -y command), you should create two repository files, influxdb.repo and grafana.repo, in the /etc/yum.repos.d directory, the contents of which are shown in Listing 1. To install the required tools, enter:
Listing 1
Repository Files
[influxdb]
name = InfluxDB Repository - RHEL $releasever
baseurl = https://repos.influxdata.com/centos/$releasever/$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
yum install net-snmp net-snmp-utils grafana telegraf influxdb
For this example, neither InfluxDB nor Grafana requires special configuration options; both can be started with the default values:
systemctl enable influxdb
systemctl start influxdb
systemctl enable grafana-server
systemctl start grafana-server
InfluxDB (on port 8086) and Grafana (on port 3000) are now available. No firewall is deployed in the demo environment. The default login to Grafana is the admin account with a password of admin. In a production environment, you would need to secure access to InfluxDB with an account, a password, and SSL; then deploy a reverse HTTP proxy with SSL termination upstream of Grafana and unlock the required ports with the firewall-cmd command.
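As a rough illustration of the last step, the following sketch opens a list of TCP ports with firewall-cmd. The helper name open_ports and the FW_RUN variable are assumptions for this example; by default the helper only prints the commands so that you can review them before running anything, and which ports you need depends on your own proxy setup.

```shell
# Print (default) or execute the firewall-cmd calls for the given TCP ports.
# Set FW_RUN to an empty string to execute the commands instead of printing them.
open_ports() {
  local run="${FW_RUN-echo}" port
  for port in "$@"; do
    $run firewall-cmd --permanent --add-port="${port}/tcp"
  done
  # Permanent rules only take effect after a reload.
  $run firewall-cmd --reload
}
```

For example, open_ports 443 prints the two commands needed to expose an HTTPS reverse proxy, and FW_RUN= open_ports 443 (as root) applies them.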
Configuring Telegraf
Before Telegraf can start reading data via SNMP, the SNMP service needs to be running on the storage system and serving the public community. To choose another community, you change the name in the configuration. The insecure SNMPv1/v2c protocol versions should be restricted to read access; to change the configuration of devices over SNMP, use the secure but more complex SNMPv3. To enable SNMPv2c on a Linux system, simply install the appropriate package with the SNMP daemon and add an entry for rocommunity public as the only line in the /etc/snmp/snmpd.conf file. Of course, the firewall must allow UDP access to port 161.
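The configuration step can be sketched as a small helper. The function name write_snmpd_conf is an assumption for this example; the target path is a parameter so that you can try the function on a scratch file before overwriting the real configuration.

```shell
# Write a minimal read-only snmpd configuration.
# $1: target file (defaults to /etc/snmp/snmpd.conf)
# $2: community name (defaults to public)
write_snmpd_conf() {
  local target="${1:-/etc/snmp/snmpd.conf}" community="${2:-public}"
  # Overwrites the file so that rocommunity is the only line.
  printf 'rocommunity %s\n' "$community" > "$target"
}

# On a CentOS 7 system you would then typically run, as root:
#   write_snmpd_conf
#   systemctl enable snmpd
#   systemctl start snmpd
#   firewall-cmd --permanent --add-port=161/udp && firewall-cmd --reload
```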
Now you have to decide which SNMP data you want to collect. The snmpwalk command-line tool queries and displays parts of the SNMP MIB of a target system. The Synology NAS is running on IP address 192.168.2.6 and responds to SNMP queries with protocol version 2c for the public community. The command
snmpwalk -v 2c -c public 192.168.2.6
first outputs the complete SNMP MIB of the storage system, but in this example, I only want to look at selected values, and I want Telegraf to collect the following metrics:
- System name
- Network traffic
- Disk I/O
- Storage usage of the filesystems
- CPU and RAM utilization (optional)
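Instead of walking the complete MIB, you can pass snmpwalk the OID of a subtree. The following sketch maps the metric groups from the list above to standard subtrees from MIB-II, HOST-RESOURCES-MIB, and UCD-SNMP-MIB. The helper names are assumptions for this example, and whether a given device actually populates each subtree has to be checked per system.

```shell
# Map a metric group to the OID subtree that usually holds it.
subtree_oid() {
  case "$1" in
    system)  echo 1.3.6.1.2.1.1 ;;          # MIB-II system group (sysName etc.)
    network) echo 1.3.6.1.2.1.2 ;;          # MIB-II interfaces group
    diskio)  echo 1.3.6.1.4.1.2021.13.15 ;; # UCD-DISKIO-MIB
    storage) echo 1.3.6.1.2.1.25.2 ;;       # HOST-RESOURCES-MIB hrStorage
    memory)  echo 1.3.6.1.4.1.2021.4 ;;     # UCD-SNMP-MIB memory
    cpu)     echo 1.3.6.1.4.1.2021.11 ;;    # UCD-SNMP-MIB systemStats
    *)       echo "unknown metric group: $1" >&2; return 1 ;;
  esac
}

# Print the snmpwalk command for one host and metric group.
walk_command() {
  local host="$1" group="$2" oid
  oid=$(subtree_oid "$group") || return 1
  echo "snmpwalk -v 2c -c public $host $oid"
}
```

Running the command printed by walk_command 192.168.2.6 memory, for instance, limits the output to the memory values of the NAS.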
The name of the system is important, because you will rarely want to run the SNMP query against just one system. Later, you will want to display the data for several systems in different views on the Grafana dashboard. For the purposes here, a new /etc/telegraf/telegraf.conf (Listing 2) first tells Telegraf to request new data every 30 seconds and store the data in the telegraf database on the local InfluxDB instance. If the database does not yet exist, Telegraf will create it. The configuration for the SNMP data source is:
[[inputs.snmp]]
  agents = [ "192.168.2.6:161" ]
  version = 2
  community = "public"
  name = "snmp"
Listing 2
Requesting New Data
# Telegraf Configuration, Collect SNMP
[agent]
  interval = "30s"
  round_interval = true

[outputs]
[outputs.influxdb]
  url = "http://localhost:8086"
  database = "telegraf"
Several systems can be specified in the agents line, which Telegraf then queries sequentially. This information is then followed by the queries for individual MIB values, all of which later end up in the snmp table:
[[inputs.snmp.field]]
  name = "hostname"
  oid = "RFC1213-MIB::sysName.0"
  is_tag = true

[[inputs.snmp.field]]
  name = "RAMFree"
  oid = "1.3.6.1.4.1.2021.4.6.0"
The SNMP object identifier (OID) can be specified either numerically or by name. The hostname field is also assigned the is_tag modifier, which stores the value as an indexed tag rather than a plain field, making it easier to use in later queries. In this example, Telegraf queries the OID UCD-SNMP-MIB::memAvailReal as the only memory value. It is one of the few metrics that actually refers to the available RAM. Most other memory values only provide information about the virtual memory (i.e., RAM plus swap). If you want the exact details, you can of course query all memory metrics. Optionally, Telegraf can also retrieve the CPU load:
[[inputs.snmp.field]]
  name = "CPUsystem"
  oid = "1.3.6.1.4.1.2021.11.10.0"
This worked for the old Synology NAS in our lab, which only uses a single-core Atom CPU. Modern systems require further requests for all cores (10.0, 10.1, 10.2, etc.). However, this test does without the direct CPU values and instead later queries the OS value "System Load", which indicates the load of the system independent of the number of cores.
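Because every field entry follows the same pattern, a small helper can generate the blocks for telegraf.conf. This is only a sketch: the function name snmp_field and the field name load1 are assumptions, as is the choice of UCD-SNMP-MIB::laLoadInt.1 (the one-minute load average multiplied by 100) as the OID behind "System Load".

```shell
# Print one [[inputs.snmp.field]] block.
# $1: field name, $2: OID, $3: "true" to mark the field as a tag
snmp_field() {
  local name="$1" oid="$2" is_tag="${3:-false}"
  printf '[[inputs.snmp.field]]\n'
  printf '  name = "%s"\n' "$name"
  printf '  oid = "%s"\n' "$oid"
  if [ "$is_tag" = true ]; then
    printf '  is_tag = true\n'
  fi
}

# Example: print the blocks for the hostname tag and the system load.
# snmp_field hostname RFC1213-MIB::sysName.0 true
# snmp_field load1 1.3.6.1.4.1.2021.10.1.5.1
```

After adding the generated blocks to the configuration, running telegraf --config /etc/telegraf/telegraf.conf --test prints the collected metrics once without writing them to the database, which is a quick way to verify the OIDs.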