Storage monitoring with Grafana
Painting by Numbers
Reading Complete SNMP Tables
SNMP organizes various items of system information in tables, which is quite practical for my purposes because Telegraf can retrieve complete SNMP tables in a single action. With a networked storage system, administrators naturally want to know exactly how many gigabytes are coming in and going out over the network interfaces. Telegraf therefore collects the complete network table:
[[inputs.snmp.table]]
  name = "if"
  inherit_tags = [ "hostname" ]
  oid = "IF-MIB::ifXTable"

  [[inputs.snmp.table.field]]
    name = "ifName"
    oid = "IF-MIB::ifName"
    is_tag = true
The ifName field is a table index, which makes it easy later to display the values of the various network interfaces separately. This example could also be used to monitor a managed network switch. The IF-MIB then lists all switch ports and their loads, and it works for Fibre Channel switches, as well. The command
snmpwalk -v 2c -c public 192.168.2.6 IF-MIB::ifXTable
shows the complete table content, including the names of the interfaces and many different counters for packets sent and received.
Following the same pattern, Telegraf also imports the disk I/O values into InfluxDB. These values are likewise organized in a standard MIB table, and the device name later acts as an index. The input/output operations per second (IOPS) are far more interesting here than the throughput (MBps) per disk. Bandwidth bottlenecks in sequential data transfer are primarily caused by the network connection. The disks, on the other hand, have limited IOPS and cause problems when many small random accesses occur, such as database queries or simultaneous access by different clients. The entries are thus:
[[inputs.snmp.table]]
  name = "diskio"
  inherit_tags = [ "hostname" ]
  oid = "UCD-DISKIO-MIB::diskIOTable"

  [[inputs.snmp.table.field]]
    name = "DiskName"
    oid = "UCD-DISKIO-MIB::diskIODevice"
    is_tag = true
Telegraf retrieves information on disk allocation and system load from two further standard SNMP tables (Listing 3).
Listing 3
Disk and Load Requests
[[inputs.snmp.table]]
  name = "diskusage"
  inherit_tags = [ "hostname" ]
  oid = "HOST-RESOURCES-MIB::hrStorageTable"

  [[inputs.snmp.table.field]]
    name = "VolumeName"
    oid = "HOST-RESOURCES-MIB::hrStorageDescr"
    is_tag = true

[[inputs.snmp.table]]
  name = "load"
  inherit_tags = [ "hostname" ]
  oid = "UCD-SNMP-MIB::laTable"

  [[inputs.snmp.table.field]]
    name = "loadtime"
    oid = "UCD-SNMP-MIB::laNames"
    is_tag = true
For the moment, this configuration collects all the values needed for visualization in Grafana. Because InfluxDB does not require a rigid database structure, you can add more tables or single values to the configuration later on. Additionally, you can use this configuration, as mentioned above, to query data from several systems. The inherit_tags = [ "hostname" ] entry passes the hostname tag on to every table row, so that InfluxDB queries can select values by system, but more about this later.
To check whether the Telegraf configuration actually works, first issue the telegraf --test command. The tool then parses the configuration, executes the queries, and displays the results at the command line. You can check whether the results suit your needs and, if not, change the queries. If everything is fine, enter
systemctl restart telegraf
systemctl enable telegraf
to start the service and deliver fresh metrics to InfluxDB every 30 seconds.
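When checking the test output, it helps to see which values arrive as tags and which as fields. The following Python sketch splits one metric line in InfluxDB line protocol into its parts. The sample line is invented for illustration, and the parser is deliberately simplified: It assumes numeric fields only, no escaped spaces or commas, and a trailing timestamp.

```python
def parse_line(line):
    """Split one simplified line-protocol record into its parts.

    Assumptions: numeric fields only, no escaped characters,
    and a timestamp at the end of the line.
    """
    line = line.strip()
    if line.startswith(">"):          # the test output prefixes metrics with ">"
        line = line[1:].strip()
    head, field_part, timestamp = line.split()
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {}
    for pair in field_part.split(","):
        key, raw = pair.split("=", 1)
        # Integer fields carry an "i" suffix in line protocol
        fields[key] = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return measurement, tags, fields, int(timestamp)


# Hypothetical sample line; real counter values will differ
sample = ("> if,hostname=fatbox,ifName=eth1 "
          "ifHCInOctets=123456789i,ifHCOutOctets=987654321i 1550000000000000000")
measurement, tags, fields, timestamp = parse_line(sample)
print(measurement, tags, fields)
```

In this sample, hostname and ifName show up as tags, which is what later allows the dashboard queries to filter by system and interface.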
Creating Custom Dashboards
The user interface tool works on a simple principle: data sources with information on the one hand and visualizations that display data from the sources on the other. Grafana combines several visualizations in dashboards and has a simple user and rights system, as well. Therefore, you can restrict access to dashboards to individual groups and users. Here, however, I will not be looking at access controls.
A newly installed Grafana first requires a new password and the first data source. In this case, InfluxDB runs on http://localhost:8086 with Access set to Server (Default) and the telegraf database, which does not require a username and password. I quickly create a new dashboard named Synology and get started with an initial visualization task showing the network traffic (Figure 1).
The Add Panel button at the top of the dashboard opens the dialog for the new visualization, which first wants information about the query. Grafana does not require manual input; rather, it relies on point-and-click in the Query Builder, which greatly simplifies even the more complex database queries. The query starts with the FROM statement. The first and only data source is also the default. The table is simply named if in the Telegraf configuration set up earlier, and the WHERE selection filters for host and interface names. The NAS in the test goes by the name fatbox and has two LAN ports, of which only eth1 is attached to the switch. The selection is therefore:
FROM default if WHERE ifName = eth1 AND hostname = fatbox
For the SELECT statements, Grafana now only suggests the fields that match the FROM filter criteria. SNMP does not provide values in megabits per second, but simply counts the incoming and outgoing network octets (bytes) in 64-bit counters. The following selection is required for a value in bits per second:
SELECT field(ifHCInOctets) mean() derivative(1s) math(*8) alias(IN)
The ifHCInOctets field is a 64-bit integer that returns the number of incoming octets; the derivative(1s) function calculates the change from second to second. With new values only every 30 seconds, mean() determines the mean value between the last data points, and math(*8) converts the octet (=byte) per second into a bit per second value. The alias(IN) is only used for cosmetic reasons so that the legend for the graph reads if.IN.
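The arithmetic behind this selection can be sketched in a few lines of Python. This is only an illustration of the calculation, not of how InfluxDB implements it; the function name and the counter readings are made up.

```python
def bits_per_second(prev_octets, curr_octets, interval_s):
    """Rate in bit/s between two octet counter readings.

    prev_octets and curr_octets are two successive counter values,
    interval_s is the time between them in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    octets_per_second = (curr_octets - prev_octets) / interval_s  # derivative(1s)
    return octets_per_second * 8                                  # math(*8)


# Hypothetical ifHCInOctets readings taken 30 seconds apart
rate = bits_per_second(1_000_000, 4_750_000, 30)
print(rate)  # 1000000.0, i.e., 1 Mbit/s
```

The counter grew by 3,750,000 octets in 30 seconds, which is 125,000 octets per second or 1,000,000 bits per second.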
To display the OUT value, as well, simply click on the + at the end of the query and scroll to fields/field. Grafana then duplicates the existing query into a second SELECT query. This second line is then assigned the ifHCOutOctets field and the OUT alias:
SELECT field(ifHCOutOctets) mean() derivative(1s) math(*8) alias(OUT)
Now the visualization will show the incoming and outgoing network traffic, but the two graphs overlap. To make this a little more clear cut, simply change the math() entry of the OUT graph from math(*8) to math(*-8). Grafana now visualizes the OUT traffic in a far more intuitive graph as a negative value in the downward direction.
For up-to-date values at all times, you can set the displayed time span and refresh interval in the upper right corner of the dashboard. In this early phase, Last 1 hour with Refresh every 30s is recommended.
Grafana shows the section icons for further graph configuration to the left of the query. From the Visualization icon, you can set the graphic type and the display options. The default values will normally be fine for line graphs. In the Axes | Left Y section, you can define the measurement Unit; in this example it is Data Rate – bit/s. The General tab has a field for the visualization name, which is then saved by pressing the Save Dashboard icon at the top of the screen.
Refining the Display
Using the same procedure, you can create a second panel with the disk IOPS (Figure 2). Choose whether you want to monitor the values of all physical disks separately (i.e., sda, sdb, sdc). However, the monitoring example here only considers the multidisk device dm-1 (i.e., the software RAID that the Synology NAS has created from the disks). Depending on the configuration of your network storage, completely different device names will appear here.
Those with iSCSI target services and a block back end will find their iSCSI target devices listed separately as dm-2, dm-3, and so on. iSCSI targets in file mode, on the other hand, save the virtual disk as a file, whose IOPS appear as part of dm-1 and cannot be monitored separately. The query for IOPS follows the same pattern as the network query:
FROM default diskio WHERE DiskName = dm-1 AND hostname = fatbox
SELECT field(diskIOReads) mean() derivative(1s) math(*-1) alias(ReadIO)
The read value extends in the downward direction because of the math(*-1) entry, so that it visually mirrors the OUT graph in the network panel. The second graph, similar to the first, uses the diskIOWrites field and alias(WriteIO) and omits the math() entry. Now just fine tune the appearance and the second panel is done and dusted.
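As a rough illustration of what the panel plots, the following Python sketch turns a short series of counter samples into per-second rates and mirrors the reads below the zero line. The sample values and the function name are invented, and the sketch ignores counter resets.

```python
def iops_series(samples, sign=1):
    """Per-second rates from (timestamp_s, counter) samples.

    sign=-1 mirrors the series below the zero line, like math(*-1).
    Counter resets or wraparounds are not handled in this sketch.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append(sign * (c1 - c0) / (t1 - t0))
    return rates


# Hypothetical diskIOReads and diskIOWrites samples, 30 seconds apart
reads = [(0, 1000), (30, 1600), (60, 2500)]
writes = [(0, 500), (30, 800), (60, 950)]

read_io = iops_series(reads, sign=-1)   # plotted downward
write_io = iops_series(writes)          # plotted upward
print(read_io, write_io)                # [-20.0, -30.0] [10.0, 5.0]
```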
The last item of the Graph section lets you create alerts. Grafana can then send alerts through various channels if the monitored values drop below or climb above a certain value over a defined period of time. This is not necessary for I/O values. However, values such as temperatures, fan speeds, or UPS battery status can also be retrieved with this SNMP setup, and for those the alert function makes sense. In addition to good old email, Grafana can notify messengers such as Slack, Telegram, and Discord through configurable notification channels.
To display the fill level of the NAS as a percentage, you need a more complex query and a nice Singlestat Panel (Figure 3). InfluxDB can perform mathematical calculations in the queries. For the fill level of the /volume1 filesystem on the NAS as a percentage, you need to divide the SNMP hrStorageUsed value from hrStorageTable by the total capacity hrStorageSize and multiply the result by 100.
For queries that Grafana cannot create in the user interface, you can first create a simpler query with the graphical tools (e.g., only for hrStorageUsed) and then switch from the graphical to the text query by pressing the eye icon. After editing, the query reads as follows:
SELECT mean("hrStorageUsed")/mean("hrStorageSize")*100 FROM "diskusage" WHERE ("VolumeName"='/volume1') AND ("hostname"='fatbox') AND $timeFilter GROUP BY time($__interval) fill(previous)
Unlike the user interface query, the text query displays the required InfluxDB syntax with parentheses and quotes. You can check immediately whether your manual edits to the query actually work with the Query Inspector, which displays the complete output of a query onscreen, including all error messages.
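The calculation in this query is a plain ratio, which the following Python sketch reproduces with made-up numbers. In HOST-RESOURCES-MIB, both hrStorageUsed and hrStorageSize are counted in the same allocation units, so the ratio needs no unit conversion. The function name is hypothetical.

```python
def fill_percent(used_units, size_units):
    """Fill level in percent from hrStorageUsed and hrStorageSize.

    Both values are counted in the same allocation units,
    so the unit size cancels out of the ratio.
    """
    if size_units <= 0:
        raise ValueError("size_units must be positive")
    return used_units / size_units * 100


# Hypothetical values for /volume1
print(fill_percent(1_500_000, 2_000_000))  # 75.0
```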
To display this query appropriately, select the Singlestat option in the Visualization section. To make it pretty, add the following values to the Value pane: Stat: current; Unit: percent (0-100); Thresholds: 50,80 with the colors green, yellow, and red; and Gauge: Show, with Threshold Markers checked. The thresholds change the color of the graph when the specified values are exceeded. This graph will show red as of 80 percent occupancy of the data carrier. The Stat: current entry forces the display to show the latest acquired value. If the Stat: avg default is kept, the graph shows the average value over the period selected at the top right.
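The effect of the two thresholds can be pictured as a simple lookup, sketched here in Python. This only mirrors the behavior described above and is not Grafana code; the function name is invented, and the assumption is that a value exactly on a threshold already takes the next color.

```python
def gauge_color(percent, thresholds=(50, 80), colors=("green", "yellow", "red")):
    """Map a fill level to a color using ascending thresholds.

    Values below the first threshold get the first color; values at or
    above the last threshold get the last color (assumed boundary handling).
    """
    for limit, color in zip(thresholds, colors):
        if percent < limit:
            return color
    return colors[-1]


for level in (42, 65, 80, 95):
    print(level, gauge_color(level))
```

With the settings from the text, 42 percent maps to green, 65 percent to yellow, and 80 or 95 percent to red.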