How to query sensors for helpful metrics
Measuring Up
Monitoring Hard Disks
You need to pay special attention to monitoring your hard disks. It does not matter whether they are old-fashioned mechanical drives or modern SSDs; you have a number of important metrics to capture. Of course, temperature plays a big role in mechanical drives. If the server does not have a separate temperature sensor, you can use the disk temperature to draw conclusions about the room temperature in the data center itself. If, for example, the air conditioning in a server facility fails, the rise in disk temperature makes this apparent quite quickly. Both legacy disks and SSDs have a SMART (Self-Monitoring, Analysis, and Reporting Technology) function that attempts to predict mass storage defects and failures. Information on read and seek errors is important, because it can indicate problems with individual drives or the complete array.
Two main tools can discover information about drive temperatures on Linux servers: hddtemp and smartctl. As the name suggests, hddtemp returns only the temperature of a drive, whereas smartctl also returns data on error rates and SMART status. However, SMART values should be taken with a grain of salt, because different hard drive manufacturers give you very different values. In my lab setup, for example, I get consistently high values for seek and read errors from Seagate Enterprise disks, whereas Toshiba drives only return 0. Seagate outputs the raw error rates before internal error correction.
The SMART specification does not define exactly what the reported values mean. Because both tools require root privileges to query the information, metrics collectors such as Telegraf, which normally runs as an unprivileged user, can only access the data in a roundabout way. I'll talk about this later when it comes to sending the collected values to Influx. Type
dnf install hddtemp
dnf install smartmontools
to install the tools on an EL8 system.
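To make the smartctl output easier to picture, here is a minimal Python sketch that pulls the temperature and the read and seek error counters out of the attribute table printed by smartctl -A. The function names, the WANTED set, and the SAMPLE text are assumptions made for illustration; the sample numbers are invented, and attribute names differ between drive vendors.

```python
import subprocess

# Attribute names of interest. Names vary by vendor, so this set is an assumption.
WANTED = {"Temperature_Celsius", "Raw_Read_Error_Rate", "Seek_Error_Rate"}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` table output."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        # Keep only plain integer raw values; skip formats such as "12h+30m".
        if name in WANTED and raw.isdigit():
            values[name] = int(raw)
    return values


def read_drive(device):
    """Run smartctl for one device (needs root and smartmontools installed)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)


# Invented sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       201326592
  7 Seek_Error_Rate         0x000f   090   060   045    Pre-fail  Always       -       998877
194 Temperature_Celsius     0x0022   034   051   000    Old_age   Always       -       34 (0 18 0 0 0)
"""

if __name__ == "__main__":
    print(parse_smart_attributes(SAMPLE))
```

Calling read_drive("/dev/sda") as root would return the same kind of dictionary for a real drive. Because the raw values are vendor specific, a collector built this way is better suited to tracking trends per drive than to comparing drives from different manufacturers.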
Intelligent Platform Management with IPMI
Server manufacturers typically build a baseboard management controller (BMC) chip into their servers, which then provides an Intelligent Platform Management Interface (IPMI). This interface gives you very detailed information about the system, which the BMC collects independent of the operating system. Depending on the implementation, the IPMI can be addressed on the local system or on the LAN. What can be retrieved by IPMI depends on the BMC used and differs depending on the vendor and age of the server. I used different systems for this article: an Intel NUC and a cloud server by Hetzner, which does not have a BMC.
An older HP Gen8 microserver at least provides a few temperature values for the RAM, chipset, and case, whereas its big brother, the DL380 Gen8, reveals the current power consumption of the power supply unit and reams of temperature data from every imaginable place in the chassis. Newer Dell and Fujitsu servers also report very detailed fan speeds as well as chipset, DIMM, and processor voltages and currents. You need ipmitool to access the local IPMI. The commands
dnf install ipmitool
ipmitool sdr elist
install the tool and list the available sensors and their values.
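The sensor list is a pipe-separated table, which makes it easy to turn into metrics. The following Python sketch assumes the usual five-column layout of ipmitool sdr elist (name, sensor ID, status, entity, reading); the function name parse_sdr_elist and the sample lines are made up for illustration, and real sensor names depend on the BMC.

```python
def parse_sdr_elist(text):
    """Parse `ipmitool sdr elist` lines into a list of sensor dicts."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        name, sensor_id, status, entity, reading = fields
        value, _, unit = reading.partition(" ")
        try:
            number = float(value)
        except ValueError:
            # Discrete sensors or "No Reading" carry no numeric value.
            number, unit = None, reading
        sensors.append(
            {"name": name, "status": status, "value": number, "unit": unit}
        )
    return sensors


# Invented sample output, for illustration only.
SAMPLE = """\
Inlet Temp       | 04h | ok  |  7.1 | 21 degrees C
Fan1             | 30h | ok  |  7.1 | 3600 RPM
PS Redundancy    | 74h | ok  |  7.1 | Fully Redundant
CPU2 Temp        | 0Fh | ns  |  3.2 | No Reading
"""

if __name__ == "__main__":
    for sensor in parse_sdr_elist(SAMPLE):
        print(sensor)
```

Sensors without a numeric reading are kept with a value of None so that a collector can decide whether to drop them or report only their status.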
Collecting Data for the UPS
An uninterruptible power supply (UPS) is essential in any server facility. Depending on the design and monitoring port, you can also glean valuable information about the condition of this device. It is not just about battery life and utilization. The voltage curve also provides important information. Machine crashes or the unexpected death of a component (e.g., a hard disk) are often accompanied by voltage fluctuations in the mains supply.
Large UPSs with a LAN management adapter disclose this basic and other information by way of the Simple Network Management Protocol (SNMP), but more of that in a moment. A simple UPS comes without a LAN interface, but with a serial or USB interface for monitoring. Often the Network UPS Tools (NUT) suite is then used to monitor the UPS and organize controlled shutdowns of several systems on the LAN. The upsd UPS daemon from the NUT toolset stays in permanent contact with the UPS system and continuously queries the metrics. If you have the right metrics grabbers, this information can also be routed to Influx and the Grafana dashboard.
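One possible metrics grabber is a small script around upsc, the command-line client that ships with NUT and prints the variables known to upsd as "name: value" lines. The Python sketch below is only an illustration under that assumption: the function names, the UPS name myups@localhost, and the sample values are hypothetical.

```python
import subprocess


def parse_upsc(text):
    """Turn `upsc` output ("name: value" per line) into a dict."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # not a "name: value" line
        value = value.strip()
        try:
            metrics[key] = float(value)  # numeric readings such as voltages
        except ValueError:
            metrics[key] = value  # text values such as the status flags
    return metrics


def read_ups(ups_name="myups@localhost"):
    """Query a running upsd through upsc; the UPS name is a placeholder."""
    result = subprocess.run(
        ["upsc", ups_name], capture_output=True, text=True, check=False
    )
    return parse_upsc(result.stdout)


# Invented sample output, for illustration only.
SAMPLE = """\
battery.charge: 100
battery.runtime: 1800
input.voltage: 230.0
ups.load: 23
ups.status: OL
"""

if __name__ == "__main__":
    print(parse_upsc(SAMPLE))
```

Logging input.voltage at short intervals is what makes the voltage curve mentioned above visible on a dashboard.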