Admin as a service with sysstat for ex-post monitoring
Facts, Figures, and Data
The sysstat package contains numerous tools for monitoring performance and load states on systems, including iostat, mpstat, and pidstat. In addition to tools intended for ad hoc use, sysstat comes with a collection of utilities that you schedule for periodic use as cron jobs (e.g., sar). The compiled statistics range from I/O transfer rates, CPU and memory utilization, paging and page faults, network statistics, and filesystem utilization through NFS server and client activities.
Of course, you could use top, vmstat, ss, and so on to determine the data, but bear in mind that system events are hardly likely to be restricted to the times at which you are sitting in front of the screen. Admins typically receive requests for more detailed information, like: "What exactly happened on system XY between 1:23 and 1:37am?" Many sensors in modern monitoring systems are capable of detecting these anomalies, but very few IT departments are likely to have configured comprehensive monitoring for all systems. Moreover, these metrics are quite commonly not collected in a central monitoring setup. Sysstat gives you the perfect toolbox for these cases.
Installation
To install sysstat under Ubuntu and Debian, just run:
sudo apt install sysstat
By default, sysstat checks the system every 10 minutes with the use of a systemd timer. If you require more frequent measurements, you need to adjust the interval in the /usr/lib/systemd/system/sysstat-collect.timer file. You can set a scan every five minutes by entering *:00/05 instead of the default setting *:00/10:
[Unit]
Description=Run system activity accounting tool every 5 minutes

[Timer]
OnCalendar=*:00/05

[Install]
WantedBy=sysstat.service
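Note that a package update can overwrite files under /usr/lib/systemd/system/. As an alternative sketch, assuming a systemd version that supports systemctl edit, you can keep the change in a drop-in override instead:

sudo systemctl edit sysstat-collect.timer

In the editor that opens, clear the inherited schedule first, then set the new interval:

[Timer]
OnCalendar=
OnCalendar=*:00/05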
Next, tell systemd that the configuration has changed by typing:
systemctl daemon-reload
To enable and start monitoring with sar, you also need to open the /etc/default/sysstat file and change the ENABLED="false" setting to ENABLED="true". You can enable automatic launching of the service at boot time and launch it directly with:
systemctl enable sysstat
systemctl start sysstat
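On current systemd versions, the two steps can also be combined into a single command:

systemctl enable --now sysstat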
A call to systemctl status sysstat should now return an active message.
sysstat at Work
Sysstat writes the measured values in a binary-encoded format to files in the /var/log/sysstat/ folder. The utility has a file-naming scheme that uses the day (i.e., sa<DD> with the -o option) or the year, month, and day (i.e., sa<YYYYMMDD> with the -D option) and always accesses the current file when reading.
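To check what has been collected on a given machine, you can simply list the directory and point sar at one of the files. The paths here assume the Debian/Ubuntu default of /var/log/sysstat/ (Red Hat-based systems use /var/log/sa/ instead):

ls /var/log/sysstat/
sar -f /var/log/sysstat/sa$(date +%d)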
If you launch sar -q, the tool displays a brief overview of the configuration. The overview contains a line with system data such as kernel, hostname, and date. Another line indicates when sysstat-collect.timer was last restarted. Finally, one line per sample shows measured values (e.g., the size of the run queue and the process list) as well as three calculated values for load average.
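Because the exact column set can vary slightly between sysstat releases, the following only sketches the typical layout of the per-sample lines that sar -q prints:

sar -q
hh:mm:ss  runq-sz  plist-sz  ldavg-1  ldavg-5  ldavg-15  blocked
[...]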
Ex-Post Analysis
In this article, I focus on accessing ex-post data (i.e., the state of a system in the past). To access the data for the last two days, sar offers the -1 (yesterday) and -2 (day before yesterday) switches. Regardless of the sensor in which you are interested, sar -1 gives you the previous day's data.
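For example, to see yesterday's run queue and load figures, you can combine the day switch with a report selector:

sar -1 -q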
If the measurement data is from further back, you have no choice but to specify the file names with the -f argument. If you explicitly want to view the statistics from January 8, you would define the source file as shown in the first line of Listing 1.
Listing 1
Analysis with sar
01 $ sar -f /var/log/sysstat/sa08
   [...]
02 $ sar -f /var/log/sysstat/sa08 -s 23:30:00 -e 23:55:00
   [...]
03 $ sar -f /var/log/sysstat/sa07 -s 23:30:00 -e 23:55:00 -P ALL
04 $ sar -f /var/log/sysstat/sa07 -s 23:30:00 -e 23:55:00 -P 1,2
However, filtering by specific days is often not enough, so you can filter for a time interval by passing

-s [<hh>:<mm>:<ss>] -e [<hh>:<mm>:<ss>]

for the start and end times. The command in line 2 of Listing 1 returns metrics between 23:30:00 and 23:55:00 on January 8.
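Applied to the question from the beginning of the article, a query of this kind could look as follows (assuming the incident happened yesterday):

sar -1 -s 01:20:00 -e 01:40:00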
The command returns three data records in addition to the obligatory header line. My systemd timer is set to a 10-minute period, which explains the times of the samples. Other measured variables are user, nice, system, iowait, steal, and idle. Unsurprisingly, user shows the proportion of the CPU load assigned to processes in user space. The same applies to nice, with the restriction that only processes running at a modified nice level are included.
The system variable shows utilization by the kernel (i.e., the time required for hardware access or software interrupts). Waits, which can occur when accessing hardware, are listed under iowait, and steal records the share of involuntary waits that occur, for example, if you are looking at a virtual machine and the hypervisor distributes CPU time differently. Finally, idle provides information about time when CPUs are idle and no I/O operations are pending.
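These columns belong to sar's CPU utilization report, which is what plain sar prints by default; you can also request it explicitly with -u and combine it with the filters shown above, for example:

sar -u -f /var/log/sysstat/sa08 -s 23:30:00 -e 23:55:00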
Without further details, sar provides CPU utilization, but only as a total for all CPUs. If you want to view individual CPUs or cores, you can run sar with the -P switch. The command in Listing 1 line 3 displays the status of all cores, and the command in line 4 displays the status for cores 1 and 2 only.
Like top, sar displays the average load. In principle, this is an excellent indicator for assessing whether a system is scaled correctly. If the average system load is significantly greater than the number of CPU cores, you are looking at an overload scenario; on a four-core machine, for example, a sustained load average well above 4 means processes are queuing for CPU time. However, if the value is mostly around zero, the system may be oversized and you could consider withdrawing resources.
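A minimal shell sketch for this kind of sanity check, relying only on the standard nproc utility and /proc/loadavg:

# Compare the load averages with the number of available cores
cores=$(nproc)
read l1 l5 l15 rest < /proc/loadavg
echo "cores=$cores load1=$l1 load5=$l5 load15=$l15"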
As Figure 1 illustrates, sar can output a considerable volume of detail about system load. The run queue (runq-sz) shows hardly any additional load during the afternoon. The number of processes (plist-sz) is quite high, at around 350, but this number is intentional, being mainly PHP server processes available as spare servers to process requests as quickly as possible. The machine has four cores and is exposed to very moderate utilization, with an average system load of 0.55.