Nagios Passive Checks
Smooth Check
Suppose you have developed a magnificent script (a Bash script, for example). This script will execute every night to dump a MySQL database or to rsync your valuable files to another server.
A common strategy is to use email to get notified about the script results. If you wrote a good script, and your systems are very strong, the script will never exit with an error condition, and every morning you will find a message in your inbox stating "OK: no error running my beautiful backup script."
This message is not a problem, as long as you only have one script sending a notification once a day. Suppose, however, that you have many scripts, with many backups, and all these scripts are notifying you via email. Your mailbox will fill up every morning with the same message. The first day, you will be impressed, but soon you will lose interest in reading all those identical messages. Your next step will be to create a filter in your mail client to put these messages in a subfolder, and after that you won't read them anymore, because you have basically just sent yourself a lot of spam. The worst part is that, in the flood of useless emails, you risk not seeing a real error notification that you really should be reading.
One solution is to send a message only in the case of an error; however, if you configure your script to send an email if it encounters an error during the execution, you have no guarantee that the script actually executed. If the script crashes or aborts before the line that sends the email notification, you will never know.
A far better solution is to let Nagios [1] listen for the notifications and only notify you if an error occurs or a message indicating success is not received.
Nagios will help you:
- Avoid having to manually process useless spamming notifications ("script execution ok" messages)
- Receive execution error or warning notifications
- Find out if the script did not complete successfully
The Nagios passive check technique described in this article uses Nagios Service Check Acceptor (NSCA) [2]. This article assumes you have a working knowledge of Nagios. If you are new to the Nagios network monitoring system, see the resources at the Nagios website.
Getting Started
I'll start by defining a host (Listing 1). This host can be the server where your script runs, the server involved in the backup, or a dummy host (in Nagios, you can define a host and associate a check_command with that host).
Listing 1
Host Definition
define host {
    host_name       yourserver
    alias           yourserver
    address         192.168.0.100
    check_command   check-host-alive
    contact_groups  contacts
    use             check_5min_24x7,notify_24h_24x7
}
This article assumes you have some background with Nagios configuration, but I'll start with a little refresher for those who haven't tried it in a while. Nagios lets you define the objects you want to monitor, like hosts and services. Instead of issuing repetitive directives for each object, you can set up templates to use in common situations.
As you can see in Listing 1, I use the templates check_5min_24x7 and notify_24h_24x7. The check_5min_24x7 template (Listing 2) checks whether the host is alive (the check-host-alive command uses a ping) every five minutes, for the period defined in the check_period attribute (in this case, 24x7). The 24x7 time period (Listing 3) tells Nagios to perform the related action (a check or a notification) every day of the week at any hour. (If you do not want to check that a host is alive, you can define a check template that always returns OK; see the Nagios check_dummy plugin.)
Listing 2
check_5min_24x7
define host {
    name                    check_5min_24x7
    register                0
    max_check_attempts      3
    check_interval          5
    retry_interval          1
    active_checks_enabled   1
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     1800
    check_period            24x7
    check_command           check-host-alive
}
Listing 3
24x7
define timeperiod {
    timeperiod_name  24x7
    alias            24x7
    sunday           00:00-24:00
    monday           00:00-24:00
    tuesday          00:00-24:00
    wednesday        00:00-24:00
    thursday         00:00-24:00
    friday           00:00-24:00
    saturday         00:00-24:00
}
The notify_24h_24x7 template (Listing 4) defines notification behavior: It tells Nagios what to do after a state change (for a service, for example, the script exits with an error, and the state changes from 0, OK, to 2, CRITICAL). With this configuration, Nagios sends a notification on any day of the week, at any hour of the day (the 24x7 time period from Listing 3); then, if no further state change occurs, it waits 24 hours, as defined by notification_interval, before sending another notification (in this case, another email). The notification_interval value is counted in units of interval_length, which is 60 seconds by default, so 24 hours is written as 1440. Because the host in Listing 1 uses this template, Listing 4 defines it as a host template; Listing 6 defines the service template of the same name.
Listing 4
notify_24h_24x7
define host {
    name                   notify_24h_24x7
    register               0
    notification_interval  1440
    notification_options   d,u,r,f,s
    notification_period    24x7
}
The next step is to define a service template (Listing 5). This template will use the freshness_threshold option to raise an alert if Nagios does not receive any notification from your script over a period of 93600 seconds (26 hours). Suppose your script is executed by cron every day at 1:00am, that is, every 24 hours: The 26-hour threshold gives the script two hours to complete. (Obviously, you must adjust this period for your own situation.)
Listing 5
Service Template
define service {
    name                    check_passive_26h_24x7
    register                0
    max_check_attempts      1
    check_interval          1
    retry_interval          1
    active_checks_enabled   0
    passive_checks_enabled  1
    notifications_enabled   1
    check_freshness         1
    freshness_threshold     93600
    check_period            24x7
}
The next step is to define a service template related to the notification (Listing 6). This template defines how often to send notifications in case of warning or critical status: Once a day is sufficient.
Listing 6
Notification Template
define service {
    name                   notify_24h_24x7
    register               0
    notification_interval  1440
    notification_options   w,u,c,r,f,s
    notification_period    24x7
}
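The two service templates measure time in different units, which is easy to mix up: freshness_threshold is given in seconds, whereas notification_interval is counted in units of interval_length (60 seconds unless you changed it in the main configuration file). The following shell sketch shows the arithmetic behind the values 93600 and 1440; the function names are made up for this example and are not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nagios) that only illustrate the arithmetic.

# freshness_threshold is in seconds: interval between runs plus a grace period.
# usage: freshness_threshold RUN_INTERVAL_HOURS GRACE_HOURS
freshness_threshold() {
    echo $(( ($1 + $2) * 3600 ))
}

# notification_interval is in units of interval_length (default: 60 seconds).
# usage: notification_interval HOURS [INTERVAL_LENGTH_SECONDS]
notification_interval() {
    echo $(( $1 * 3600 / ${2:-60} ))
}

freshness_threshold 24 2     # prints 93600
notification_interval 24     # prints 1440
```

If your script runs weekly instead of nightly, for example, the same helper gives the threshold for a 168-hour interval plus whatever grace period suits the job.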
Now define a service related to your script (Listing 7).
Listing 7
Service Template for the Script
define service {
    service_description     Powerful_backup
    check_command           passive_backup!2!"Warning: no passive check received in the expected period"
    host_name               yourserver
    contact_groups          contacts
    flap_detection_enabled  0
    event_handler_enabled   0
    use                     check_passive_26h_24x7,notify_24h_24x7
}
The command in Listing 8 points to a script in the $USER1$ directory (in the Debian package, this directory is /usr/lib/nagios/plugins/).
Listing 8
Command Template for Passive Check
define command {
    command_name  passive_backup
    command_line  $USER1$/nobackupreport.sh $ARG1$ $ARG2$
}
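The script in Listing 9 simply prints the string you pass to it and exits with the exit status you pass to it. Nagios runs this command only when the freshness threshold expires without a passive result arriving. With the arguments from Listing 7, the script prints the "no passive check received" warning and exits with 2, the CRITICAL exit code for Nagios plugins.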
Listing 9
nobackupreport.sh
#!/bin/sh

# First argument: the exit status to return to Nagios
status=$1
shift 1

# Remaining arguments: the message to print
/bin/echo "$@"

exit "$status"
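To see what Nagios gets back when the freshness threshold expires, you can run the same logic by hand. The sketch below writes a copy of the script to a temporary file (an assumption made so that it does not depend on the plugin directory of your installation) and calls it with the two arguments from Listing 7.

```shell
#!/bin/sh
# Write a copy of the Listing 9 logic to a temporary file for a dry run.
plugin=$(mktemp)
cat > "$plugin" <<'EOF'
#!/bin/sh
status=$1
shift 1
echo "$@"
exit "$status"
EOF

# Call it the way Nagios would: $ARG1$ is the state, $ARG2$ the message.
if output=$(sh "$plugin" 2 "Warning: no passive check received in the expected period"); then
    rc=0
else
    rc=$?
fi
rm -f "$plugin"

echo "plugin output: $output"
echo "exit status:   $rc"
```

The exit status of 2 is what turns the service CRITICAL, and the printed text is what appears as the status information in the notification.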
Now you have to install and start the NSCA service. In Debian, that's:
apt-get install nsca
/etc/init.d/nsca start
For simplicity, I use the default /etc/nsca.cfg configuration file. In this file, you can define the decryption method (just obfuscation by default) and the listening port (5667 by default). This service will listen for incoming passive checks.
Now, on the machine where your powerful backup script resides, you must install the nsca-client package (check your distribution or operating system). You must then edit your script to add the section related to Nagios: Pipe a string of the following form to the send_nsca client command:
"host;service;state;message"
where host is the hostname, service is the service name previously defined in the Nagios configuration, state is a Nagios status code (0 OK, 1 warning, 2 critical), and message is the message that will appear in the notification (on the Nagios web page as well as in the email message).
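As a small illustration of this format, the following sketch builds the line from its four fields. The nsca_line function name is invented for this example and is not part of the NSCA client. It also accepts state 3 (unknown), which Nagios plugins use in addition to the three codes above, and, as a precaution, it rejects fields that contain the ";" delimiter.

```shell
#!/bin/sh
# nsca_line is a hypothetical helper name, not part of the NSCA client.
# usage: nsca_line HOST SERVICE STATE MESSAGE
nsca_line() {
    case $3 in
        0|1|2|3) ;;   # 0 OK, 1 warning, 2 critical, 3 unknown
        *) echo "invalid state: $3" >&2; return 1 ;;
    esac
    case $1$2$4 in
        *\;*) echo "fields must not contain the ';' delimiter" >&2; return 1 ;;
    esac
    printf '%s;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

nsca_line yourserver Powerful_backup 0 "Backup OK"
# prints: yourserver;Powerful_backup;0;Backup OK
```

The host and service fields must match the host_name and service_description values in the Nagios configuration; otherwise, Nagios cannot assign the result to a service.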
Listing 10 is a Bash script showing the passive check.
Listing 10
Passive Check Bash Script
#!/bin/bash

# Placeholder for your real backup command
mysqldump and so on

# Save the exit status of the backup command
EL=$?

if [ "$EL" -ne 0 ]
then
    MESSAGE="Problem with mysqldump"
    STATE=2
else
    MESSAGE="Backup OK"
    STATE=0
fi

# Send the result to the NSCA daemon on the Nagios server
echo "yourserver;Powerful_backup;$STATE;$MESSAGE" | /usr/sbin/send_nsca \
    -H yournagiosserverIP -p 5667 -d ";" -c /etc/send_nsca.cfg
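If several scripts report to Nagios, the mapping from exit status to Nagios state shown in Listing 10 can be wrapped in a function. The following is a minimal sketch under some assumptions: run_and_report, send_result, and the DRY_RUN variable are names invented for this example, and the options from Listing 10 are passed unchanged to the send_nsca client. With DRY_RUN set, the line is printed instead of sent, so you can try the wrapper without a Nagios server.

```shell
#!/bin/sh
# Hypothetical wrapper names; the send_nsca options are taken from Listing 10.

# Send the result line on stdin to Nagios, or just print it when DRY_RUN is set.
send_result() {
    if [ -n "${DRY_RUN:-}" ]; then
        cat
    else
        /usr/sbin/send_nsca -H yournagiosserverIP -p 5667 -d ";" -c /etc/send_nsca.cfg
    fi
}

# usage: run_and_report HOST SERVICE COMMAND [ARGS...]
run_and_report() {
    host=$1
    service=$2
    shift 2
    if "$@"; then
        state=0
        message="Backup OK"
    else
        rc=$?
        state=2
        message="Problem with $1 (exit status $rc)"
    fi
    printf '%s;%s;%s;%s\n' "$host" "$service" "$state" "$message" | send_result
}

DRY_RUN=1
run_and_report yourserver Powerful_backup true
run_and_report yourserver Powerful_backup sh -c 'exit 3'
```

In the dry run, the first call prints a line with state 0 and the second a line with state 2. If the script dies before it reaches the reporting step, nothing is sent at all, and the freshness threshold from Listing 5 is what raises the alert.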
Conclusion
You can tweak this passive check configuration for your own needs. This technique offers a useful way to handle notifications and checks coming from your scripts. You can eliminate spam messages, reduce the risk of losing important notifications, and make sure your scripts are really executed.
The NSCA client is packaged for most Linux distributions, Solaris, and other Unix variants; you also can use it in a SmartOS global zone without having to install a package. The send_nsca utility is also available for Windows. You can use the JSend NSCA Java API to send Nagios passive checks from within your Java applications, and APIs also exist for other languages, such as Ruby, PHP, and Perl.
If you don't want to take the time to learn the details of Nagios manual configuration, you might want to experiment with Nagios web GUI configuration tools, such as NConf [3].
Infos
[1] Nagios: http://www.nagios.org/
[2] Nagios Service Check Acceptor (NSCA): http://exchange.nagios.org/directory/Addons/Passive-Checks/NSCA--2D-Nagios-Service-Check-Acceptor/details
[3] NConf: http://www.nconf.org/