Monitoring HPC Systems

Nerve Center

Configuring

A few edits need to be made to the gmetad and gmond configuration files. The change to /etc/ganglia/gmetad.conf is simple. First, look for a line in the file that reads

data_source "my cluster" localhost

and change it to

data_source "Ganglia Test Setup" 192.168.1.4

where 192.168.1.4 is the IP address of the master node. You can use any name you want for your Ganglia cluster name, but I chose to make it "Ganglia Test Setup" as an example. The gmond configuration files also need to be modified slightly.
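
The gmetad.conf edit described above can also be scripted. The following is a minimal sketch, assuming a POSIX shell and sed. The function name set_data_source is hypothetical (it is not part of Ganglia), and the sketch assumes the cluster name contains no |, &, or backslash characters, which sed would treat specially.

```shell
# Hypothetical helper, not part of Ganglia.
# Usage: set_data_source FILE "CLUSTER NAME" HOST
set_data_source() {
    conf=$1
    name=$2
    host=$3
    # Keep a copy of the original file next to it.
    cp "$conf" "$conf.bak" || return 1
    # Rewrite only active (uncommented) data_source lines;
    # if the file has several, all of them are replaced.
    sed "s|^data_source .*|data_source \"$name\" $host|" "$conf.bak" > "$conf"
}

# Example (as root):
# set_data_source /etc/ganglia/gmetad.conf "Ganglia Test Setup" 192.168.1.4
```

Because the pattern is anchored at the start of the line, commented-out examples in the file are left alone, and the .bak copy makes it easy to undo the change.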

Only one change is needed in /etc/ganglia/gmond.conf. Find the section that starts with cluster {. In that section, set name to anything you like; just be sure the value is enclosed in double quotes and does not itself contain quotes or other unusual characters. I changed mine like this:

name = "Ganglia Test Setup"
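
If you prefer to script this change, the sketch below shows one way to do it, assuming a POSIX shell and sed, and assuming the cluster section ends with a closing brace at the start of a line. The function name set_cluster_name is hypothetical. It rejects names that contain a double quote, in line with the advice above.

```shell
# Hypothetical helper, not part of Ganglia.
# Usage: set_cluster_name FILE "CLUSTER NAME"
set_cluster_name() {
    conf=$1
    name=$2
    # Refuse names that contain a double quote.
    case $name in
        *\"*) echo "cluster name must not contain quotes" >&2; return 1 ;;
    esac
    # Keep a copy of the original file next to it.
    cp "$conf" "$conf.bak" || return 1
    # Limit the substitution to the lines between "cluster {" and the
    # closing "}" so that name settings in other sections stay untouched.
    sed "/^cluster[[:space:]]*{/,/^}/ s|^\([[:space:]]*name[[:space:]]*=[[:space:]]*\).*|\1\"$name\"|" \
        "$conf.bak" > "$conf"
}

# Example (as root):
# set_cluster_name /etc/ganglia/gmond.conf "Ganglia Test Setup"
```

The address range matters because gmond.conf uses name = in other sections as well (for modules and metrics, for example), and those lines should not be rewritten.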

That's about it for configuring Ganglia on the master node. When I start adding Ganglia clients, I'll have to come back and edit gmetad.conf to add client IP addresses, but that happens later in the article. At this point, I have a choice: I can proceed with installing the Ganglia web interface, or I could test gmond to make sure it's collecting data on the master node. I tend to be a little more conservative and want to run a test before jumping into the deep end of the pool.

Testing gmond and gmetad

To test gmond, I'll run it "by hand" from the command line with the -d option, which sets a debug level and keeps gmond in the foreground. This process can produce a great deal of output, so I'll capture it using the script command (Listing 7). Remember to use Ctrl+C (^c) to kill gmond and then Ctrl+D (^d) to stop the script.

Listing 7

Testing gmond

[root@home4 laytonjb]# cd /tmp
[root@home4 tmp]# script gmond.out
[root@home4 tmp]# gmond -d 5 -c /etc/ganglia/gmond.conf
[root@home4 tmp]# ^c
[root@home4 tmp]# ^d

Take a look at the top of the captured file (gmond.out), and you should see output that looks like Listing 8, which indicates that gmond is working correctly. If everything appears to be running correctly, start the gmetad and gmond daemons and make sure they function correctly (Listing 9). You should see an OK from each of the two commands. If you don't, you have a problem and should go back through the steps.

Listing 8

gmond Test Output

[root@home4 tmp]# gmond -d 5 -c /etc/ganglia/gmond.conf
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
loaded module: python_module
udp_recv_channel mcast_join=239.2.11.71 mcast_if=NULL port=8649 bind=239.2.11.71 buffer=0
socket created, SO_RCVBUF = 124928
tcp_accept_channel bind=NULL port=8649 gzip_output=0
udp_send_channel mcast_join=239.2.11.71 mcast_if=NULL host=NULL port=8649
Unable to find the metric information for 'procs_blocked'. Possible that the module has not been loaded.
Unable to find the metric information for 'procs_created'. Possible that the module has not been loaded.
Unable to find any metric information for 'softirq_(.+)'. Possible that a module has not been loaded.
        metric 'cpu_user' being collected now
[tcp] Starting TCP listener thread...
...
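
When the capture is long, counting a few key line types gives a quick summary of it. The following is a minimal sketch, assuming the message formats shown in Listing 8 (they may differ between gmond versions); summarize_gmond_log is a hypothetical name.

```shell
# Hypothetical helper: summarize a gmond debug capture.
# Usage: summarize_gmond_log FILE
summarize_gmond_log() {
    log=$1
    # Lines reporting a successfully loaded module.
    loaded=$(grep -c '^loaded module:' "$log")
    # Lines reporting a metric that gmond could not find.
    missing=$(grep -c '^Unable to find' "$log")
    echo "modules loaded: $loaded"
    echo "metrics not found: $missing"
}

# Example:
# summarize_gmond_log /tmp/gmond.out
```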

Listing 9

Starting Daemons

[root@home4 ~]# /etc/rc.d/init.d/gmond start
Starting GANGLIA gmond:                                    [  OK  ]
[root@home4 ~]# /etc/rc.d/init.d/gmetad start
Starting GANGLIA gmetad:                                   [  OK  ]

If you got two OKs, then you can also check whether the processes are running and the ports are configured correctly (Listing 10). Notice that port 8649 is in use by gmond, so everything's good at this point. Now I'm ready to install the web interface!

Listing 10

Checking Processes and Ports

[root@home4 ~]# ps -ef | grep -v grep | grep gm
nobody   21637     1  0 18:12 ?        00:00:00 /usr/sbin/gmond
nobody   21656     1  0 18:12 ?        00:00:00 /usr/sbin/gmetad
[root@home4 ~]# netstat -plane | egrep 'gmon|gme'
tcp       0      0 0.0.0.0:8651               0.0.0.0:*                   LISTEN      99         253012     21656/gmetad
tcp       0      0 0.0.0.0:8652               0.0.0.0:*                   LISTEN      99         253013     21656/gmetad
tcp       0      0 0.0.0.0:8649               0.0.0.0:*                   LISTEN      99         252721     21637/gmond
udp       0      0 192.168.1.4:47559          239.2.11.71:8649            ESTABLISHED 99         252723     21637/gmond
udp       0      0 239.2.11.71:8649           0.0.0.0:*                               99         252719     21637/gmond
unix  2     [ ]         DGRAM                   252725 21637/gmond
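
gmond also answers connections on its TCP port (8649 in Listing 10) with an XML description of the cluster state, which offers one more check that data is being collected. The sketch below counts the hosts in that XML. It assumes that nc (netcat) is installed and that gmond writes each HOST element on its own line; count_hosts is a hypothetical name.

```shell
# Hypothetical helper: count lines with an opening <HOST ...> tag
# in gmond XML read from standard input.
count_hosts() {
    grep -c '<HOST '
}

# Example, using the master node address from this article:
# nc 192.168.1.4 8649 | count_hosts
```

With only the master node configured, a count of 1 would be the expected result.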

Web Interface

Ganglia has a second-generation web interface that is very flexible, including the ability to define your own charts. It uses RRDtool as the database for the charts, a common theme in the monitoring world.

You can download it from SourceForge [17] or get it from the Ganglia website. I will be using version 3.5.12, which was the latest version at the time of writing. The web interface requires HTTPD and PHP, so be sure you install those.

Download the compressed TAR file and unpack it. The README for the tool points to a URL for installation instructions. For my installation, I edited Makefile and made just four changes:

(1) At the top of the file, change the GDESTDIR line to:

GDESTDIR = /var/www/html/ganglia

This is where the Ganglia web interface will be installed.

(2) Change the GWEB_STATEDIR line to:

GWEB_STATEDIR = /var/lib/ganglia-web

(3) Change the GMETAD_ROOTDIR line to:

GMETAD_ROOTDIR = /var/lib/ganglia

(4) Change the APACHE_USER line to:

APACHE_USER = apache
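
The four changes can also be applied from the command line. The following is a minimal sketch, assuming a POSIX shell and sed, and assuming each variable is assigned on a single line of the form NAME = value; set_make_var is a hypothetical name.

```shell
# Hypothetical helper: set VAR = VALUE on an existing assignment line.
# Usage: set_make_var MAKEFILE VAR VALUE
set_make_var() {
    file=$1
    var=$2
    value=$3
    # Use | as the sed delimiter because the values contain slashes.
    sed "s|^$var[[:space:]]*=.*|$var = $value|" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Example, run in the unpacked ganglia-web directory:
# set_make_var Makefile GDESTDIR /var/www/html/ganglia
# set_make_var Makefile GWEB_STATEDIR /var/lib/ganglia-web
# set_make_var Makefile GMETAD_ROOTDIR /var/lib/ganglia
# set_make_var Makefile APACHE_USER apache
```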

Once these changes are made, you can simply run make install to install the Ganglia web pieces. Now comes the big test. In your browser, open the URL for the Ganglia web page, http://192.168.1.4/ganglia (recall that in gmetad.conf I told it that the data source was 192.168.1.4). You should see something like the image in Figure 2. Notice on the left-hand side of the image, near the top of the web page, that the number of Hosts up is 1 and that the host has eight CPUs. Plus, the charts are populated. (I took the screen capture after letting it run a while, so the charts actually had real data.)

Figure 2: Ganglia on my desktop.

Remember that the default refresh or polling interval is 15 seconds, so it might take a couple of minutes for the charts to show you much. Be sure to look at the data below the charts. If the values are reasonable, then most likely things are working correctly.
