Monitoring HPC Systems
Nerve Center
When you know better, you do better – Maya Angelou
Monitoring a cluster and understanding how it is performing is key to helping users run their applications well and to optimizing the use of cluster resources.
Such information is valuable for a variety of reasons: it tells you how the cluster is being used, how much of the processing capability and memory is going to user applications, and what the network is doing and whether applications are driving its traffic. This information can help you understand where you need to make changes in the configuration of the current cluster to improve the utilization of resources. Moreover, it can help you plan for the next cluster.
In a past blog post, I looked at monitoring from the perspective of understanding what is happening in the system [1] (metrics) and how important the frequency at which you sample those metrics can be.
If you put several cluster admins in a room together (e.g., the BeoBash [2]), and you ask, "What is the best way to monitor a cluster?" you will have to duck and cover pretty quickly from the huge number of opinions and the great passion behind the answers. Having so many options and opinions is not a bad thing, but you need to sort through the ideas to find something that works for you and your situation.
In two further blog posts [3] [4], I wrote some simple scripts to measure metrics on a single server as a starting point for use in a cluster. The code collected data on a per-node basis for the processes of interest.
Now it's time to look at monitoring frameworks where, I hope, the scripts will be useful for custom monitoring and perhaps provide a nice visual representation of the state of the cluster.
A non-exhaustive list of monitoring frameworks that people use to monitor system processes includes the following [5]-[12]:
- Monitorix
- Munin
- Cacti
- Ganglia
- Zabbix
- Zenoss Community
- Observium
- GKrellM
As you can see, you have a wide range of options, including commercial tools.
Ganglia is arguably the most popular framework, particularly in the HPC world, but it is also gaining popularity in the Big Data and private cloud worlds. In this article, I present Ganglia as a monitoring framework.
A Few Words
Ganglia has been in use since about 2001. The HPC world adopted it because of the sheer size of the systems involved, and in the past few years the Big Data and Hadoop communities have been using it a great deal, primarily for its scalability and extensibility. The OpenStack and cloud communities frequently use it, too.
Ganglia has grown over the years and has gained the ability to monitor very large systems – into the 1,000-node range – as well as the ability to monitor close to 1,000 metrics for each system. You can run Ganglia on a number of different platforms, making it truly flexible. Additionally, it can use custom metrics written in a variety of languages including C, C++, and Python.
Ganglia has also gained a new web interface with custom graphs. At a high level, Ganglia comprises three parts. The first part is gmond, which is the part of Ganglia that gathers metrics on the servers to be monitored. Gmond shares its metrics with cluster peers using a simple "listen/announce" protocol via XDR (External Data Representation). By default, the announcements are made via multicast (port 8649 by default). At the same time, each gmond also listens to the announcements from its peers, so that every node knows the metrics of all of the other nodes. The advantage of this architecture is that the master node only needs to communicate with one node instead of having to communicate with every single node. It also makes Ganglia more robust because, if a node dies, you can just talk to the other nodes to access the information.
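To make the listen/announce arrangement concrete, the relevant stanzas of gmond.conf look something like the following sketch. It is modeled on the shipped defaults; the cluster name is a placeholder you would change for your site:

/* each gmond announces its own metrics on a multicast channel */
cluster {
  name = "my-cluster"          /* placeholder cluster name */
}
udp_send_channel {
  mcast_join = 239.2.11.71     /* default multicast group */
  port = 8649                  /* default Ganglia port */
  ttl = 1
}
/* ...and listens on the same channel for its peers' announcements */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
/* gmetad (or anything else) can pull the aggregated state over TCP */
tcp_accept_channel {
  port = 8649
}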
The second part is called gmetad, which you install on the master node. With a list of gmond nodes, it just polls a gmond-equipped node in a cluster and gathers the data. This data is stored using RRDtool [13]. The third piece is called gweb and is the Ganglia web interface. This second-generation web interface offers custom graphs that you can create to match your situation and needs.
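The polling side is equally simple: gmetad.conf needs little more than a data_source line per cluster. Here is a minimal sketch; the cluster name and hostnames are placeholders:

# gmetad tries the listed gmond nodes in order until one answers,
# so listing two or three nodes gives you redundancy
data_source "my-cluster" node001 node002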
Installing on the Master Node
When working with a new tool, you have two choices: build from source or install pre-built binaries. Typically, I like to build from source the first time I use a new tool, so I can better understand the dependencies and how the tool is built. I spent some time building Ganglia, following a blog post by Sachin Sharma as a guide [14]. You can find my trail of tears in the ganglia-devel mailing list. It's not as easy as it seems, particularly if you want to include the Python modules. Additionally, it appears that no one really uses the default installation locations when building the code.
After a great deal of frustration, I finally gave up on building Ganglia by myself. I hate to admit defeat, but Ganglia takes the first round. Fortunately, one of the Ganglia developers, Vladimir Vuksan [15], has some pre-built binaries that saved my bacon. The specific versions of the system software I use are listed in Table 1.
Table 1
Software
Software | Version
---|---
CentOS | 6.5
Ganglia | 3.6.0
Ganglia web | 3.5.12
Confuse | 2.7
RRDtool | 1.3.8
Before installing any binaries, I try to install the prerequisites rather than totally rely on RPM or Yum to resolve dependencies. I think this forces me to pay closer attention to the software rather than installing things willy-nilly. Thus, I installed the following packages:
yum install php
yum install httpd
yum install apr
yum install libconfuse
yum install expat
yum install pcre
yum install libmemcached
yum install rrdtool
The php and httpd packages are used for the web interface. I also recommend turning off SELinux, and for the purposes of this exercise I turned off iptables as well; the commands I used are sketched below. If you need to keep iptables turned on, please refer to the bottom of Sharma's blog [14] for details on how to configure iptables rules for Ganglia. If you get stuck, please email the Ganglia mailing list and ask for help.
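On CentOS 6, disabling both for the duration of the exercise looks roughly like this (a sketch; adjust to your site's security policy):

# put SELinux into permissive mode for the current session
setenforce 0
# to make the change survive a reboot, set SELINUX=permissive
# (or disabled) in /etc/selinux/config
# stop the firewall now and keep it from starting at boot
service iptables stop
chkconfig iptables off

Now I'm ready to install Ganglia!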
Configure/Compile/Install
Vladimir Vuksan has also created a set of CentOS 6 RPMs for the latest version of Ganglia [16]. These RPMs install on CentOS 6.5 and make your life a great deal easier. I just used the rpm command and pointed it at the URLs for the RPMs. For the master node, I started with gmond (Listing 1). Next, I installed gmetad on the master node (Listing 2). Remember, for a basic configuration, you only need to install gmetad on one node, typically the master node.
Listing 1
gmond Modules
rpm -ivh http://vuksan.com/centos/RPMS-6/x86_64/ganglia-gmond-modules-python-3.6.0-1.x86_64.rpm \
    http://vuksan.com/centos/RPMS-6/x86_64/libganglia-3.6.0-1.x86_64.rpm \
    http://vuksan.com/centos/RPMS-6/x86_64/ganglia-gmond-3.6.0-1.x86_64.rpm
Retrieving http://vuksan.com/centos/RPMS-6/x86_64/ganglia-gmond-modules-python-3.6.0-1.x86_64.rpm
Retrieving http://vuksan.com/centos/RPMS-6/x86_64/libganglia-3.6.0-1.x86_64.rpm
Retrieving http://vuksan.com/centos/RPMS-6/x86_64/ganglia-gmond-3.6.0-1.x86_64.rpm
Preparing...                ########################################### [100%]
   1:libganglia             ########################################### [ 33%]
   2:ganglia-gmond          ########################################### [ 67%]
   3:ganglia-gmond-modules-p########################################### [100%]
Listing 2
gmetad
[root@home4 RPMS]# rpm -ivh http://vuksan.com/centos/RPMS-6/x86_64/ganglia-gmetad-3.6.0-1.x86_64.rpm
Retrieving http://vuksan.com/centos/RPMS-6/x86_64/ganglia-gmetad-3.6.0-1.x86_64.rpm
Preparing...                ########################################### [100%]
   1:ganglia-gmetad         ########################################### [100%]
Ganglia installed very easily using these binaries, but I still had to do some configuration before declaring victory. Before configuring Ganglia, however, I wanted to find out where certain components from the RPMs landed in the filesystem. The binaries gmond and gmetad were installed in /usr/sbin. (Note: If you build from source, they go into /usr/local/sbin.) You can use the command whereis gmond to see where it is installed.
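For example (a sketch of what you might see, assembled from the paths discussed in this section; the output will vary with your installation):

$ whereis gmond
gmond: /usr/sbin/gmond /etc/ganglia/gmond.conf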
The man pages were installed in the usual location, /usr/share/man. You can check this by running man gmond. The RPMs also installed init scripts into /etc/rc.d/init.d/ that are used for starting, stopping, restarting, and checking the status of the gmond and gmetad processes. The scripts use Ganglia configuration files located in /etc/ganglia. To see the files in this directory, use the tree command (Figure 1); a sketch of the layout appears below. (Note: You may have to install this command with yum install tree.)
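Based on the files discussed in this section, the layout looks roughly like this (a sketch; your conf.d contents may differ):

/etc/ganglia
|-- conf.d
|   |-- modpython.conf   # wires in the Python module interface (Listing 3)
|   `-- *.pyconf         # one configuration file per Python metric module
|-- gmetad.conf
`-- gmond.conf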
Ganglia has two files on which to direct your focus: gmond.conf and gmetad.conf. The subdirectory conf.d contains several configuration files that tell Ganglia about Python modules for collecting information about the system.
The file /etc/ganglia/conf.d/modpython.conf contains the details of the Python metric modules. On my system, this file looks like Listing 3. It tells you the Python metric module code is stored in /usr/lib64/ganglia/python_modules. A quick peek at this directory is shown in Listing 4. I won't list any of the Python code, but this is where you will put any Python metrics you write. Before you can use the Python metrics, you have to tell Ganglia about them in the *.pyconf files in /etc/ganglia/conf.d (see Figure 1). In these files, you define the metrics and how often to collect them (remember my blog post [1] about paying attention to how often you collect monitoring information); a sketch of such a file follows.
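As a sketch, a *.pyconf file looks something like the following. It is modeled on the example module that ships with Ganglia; the module, parameter, and metric names here are illustrative:

modules {
  module {
    name = "example"        # module file: example.py in python_modules/
    language = "python"
    # param blocks are handed to the module's metric_init() function
    param RandomMax {
      value = 600
    }
  }
}

collection_group {
  collect_every = 10     # how often gmond polls the module (seconds)
  time_threshold = 50    # max seconds before the value is re-announced
  metric {
    name = "Random_Numbers"   # must match a descriptor in the module
    title = "Random Numbers"
    value_threshold = 70
  }
}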
Listing 3
modpython.conf
[laytonjb@home4 ~]$ more /etc/ganglia/conf.d/modpython.conf
/*
  params - path to the directory where mod_python
  should look for python metric modules

  the "pyconf" files in the include directory below
  will be scanned for configurations for those modules
*/
modules {
  module {
    name = "python_module"
    path = "modpython.so"
    params = "/usr/lib64/ganglia/python_modules"
  }
}

include ("/etc/ganglia/conf.d/*.pyconf")
Listing 4
python_modules Directory
[laytonjb@home4 ~]$ ls -s /usr/lib64/ganglia/python_modules
total 668
16 apache_status.py*   4 example.pyo           8 netstats.pyc      16 tcpconn.py
12 apache_status.pyc  16 memcached.py          8 netstats.pyo       8 tcpconn.pyc
12 apache_status.pyo  12 memcached.pyc        12 nfsstats.py        8 tcpconn.pyo
 8 DBUtil.py*         12 memcached.pyo         8 nfsstats.pyc       8 traffic1.py
12 DBUtil.pyc         16 mem_stats.py          8 nfsstats.pyo       8 traffic1.pyc
12 DBUtil.pyo          8 mem_stats.pyc        16 procstat.py        8 traffic1.pyo
 8 diskfree.py         8 mem_stats.pyo        12 procstat.pyc      32 varnish.py*
 4 diskfree.pyc        8 multidisk.py         12 procstat.pyo      16 varnish.pyc
 4 diskfree.pyo        4 multidisk.pyc         4 redis.py*         16 varnish.pyo
12 diskstat.py         4 multidisk.pyo         4 redis.pyc         28 vm_stats.py
12 diskstat.pyc        8 multi_interface.py    4 redis.pyo         12 vm_stats.pyc
12 diskstat.pyo        8 multi_interface.pyc  12 riak.py*          12 vm_stats.pyo
 4 entropy.py          8 multi_interface.pyo   8 riak.pyc           4 xenstats.py*
 4 entropy.pyc        28 mysql.py*             8 riak.pyo           4 xenstats.pyc
 4 entropy.pyo        24 mysql.pyc             8 spfexample.py      4 xenstats.pyo
 4 example.py         24 mysql.pyo             4 spfexample.pyc
 4 example.pyc         8 netstats.py           4 spfexample.pyo
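If you want to write your own metric, the modules in this directory all follow the same shape: gmond calls metric_init() once to get a list of metric descriptors, then invokes each descriptor's callback whenever it needs a fresh value. Here is a minimal sketch; the metric name, callback, and /proc/loadavg data source are my own illustrative choices, not part of the stock modules:

# minimal sketch of a Ganglia Python metric module; descriptor fields
# follow the modules shipped in python_modules/
descriptors = []

def metric_handler(name):
    """Callback gmond invokes to read the current metric value."""
    # illustrative choice: report the 1-minute load average
    with open('/proc/loadavg') as f:
        return float(f.read().split()[0])

def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    global descriptors
    descriptors = [{
        'name': 'my_load_one',        # metric name referenced in *.pyconf
        'call_back': metric_handler,  # function gmond calls for a value
        'time_max': 90,               # max seconds between callback calls
        'value_type': 'float',        # string | uint | float | double
        'units': 'load',
        'slope': 'both',
        'format': '%f',
        'description': 'Example 1-minute load average metric',
        'groups': 'example',
    }]
    return descriptors

def metric_cleanup():
    """Called once when gmond shuts down."""
    pass

if __name__ == '__main__':
    # quick standalone test outside of gmond
    for d in metric_init({}):
        print('%s = %s' % (d['name'], d['call_back'](d['name'])))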
The RPMs also add gmond and gmetad to chkconfig, so you can control whether they start on boot or whether you have to start them manually. You can check this by running the chkconfig --list command and examining the output (Listing 5). I cut out a lot of the output, but as you can tell, gmetad and gmond are configured to start the next time the system is booted into runlevels 2 through 5 (i.e., runlevels 2 through 5 are on). The Ganglia libraries were installed in /usr/lib64/ (Listing 6). When you first install Ganglia, the gmond and gmetad services are not yet running; they start when the system is rebooted, or you can start them by hand with the init scripts (see the sketch after Listing 6). In the next section, I begin to configure Ganglia by editing the files gmond.conf and gmetad.conf.
Listing 5
chkconfig --list
[laytonjb@home4 ~]$ chkconfig --list
...
gmetad         0:off  1:off  2:on  3:on  4:on  5:on  6:off
gmond          0:off  1:off  2:on  3:on  4:on  5:on  6:off
...
Listing 6
Ganglia Libraries
[laytonjb@home4 ~]$ ls -lstar /usr/lib64/libganglia*
104 -rwxr-xr-x 1 root root 106096 May  7  2013 /usr/lib64/libganglia-3.6.0.so.0.0.0*
  0 lrwxrwxrwx 1 root root     25 Feb 10 17:29 /usr/lib64/libganglia-3.6.0.so.0 -> libganglia-3.6.0.so.0.0.0*
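If you don't want to wait for a reboot, the init scripts installed by the RPMs can start the daemons immediately (a sketch; run as root):

# start the metric collector and the aggregator now
service gmond start
service gmetad start
# verify that both daemons came up
service gmond status
service gmetad status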