Lead Image Gustavo Quepón on Unsplash.com

Resource monitoring for remote applications

Natural Resources

Article from ADMIN 41/2017
By
Remora combines profiling and system monitoring to help you get to the root of application problems by revealing its use of resources.

Monitoring systems and profiling applications have long been a passion of mine. In the case of monitoring [1], I've taken the point of view that the system administrator should focus on monitoring the system as a whole and on keeping track of system behavior over time by asking questions such as: "Is it performing as it should?" and "Are the resources being utilized as much as possible?"

In the case of profiling [2], I have focused on individual applications, either serial or parallel [3]. Profiling usually means trying to understand application resource usage patterns by answering questions such as: "How does the application use the CPU?" and "How does the application perform I/O?" Answering these questions is one of the goals of application profiling.

Remora

A very useful HPC tool named REMORA (REsource MOnitoring for Remote Applications; hereafter referred to as Remora) [4] from the Texas Advanced Computing Center (TACC) [5] at the University of Texas at Austin combines monitoring and profiling to provide information about an application. Unlike pure system monitoring or general profiling, it is focused on the user and the user's application, and the results are intended to help the user understand the resources that were used to run an application.

Remora is not strictly a profiler, and it's not strictly a monitoring tool in the traditional sense of monitoring the entire cluster. Rather, it provides per-node and per-job resource utilization data. This data can be used to understand how the application performs on the system. As a result, changes can be made to certain aspects of the code or how it was run. The data collected by Remora can be used to improve code performance or detect issues (profiling and monitoring). Additionally, users can go back and examine their resource usage, in the event that something changes in the application or at run time.

Moreover, the information can be used by administrators to understand how users are utilizing resources. For example, the information can be used to determine how many cores, how much memory, how much I/O, and so on were used while running an application. This information can be used to adjust how resources are scheduled.

The keys to Remora are its simplicity, its use of commonly installed tools, and its focus on the user, which puts data and information in the user's hands. The data can also be used by admins in a collective way to understand how the system is being used.

Data Streams

The key focus of Remora is to provide a run-time resource monitoring tool for users. It provides high-level information and detailed statistics to the user when an application is executed. This data is collected and put into a subdirectory, along with an HTML file that can be used for plotting the results.

Remora collects several streams of information:

  • Memory usage (CPUs, Xeon Phi, and Nvidia GPUs)
  • CPU utilization
  • I/O usage (Lustre, DVS)
  • NUMA properties
  • Network topology
  • MPI communication statistics
  • Power consumption
  • CPU temperatures
  • Detailed application timing

To capture all of this information, Remora uses SSH to connect to all of the nodes used by the application. It spawns a background task on each of these nodes and regularly captures the data. However, the I/O data is only captured on the master node of the application.
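The "background task that regularly captures data" idea can be sketched on a single machine. The following is a minimal local analogue, not Remora's code: it uses a thread instead of an SSH-spawned process, and the names sample_periodically, run_with_monitor, collect, and interval_s are hypothetical.

```python
import threading
import time


def sample_periodically(collect, interval_s, stop_event, samples):
    """Record (timestamp, collect()) pairs until stop_event is set."""
    while True:
        samples.append((time.time(), collect()))
        # wait() returns True as soon as the event is set, so the
        # monitor stops promptly instead of sleeping a full interval.
        if stop_event.wait(interval_s):
            break


def run_with_monitor(work, collect, interval_s=1.0):
    """Run work() while a background thread gathers samples."""
    stop_event = threading.Event()
    samples = []
    monitor = threading.Thread(
        target=sample_periodically,
        args=(collect, interval_s, stop_event, samples),
        daemon=True,
    )
    monitor.start()
    try:
        result = work()
    finally:
        # Always stop and reap the monitor, even if work() raises.
        stop_event.set()
        monitor.join()
    return result, samples
```

Here work stands in for the application and collect for whatever reads a data source; the sampling interval is the only tuning knob in this sketch.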

No special applications are used by Remora to gather the information. Rather, existing tools are used, along with information parsed from the /proc filesystem. A partial list of the tools and data sources used includes:

  • numastat [6]
  • mpstat [7] (a personal favorite)
  • nvidia-smi [8]
  • ibtracert [9]
  • ibstatus [10]
  • xltop [11]
  • Python [12]
  • /proc/meminfo
  • /proc/[pid]/status
  • /proc/sys/lnet/stats
  • /sys/class/infiniband
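To illustrate how little is needed to read one of the data sources listed above, here is a minimal sketch that parses /proc/meminfo-style text. It is an illustration only and makes no claim about how Remora itself parses the file; the function names parse_meminfo and read_meminfo are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Fields with a "kB" suffix stay in kB; unitless fields
    (for example HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

On a Linux node, read_meminfo()["MemFree"] would return the free memory in kB at the moment of the call; sampling it repeatedly yields a memory usage trace.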

Remora uses these tools and data sources to collect information at a specific interval while the application runs. It only collects the information associated with the application. In the case of message-passing interface (MPI) applications, it obtains the list of host nodes from environment variables and uses that list to SSH into the nodes and gather data.
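The step of turning a node list into SSH invocations might look like the following sketch. The article does not say which environment variables are read, so the sketch assumes a scheduler-style host file with one host name per line, possibly repeated once per core; unique_hosts and ssh_command are hypothetical helpers, and nothing is executed.

```python
import shlex


def unique_hosts(hostfile_text):
    """Return host names in first-seen order, dropping repeats and blanks."""
    seen = set()
    hosts = []
    for line in hostfile_text.splitlines():
        host = line.strip()
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


def ssh_command(host, remote_args):
    """Build an argument list suitable for subprocess.run().

    The remote arguments are quoted and joined into one string,
    because ssh hands a single command line to the remote shell.
    """
    return ["ssh", host, shlex.join(remote_args)]
```

A caller could then loop over unique_hosts(...) and pass each ssh_command(host, ["cat", "/proc/meminfo"]) to subprocess.run to gather data from every node.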

When Remora is finished, it creates a directory in the form remora-XXX in the directory in which the application was run. Subdirectories contain the raw data, and an HTML page lets you examine and plot the data.

When run, Remora collects data from as many sources as it can find. For example, if it detects that Lustre [13] is installed, it will grab data for that. If it detects the presence of an InfiniBand network, it will collect data for that. If it doesn't detect a source, it can't gather data for it or create a chart.
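This detect-then-collect behavior can be sketched as a set of availability checks. The specific checks and the source names below are assumptions chosen from the tools and paths listed earlier; they are not taken from Remora's code, and detect_sources is a hypothetical function.

```python
import os
import shutil


def detect_sources(path_exists=os.path.exists, find_tool=shutil.which):
    """Report which data sources appear to be available on this node.

    The two parameters exist only so the checks can be exercised
    without real hardware; by default the live system is probed.
    """
    checks = {
        "infiniband": lambda: path_exists("/sys/class/infiniband"),
        "lustre_lnet": lambda: path_exists("/proc/sys/lnet/stats"),
        "nvidia_gpu": lambda: find_tool("nvidia-smi") is not None,
        "numa": lambda: find_tool("numastat") is not None,
    }
    return {name: bool(check()) for name, check in checks.items()}
```

A collector would then skip any source whose entry is False, which is why no data or chart appears for hardware that is not present.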

Installing Remora

Installing Remora is not difficult; the approach is slightly different from the usual ./configure; make; make install. You also need to be aware that because Remora can provide MPI statistics, you need to build it with the intended version of MPI (i.e., don't cross MPIs). I built Remora with the command,

REMORA_INSTALL_PREFIX=/home/laytonjb/bin/remora-1.8.2 ./install.sh

which installs to a directory in my home account. If more than one user is to have access, you can install Remora in a common directory.

If you use multiple versions of MPI, you need to build Remora for each version. If you are using environment modules (e.g., Lmod [14]), you can write a module for Remora, so it is added to the environment when the corresponding MPI module is loaded.
