Resource Monitoring For Remote Applications
Monitoring systems and profiling applications have long been a passion of mine. In the case of monitoring, I've taken the point of view that the system administrator should focus on monitoring the system as a whole and on keeping track of system behavior over time by asking questions such as, “Is it performing as it should?” and “Are the resources being utilized as much as possible?”
In the case of profiling, I have focused on individual applications, either serial or parallel. Profiling usually means trying to understand application resource usage patterns by answering questions such as, “How does the application use the CPU?” and “How does the application perform I/O?” Answering such questions is one of the main goals of application profiling.
Remora
A very useful HPC tool named REMORA (REsource MOnitoring for Remote Applications; hereafter referred to as Remora), from the Texas Advanced Computing Center (TACC) at the University of Texas at Austin, combines monitoring and profiling to provide information about an application. Unlike pure system monitoring or general profiling, it is focused on the user and the user’s application, and the results are intended to help the user understand the resources that were used to run an application.
Remora is not strictly a profiler, and it's not strictly a monitoring tool in the traditional sense of monitoring the entire cluster. Rather, it provides per-node and per-job resource utilization data. This data can be used to understand how the application performs on the system. As a result, changes can be made to certain aspects of the code or how it was run. The data collected by Remora can be used to improve code performance or detect issues (profiling and monitoring). Additionally, users can go back and examine their resource usage, in the event that something changes in the application or at run time.
Moreover, administrators can use the information to understand how users are utilizing resources. For example, it can show how many cores, how much memory, how much I/O, and so on were used while running an application, which in turn can inform how resources are scheduled.
The keys to Remora are its simplicity, its use of commonly installed tools, and its focus on the user, which puts data and information in the user's hands. The data can also be used by admins in a collective way to understand how the system is being used.
Data Streams
The key focus of Remora is to provide a run-time resource monitoring tool for users. It provides high-level information and detailed statistics to the user when an application is executed. This data is collected and put into a subdirectory, along with an HTML file that can be used for plotting the results.
Remora collects several streams of information:
- Memory usage (CPUs, Xeon Phi, and Nvidia GPUs)
- CPU utilization
- I/O usage (Lustre, DVS)
- NUMA properties
- Network topology
- MPI communication statistics
- Power consumption
- CPU temperatures
- Detailed application timing
To capture all of this information, Remora uses SSH to connect to all of the nodes used by the application. It spawns a background task on each of these nodes and regularly captures the data. However, the I/O data is only captured on the master node of the application.
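The idea of a background task that regularly captures data can be illustrated with a minimal Python sketch. This is not Remora's implementation; the PeriodicSampler class, its collect callback, and the interval_s parameter are hypothetical names chosen for illustration.

```python
import threading
import time


class PeriodicSampler:
    """Call `collect` in a background thread at a fixed interval (illustrative only)."""

    def __init__(self, collect, interval_s=1.0):
        self.collect = collect          # hypothetical callback returning one sample
        self.interval_s = interval_s    # seconds between samples
        self.samples = []               # list of (timestamp, value) pairs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            # Take a sample first, so at least one sample always exists.
            self.samples.append((time.time(), self.collect()))
            # wait() returns True as soon as stop() sets the event.
            if self._stop.wait(self.interval_s):
                break

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


if __name__ == "__main__":
    sampler = PeriodicSampler(lambda: 42, interval_s=0.05)
    sampler.start()
    time.sleep(0.2)   # stands in for the application's run time
    sampler.stop()
    print(len(sampler.samples), "samples collected")
```

Using an Event for the wait, rather than a plain sleep, lets the sampler stop promptly when the application finishes instead of waiting out the remainder of the interval.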
No special applications are used by Remora to gather the information. Rather, existing tools are used, along with information parsed from the /proc and /sys filesystems. A partial list of the tools and data sources used includes:
- numastat
- mpstat (a personal favorite)
- nvidia-smi
- ibtracert
- ibstatus
- xltop
- python
- /proc/meminfo
- /proc/[pid]/status
- /proc/sys/lnet/stats
- /sys/class/infiniband
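To give a sense of how little is needed to read one of these data sources, the following Python sketch parses text in the /proc/meminfo format into a dictionary. It is an illustration under my own assumptions, not code taken from Remora; the function names parse_meminfo and read_meminfo and the SAMPLE text are made up for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g., HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


# Made-up sample in the same layout as /proc/meminfo.
SAMPLE = """MemTotal:       16384000 kB
MemFree:         8192000 kB
HugePages_Total:       0
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("used kB:", info["MemTotal"] - info["MemFree"])
```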
Remora uses these tools and data sources to collect information at a specific interval while the application runs. It only collects the information associated with the application. In the case of message-passing interface (MPI) applications, it grabs the host node list from environment variables and uses that list to SSH into the nodes and gather data.
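The host list step might look roughly like the Python sketch below. It assumes the scheduler exposes a host file through an environment variable; PBS_NODEFILE is used here only as one example of such a variable, and the article does not say which variables Remora actually reads. The helper names and the remote command are hypothetical, and the sketch only builds the ssh command lines without running them.

```python
import os


def unique_hosts(hostfile_text):
    """Return host names in first-seen order, ignoring blanks, comments, and duplicates."""
    hosts = []
    for line in hostfile_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        host = parts[0]  # ignore trailing fields such as "slots=4"
        if host not in hosts:
            hosts.append(host)
    return hosts


def hosts_from_env(var="PBS_NODEFILE", environ=None):
    """Read the host file named by an environment variable (assumed convention)."""
    environ = os.environ if environ is None else environ
    path = environ.get(var)
    if not path:
        return []
    with open(path) as handle:
        return unique_hosts(handle.read())


def ssh_command(host, remote_command):
    """Build (but do not run) an ssh invocation for one node."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


if __name__ == "__main__":
    sample = "node01\nnode01\nnode02 slots=4\n"
    for host in unique_hosts(sample):
        # "collect_stats.sh" is a hypothetical collector script name.
        print(" ".join(ssh_command(host, "collect_stats.sh")))
```

The resulting lists could be handed to subprocess.Popen to start one collector per node, which requires passwordless SSH between the nodes.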
When Remora is finished, it creates a directory in the form remora-XXX in the directory in which the application was run. Subdirectories contain the raw data, and an HTML page lets you examine and plot the data.
When run, Remora collects data from as many sources as it can find. For example, if it detects that Lustre is installed, it will grab data for that. If it detects the presence of an InfiniBand network, it will collect data for that. If it doesn't detect a source, it can't gather data or create a chart for it.
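This kind of detection can be approximated in a few lines of Python by checking whether a path exists or a tool is on the PATH. The mapping below reuses paths and tool names from the list earlier in the article, but which check corresponds to which data stream is my own assumption, and detect_sources is a hypothetical function, not part of Remora.

```python
import os
import shutil


def detect_sources(path_exists=os.path.exists, which=shutil.which):
    """Return the names of data sources that appear to be available on this node.

    The two callables are injectable so the logic can be exercised without
    the real hardware or tools being present.
    """
    checks = {
        # Assumed mapping of stream name to availability check.
        "memory": path_exists("/proc/meminfo"),
        "cpu": which("mpstat") is not None,
        "numa": which("numastat") is not None,
        "gpu": which("nvidia-smi") is not None,
        "lustre": path_exists("/proc/sys/lnet/stats"),
        "infiniband": path_exists("/sys/class/infiniband"),
    }
    return [name for name, available in checks.items() if available]


if __name__ == "__main__":
    print("available sources:", detect_sources())
```

A collector built this way simply skips whatever is missing, which matches the behavior described above: no source, no data, no chart.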