Remora – Resource Monitoring for Users

Summary

HPC admins are always looking for better ways to monitor the systems for which they are responsible, both by understanding how the hardware is operating and by seeing how user applications are performing. Many hardware and software tools and techniques are available, including ones that coordinate monitoring with resource managers (job schedulers), but all of them are oriented toward administrators.

Users have precious few tools to monitor the resources their applications are using. With “application telemetry” information, users can understand the behavior pattern of their application, whether it seems to be performing correctly or incorrectly, what resources it consumed, and how well the application is balanced across several nodes in the system, or even within a single node.

Remora from TACC can gather this information for you and create plots to help guide you to a better understanding of your application without noticeably affecting its performance. Typically, the system administrator installs Remora, but users can install it in their own accounts as well.

You can tune the Remora installation, particularly with regard to what is monitored. Once Remora is installed, you just put the command remora before the command that runs the application, and you start gathering information. A few environment variables adjust how Remora gathers the data, but for the most part, it just silently gathers the data for you.
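
As a rough illustration of the usage pattern just described, the following Python sketch builds a command line with remora placed in front of the application command and sets one environment variable for the run. The helper names (remora_command, remora_env, run_with_remora) and the application path are hypothetical, and the variable name REMORA_PERIOD (the sampling interval in seconds) is an assumption that should be checked against the documentation for your installed version of Remora.

```python
import os
import subprocess


def remora_command(app_cmd):
    """Return the application's command line with `remora` placed in front of it."""
    return ["remora", *app_cmd]


def remora_env(period_s=None, base_env=None):
    """Copy an environment and optionally set the sampling period.

    REMORA_PERIOD is an assumed variable name; verify it against your
    Remora installation before relying on it.
    """
    env = dict(os.environ if base_env is None else base_env)
    if period_s is not None:
        env["REMORA_PERIOD"] = str(period_s)
    return env


def run_with_remora(app_cmd, period_s=None):
    """Run the application under Remora and return its exit code."""
    result = subprocess.run(
        remora_command(app_cmd),
        env=remora_env(period_s),
        check=False,
    )
    return result.returncode


# Hypothetical usage (requires Remora on the PATH):
# run_with_remora(["./my_app", "--input", "data.in"], period_s=5)
```

On a cluster, the same pattern would normally appear directly in the job script rather than in Python; the sketch only makes the "prefix the command, optionally set a variable" idea explicit.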

Remora is a great tool for users who want an idea of their application's resource usage. It is not a pure profiler; rather, it combines profiling with system monitoring. Remora is easy to install, is fairly light on resource usage, and can be a great help to users.