Resource monitoring for remote applications

Natural Resources

Using Remora

Remora is very simple to use: Just prepend it to your original command. For example, a simple command line for application ./myapp.exe would become:

$ remora ./myapp.exe

In the case of MPI code, a command line would be something like

$ remora mpirun ... ./mpiapp.exe

if the original command was mpirun ... ./mpiapp.exe. Notice that both commands are run as a regular user, not root, which reflects the design of Remora: a focus on users and on providing them with useful information.
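If you launch jobs from a script rather than by hand, the same "prepend remora" idea can be sketched in Python. This is only an illustration: the helper names with_remora and run_monitored are hypothetical and not part of Remora, and the sketch assumes the remora command is on your PATH.

```python
import shlex
import shutil
import subprocess


def with_remora(command):
    """Return the argument list for `command` with `remora` prepended."""
    # Accept either a shell-style string or an already split argument list.
    args = shlex.split(command) if isinstance(command, str) else list(command)
    return ["remora"] + args


def run_monitored(command):
    """Run `command` under Remora as the current user; return its exit code."""
    # Assumption: remora is installed and visible on PATH.
    if shutil.which("remora") is None:
        raise FileNotFoundError("remora not found on PATH")
    return subprocess.run(with_remora(command)).returncode


print(with_remora("./myapp.exe"))  # ['remora', './myapp.exe']
```

Calling run_monitored("./myapp.exe") would then behave like typing the first command above at the prompt.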

The next example uses Fortran 90 code for a simple serial Poisson solver on a rectangular grid (poisson_serial.f90) [15] [16]. Remora captures data every 10 seconds by default, so you need to adjust a few parameters in the Fortran program to get a run time long enough to collect a useful number of samples:

  • nx = 8000
  • ny = 8000
  • it_max = 10000
  • tolerance = 0.00004D+00

The code was compiled using GCC 7.1 and run on a four-core AMD A6-6310 laptop (Lenovo-G50-45). The output from the code and a summary from Remora are shown in Listing 1. Notice that the summary reports the maximum memory used per node, as well as the run time of the application and the sampling period. It also lists the directory with the Remora output.

Listing 1

poisson_serial.f90 Output

[laytonjb@laytonjb REMORA_TEST]$ remora ./poisson_serial
23 August 2017   7:12:50.609 PM
POISSON_SERIAL:
  FORTRAN90 version
  A program for solving the Poisson equation.
  -DEL^2 U = F(X,Y)
  on the rectangle 0 <= X <= 1, 0 <= Y <= 1.
  F(X,Y) = pi^2 * ( x^2 + y^2 ) * sin ( pi * x * y )
  The number of interior X grid points is     8000
  The number of interior Y grid points is     8000
  The X grid spacing is     0.0001
  The Y grid spacing is     0.0001
  RMS of F =    5.99663
  RMS of exact solution =   0.622184
  Step    ||Unew||     ||Unew-U||     ||Unew-Exact||
      0    0.111796E-01                    0.622083
      1    0.115237E-01    0.279491E-02    0.622039
      2    0.119603E-01    0.156240E-02    0.622010
      3    0.123543E-01    0.113207E-02    0.621986
      4    0.127060E-01    0.904517E-03    0.621966
      5    0.130230E-01    0.761265E-03    0.621948
      6    0.133121E-01    0.661767E-03    0.621931
      7    0.135782E-01    0.588130E-03    0.621916
      8    0.138253E-01    0.531152E-03    0.621901
      9    0.140562E-01    0.485586E-03    0.621888
     10    0.142734E-01    0.448208E-03    0.621875
...
    246    0.266937E-01    0.402086E-04    0.620868
    247    0.267182E-01    0.400864E-04    0.620866
    248    0.267427E-01    0.399651E-04    0.620863
  The iteration has converged,
POISSON_SERIAL:
  Normal end of execution.
23 August 2017   7:21:31.215 PM
==================== REMORA SUMMARY ====================
 Max Memory Used Per Node  : 31.55 GB
*** REMORA: WARNING - Free memory per node close to zero.
 Total Elapsed Time        : 0d 0h 8m 40s 632ms
========================================================
 Sampling Period           : 10 seconds
 Complete Report Data      : /home/laytonjb/REMORA_TEST/remora_1503529969
 Graphical Results At      : /home/laytonjb/REMORA_TEST/remora_1503529969/remora_summary.html
========================================================
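The elapsed time and sampling period in the summary give a rough idea of how many samples a run produces, which is why the problem size was enlarged. The following is a minimal Python sketch, assuming the "Total Elapsed Time" format shown in Listing 1; the function names are illustrative and not part of Remora, and whether Remora also samples at time zero is not something the listing tells you.

```python
import re

# Milliseconds per unit used in the "Total Elapsed Time" line.
UNIT_MS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def elapsed_ms(text):
    """Convert a string such as '0d 0h 8m 40s 632ms' to milliseconds."""
    # "ms" is listed first so that "632ms" is not read as 632 minutes.
    pairs = re.findall(r"(\d+)\s*(ms|d|h|m|s)\b", text)
    return sum(int(value) * UNIT_MS[unit] for value, unit in pairs)


def full_periods(text, period_s=10):
    """Number of complete sampling periods that fit in the elapsed time."""
    return elapsed_ms(text) // (period_s * 1_000)


elapsed = "0d 0h 8m 40s 632ms"     # copied from Listing 1
print(elapsed_ms(elapsed) / 1000)  # 520.632
print(full_periods(elapsed))       # 52
```

With the default 10-second period, this run spans about 52 sampling periods; a run that finishes in a few seconds would yield almost no data points.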

Remora creates a subdirectory to contain the system information collected over time. For this particular test, that subdirectory is remora_1503529969, in which I find a number of subdirectories with the raw data. Although you can parse the data in these subdirectories if you like, Remora also creates a web page (HTML) that plots the data for you and is the easiest way to get a quick glimpse of what happened during application execution. Just open the web page in your favorite browser (Figure 1).

Figure 1: Summary web page.
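If you would rather look at the raw data than the web page, a first step is simply to see which files Remora wrote. The sketch below walks the report directory and lists every file with its size. It makes no assumptions about the names or formats of the files inside; list_report_files is a hypothetical helper, not a Remora tool.

```python
import os


def list_report_files(report_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for all files."""
    entries = []
    for root, _dirs, files in os.walk(report_dir):
        for name in files:
            path = os.path.join(root, name)
            rel_path = os.path.relpath(path, report_dir)
            entries.append((rel_path, os.path.getsize(path)))
    return sorted(entries)


def print_report_files(report_dir):
    """Print one line per file: size in bytes, then the relative path."""
    for rel_path, size in list_report_files(report_dir):
        print(f"{size:>12}  {rel_path}")
```

Calling print_report_files with the directory reported after "Complete Report Data" in the summary gives a quick inventory before you decide which files to parse.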

The summary page lists the system metrics that Remora is capable of monitoring. A link below a metric means the corresponding data is available. Notice that for this simple case, only some of the metrics have been monitored. If you click the first link under "cpu utilization," you will see the plot in a new tab (Figure 2).

Figure 2: CPU utilization.

This laptop has only four cores, and Remora monitored all of them. Notice that the kernel moved the application from core 2 to core 1 (very briefly), and then to core 0 around 170-180 seconds into the run. The other cores run little besides system tasks.

The next obvious plot to examine is memory utilization (Figure 3), which includes the following:

  • TMEM (Max) : Maximum total memory (takes into account the memory not being used by the application, the libraries needed by the application, and the OS).
  • MEM (Free) : Free memory.
  • SHMEM : Shared memory (/dev/shm). Applications have access to shared memory by means of /dev/shm. Any file put there counts toward the memory used by the application.
  • RMEM : Resident memory – physical memory used by the application.
  • RMEM (Max) : Maximum resident memory.
  • VMEM : Virtual memory (important to watch if the OOM killer kicks in).
  • VMEM (Max) : Maximum virtual memory.

These memory metrics are gathered from /proc/[pid]/status and /dev/shm.

Figure 3: Memory utilization.
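To get a feel for where the resident and virtual memory numbers come from, you can read /proc/[pid]/status yourself. The Python sketch below is not Remora's implementation. It assumes that RMEM, RMEM (Max), VMEM, and VMEM (Max) correspond to the standard Linux fields VmRSS, VmHWM, VmSize, and VmPeak, which the kernel reports in kB, and all function names are illustrative.

```python
# Assumed mapping from the plot labels to /proc/[pid]/status fields.
METRIC_FIELDS = {
    "RMEM": "VmRSS",
    "RMEM (Max)": "VmHWM",
    "VMEM": "VmSize",
    "VMEM (Max)": "VmPeak",
}


def parse_status(text):
    """Parse the 'Key: value' lines of /proc/[pid]/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def memory_metrics_gb(status_text):
    """Return the mapped memory metrics in GB (kB divided by 1024 twice)."""
    fields = parse_status(status_text)
    metrics = {}
    for metric, field in METRIC_FIELDS.items():
        if field in fields:
            kb = int(fields[field].split()[0])  # value looks like "262144 kB"
            metrics[metric] = kb / (1024 * 1024)
    return metrics


def read_status(pid="self"):
    """Read /proc/<pid>/status (Linux only); 'self' is the current process."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()
```

On a Linux system, memory_metrics_gb(read_status(pid)) called every few seconds for the process ID of your application would give a crude version of the curves in Figure 3, without the free and shared memory values.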

Summary

Tools exist for in-depth application profiling and for system monitoring, but typically they are used by administrators or software developers. However, users are closest to their applications and know the specific problems that need to be solved, so putting tools into their hands can reap great results. Remora is a superb tool for users that helps them understand the resource usage of their applications. It is not a pure profiler, but a combination of profiling and system monitoring. Moreover, it's easy to install and fairly light on resource usage.

The Author

Jeff Layton has been in the HPC business for almost 25 years (starting when he was 4 years old). He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales.
