Resource monitoring for remote applications
Natural Resources
Using Remora
Remora is very simple to use: Just prepend it to your original command. For example, a simple command line for application ./myapp.exe would become:
$ remora ./myapp.exe
In the case of MPI code, a command line would be something like
$ remora mpirun ... ./mpiapp.exe
if the original command was mpirun … ./mpiapp.exe. Notice that both commands are run as a regular user, not root, which goes back to the design of Remora: a focus on users and providing them with useful information.
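The "prepend" idea can be sketched in Python for cases where jobs are launched from a script. This is only an illustration: the helper name with_remora is hypothetical and not part of Remora, and it assumes a remora executable is on the PATH. The mpirun options elided in the text above are left out rather than guessed.

```python
import shlex


def with_remora(command):
    """Return the argument list for `command` with remora prepended.

    `command` may be a shell-style string or a list of arguments.
    Hypothetical helper for illustration; not part of Remora itself.
    """
    args = shlex.split(command) if isinstance(command, str) else list(command)
    if not args:
        raise ValueError("empty command")
    return ["remora"] + args


# Serial application from the text.
print(shlex.join(with_remora("./myapp.exe")))
# MPI application; the elided mpirun options are deliberately omitted.
print(shlex.join(with_remora(["mpirun", "./mpiapp.exe"])))
```

The resulting list could be handed to subprocess.run() from an ordinary user account; no root privileges are involved.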
This next example is Fortran 90 code for a simple serial Poisson solver for a rectangular grid (poisson_serial.f90) [15] [16]. Remora captures data every 10 seconds by default, so you need to adjust a few application parameters in the Fortran program for a longer run time:
nx = 8000
ny = 8000
it_max = 10000
tolerance = 0.00004D+00
The code was compiled using GCC 7.1 and run on a four-core AMD A6-6310 laptop (Lenovo-G50-45). The output from the code and a summary from Remora are shown in Listing 1. Notice that it gives you the maximum memory used per node, as well as the run time of the application and the sampling time. It also lists the directory with the Remora output.
Listing 1: poisson_serial.f90 Output
[laytonjb@laytonjb REMORA_TEST]$ remora ./poisson_serial
23 August 2017 7:12:50.609 PM
POISSON_SERIAL:
  FORTRAN90 version
  A program for solving the Poisson equation.
  -DEL^2 U = F(X,Y)
  on the rectangle 0 <= X <= 1, 0 <= Y <= 1.
  F(X,Y) = pi^2 * ( x^2 + y^2 ) * sin ( pi * x * y )
  The number of interior X grid points is 8000
  The number of interior Y grid points is 8000
  The X grid spacing is 0.0001
  The Y grid spacing is 0.0001
  RMS of F = 5.99663
  RMS of exact solution = 0.622184
  Step  ||Unew||      ||Unew-U||    ||Unew-Exact||
     0  0.111796E-01                0.622083
     1  0.115237E-01  0.279491E-02  0.622039
     2  0.119603E-01  0.156240E-02  0.622010
     3  0.123543E-01  0.113207E-02  0.621986
     4  0.127060E-01  0.904517E-03  0.621966
     5  0.130230E-01  0.761265E-03  0.621948
     6  0.133121E-01  0.661767E-03  0.621931
     7  0.135782E-01  0.588130E-03  0.621916
     8  0.138253E-01  0.531152E-03  0.621901
     9  0.140562E-01  0.485586E-03  0.621888
    10  0.142734E-01  0.448208E-03  0.621875
  ...
   246  0.266937E-01  0.402086E-04  0.620868
   247  0.267182E-01  0.400864E-04  0.620866
   248  0.267427E-01  0.399651E-04  0.620863
  The iteration has converged,
POISSON_SERIAL:
  Normal end of execution.
23 August 2017 7:21:31.215 PM
==================== REMORA SUMMARY ====================
Max Memory Used Per Node : 31.55 GB
*** REMORA: WARNING - Free memory per node close to zero.
Total Elapsed Time : 0d 0h 8m 40s 632ms
========================================================
Sampling Period : 10 seconds
Complete Report Data : /home/laytonjb/REMORA_TEST/remora_1503529969
Graphical Results At : /home/laytonjb/REMORA_TEST/remora_1503529969/reora_summary.html
========================================================
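To relate the elapsed time to the sampling period, the two values in the Remora summary can be turned into a rough count of samples. The following Python sketch parses only the summary lines shown in Listing 1; the function names are hypothetical, and the estimate assumes one sample per full period, which may differ slightly from how Remora actually schedules its collection.

```python
import re

# Summary lines copied from Listing 1.
SUMMARY = """\
Max Memory Used Per Node : 31.55 GB
Total Elapsed Time : 0d 0h 8m 40s 632ms
Sampling Period : 10 seconds
"""


def elapsed_seconds(text):
    """Convert the 'Total Elapsed Time' line to seconds."""
    match = re.search(
        r"Total Elapsed Time\s*:\s*(\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s\s+(\d+)ms",
        text,
    )
    if match is None:
        raise ValueError("no elapsed time found")
    days, hours, minutes, seconds, millis = map(int, match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds + millis / 1000


def sampling_period(text):
    """Return the sampling period in seconds."""
    match = re.search(r"Sampling Period\s*:\s*(\d+)\s+seconds", text)
    if match is None:
        raise ValueError("no sampling period found")
    return int(match.group(1))


def expected_samples(text):
    """Estimate the number of complete sampling periods in the run."""
    return int(elapsed_seconds(text) // sampling_period(text))


print(elapsed_seconds(SUMMARY))   # about 520.6 seconds
print(expected_samples(SUMMARY))  # 52 full 10-second periods
```

With roughly 52 data points, the plots have enough resolution to show trends, which is why the problem size was increased for this test.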
Remora creates a subdirectory to contain the system information over time. For this particular test, that subdirectory is remora_1503529969, in which I find a number of subdirectories with the raw data. Although you can parse the data in these subdirectories if you like, Remora creates a web page (HTML) that plots the data for you and is the easiest way to get a quick glimpse of what happened during application execution. Just open the web page in your favorite browser (Figure 1).
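If you do want to look at the raw data, a small Python sketch like the following can serve as a starting point. It makes one assumption that the article does not state: that the numeric suffix of the directory name is a Unix timestamp. Read that way, 1503529969 corresponds to 23:12:49 UTC on 23 August 2017, which would match the 7:12:50 PM start in Listing 1 if the laptop was set to US Eastern time. The function names are hypothetical, and nothing is assumed about the layout inside the directory; the second function simply lists whatever files are there.

```python
import os
from datetime import datetime, timezone


def report_start_time(report_dir):
    """Interpret remora_<number> as Unix epoch seconds (an assumption)."""
    name = os.path.basename(os.path.normpath(report_dir))
    prefix, _, suffix = name.partition("_")
    if prefix != "remora" or not suffix.isdigit():
        raise ValueError(f"unexpected directory name: {name!r}")
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


def list_report_files(report_dir):
    """Return all files below report_dir as sorted relative paths."""
    found = []
    for root, _dirs, files in os.walk(report_dir):
        for fname in files:
            full_path = os.path.join(root, fname)
            found.append(os.path.relpath(full_path, report_dir))
    return sorted(found)


start = report_start_time("/home/laytonjb/REMORA_TEST/remora_1503529969")
print(start.isoformat())  # 2017-08-23T23:12:49+00:00
```

Calling list_report_files() on the report directory gives a quick inventory of the raw data files before you decide which ones to parse.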
The summary page lists the system metrics that Remora is capable of monitoring. A link below a metric means the corresponding data is available. Notice that for this simple case, only some of the metrics have been monitored. If you click the first link under "cpu utilization," you will see the plot in a new tab (Figure 2).
This laptop only has four cores, and Remora monitored all of them. Notice that the kernel moved the application from core 2 to core 1 (very briefly), and then to core 0 around 170-180 seconds into the run. The other cores don't run much of anything except system tasks.
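To get a feel for where per-core numbers like these come from, the following Python sketch computes per-core utilization from two snapshots of /proc/stat on Linux. This is not Remora's collector, whose internals are not described here; it is a generic illustration, and the function names are hypothetical. Utilization is taken as the non-idle share of the time elapsed between the two snapshots.

```python
import time


def parse_proc_stat(text):
    """Map core number -> (total jiffies, idle jiffies) from /proc/stat text."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if parts and parts[0].startswith("cpu") and parts[0][3:].isdigit():
            # user, nice, system, idle, iowait, irq, softirq, steal
            values = [int(v) for v in parts[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            cores[int(parts[0][3:])] = (sum(values), idle)
    return cores


def utilization(before, after):
    """Return percent busy per core between two parsed snapshots."""
    result = {}
    for core, (total1, idle1) in after.items():
        total0, idle0 = before[core]
        elapsed = total1 - total0
        busy = elapsed - (idle1 - idle0)
        result[core] = 100.0 * busy / elapsed if elapsed > 0 else 0.0
    return result


def sample_live(interval=1.0):
    """Take two live snapshots `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as handle:
        first = parse_proc_stat(handle.read())
    time.sleep(interval)
    with open("/proc/stat") as handle:
        second = parse_proc_stat(handle.read())
    return utilization(first, second)
```

Calling sample_live() repeatedly while a serial job runs would show one core near 100 percent and, when the kernel migrates the process, the load moving to another core, much like the pattern in Figure 2.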
The next obvious plot to examine is memory utilization (Figure 3), which includes the following:
- TMEM (Max) : Maximum total memory (takes into account the memory not being used by the application, the libraries needed by the application, and the OS).
- MEM (Free) : Free memory.
- SHMEM : Shared memory (/dev/shm). Applications have access to shared memory by means of /dev/shm. Any file put there counts toward the memory used by the application.
- RMEM : Resident memory – physical memory used by the application.
- RMEM (Max) : Maximum resident memory.
- VMEM : Virtual memory (important to watch if the OOM killer kicks in).
- VMEM (Max) : Maximum virtual memory.
These memory metrics are gathered from /proc/[pid]/status and /dev/shm.
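As an illustration of what these two sources provide, the Python sketch below reads the Vm* fields from a /proc/[pid]/status file and the space used in /dev/shm. The mapping from the metric names in the list above to kernel field names (VmRSS, VmHWM, VmSize, VmPeak) is my assumption, based on what those kernel fields mean, not something taken from Remora's code. The function names are hypothetical as well.

```python
import os
import shutil

# Assumed correspondence between the plotted metrics and kernel fields.
ASSUMED_FIELDS = {
    "RMEM": "VmRSS",        # current resident set size
    "RMEM (Max)": "VmHWM",  # peak resident set size
    "VMEM": "VmSize",       # current virtual memory size
    "VMEM (Max)": "VmPeak", # peak virtual memory size
}


def parse_status(text):
    """Return the Vm* fields of a /proc/[pid]/status file in kB."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        parts = value.split()
        if sep and key.startswith("Vm") and len(parts) == 2 and parts[1] == "kB":
            fields[key] = int(parts[0])
    return fields


def memory_metrics_gib(status_text):
    """Convert the assumed fields from kB to GiB."""
    fields = parse_status(status_text)
    return {
        label: fields[key] / 1024**2
        for label, key in ASSUMED_FIELDS.items()
        if key in fields
    }


def shm_used_gib(path="/dev/shm"):
    """Return the space currently used in the shared memory filesystem."""
    return shutil.disk_usage(path).used / 1024**3


# Try it on the current process when running on Linux.
if os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as handle:
        print(memory_metrics_gib(handle.read()))
if os.path.isdir("/dev/shm"):
    print(shm_used_gib())
```

Replacing "self" with the PID of a running application gives a one-off reading; sampling it periodically produces the kind of time series plotted in Figure 3.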
Summary
There are tools to do in-depth application profiling and there are tools to do system monitoring, but typically these tools are used by administrators or software developers. However, users are closest to their applications and know the specific problems that need to be solved, so putting tools into their hands can reap great results. Remora is a superb tool for users that will help them get an idea about the resource usage of their application. It's not pure profiling, but rather a combination of profiling and system monitoring. Moreover, it's easy to install, fairly light on resource usage, and can be a great help to users.
Infos
- HPC monitoring articles: http://www.admin-magazine.com/content/search?SearchText=Layton+monitoring&x=0&y=0
- HPC profiling articles: http://www.admin-magazine.com/content/search?SearchText=Layton+profiling&x=0&y=0
- MPI articles: http://www.admin-magazine.com/content/search?SearchText=Layton+MPI&x=0&y=0
- Remora: https://github.com/TACC/remora
- TACC: https://www.tacc.utexas.edu
- numastat: http://man7.org/linux/man-pages/man8/numastat.8.html
- mpstat: https://linoxide.com/linux-command/linux-mpstat-command/
- nvidia-smi: https://developer.nvidia.com/nvidia-system-management-interface
- ibtracert: https://linux.die.net/man/8/ibtracert
- ibstatus: http://manpages.ubuntu.com/manpages/trusty/man8/ibstatus.8.html
- xltop: https://github.com/jhammond/xltop
- Python: https://www.python.org/
- Lustre: http://lustre.org/
- Lmod: https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
- Poisson solver: http://people.sc.fsu.edu/~jburkardt/f_src/poisson_serial/poisson_serial.html
- Code for this article: ftp://ftp.linux-magazine.com/pub/listings/admin-magazine.com/41/