Darshan I/O analysis for Deep Learning frameworks
Looking and Seeing
The Darshan [1] userspace tool is often used for I/O profiling of HPC applications. It is broken into two parts: The first part, darshan-runtime, gathers, compresses, and stores the data. The second part, darshan-util, postprocesses the data.
Darshan gathers its data either by compile-time wrappers or dynamic library preloading. For message passing interface (MPI) applications, you can use the provided wrappers (Perl scripts) to create an instrumented binary. Darshan uses the MPI profiling interface of MPI applications for gathering information about I/O patterns. It does this by "… injecting additional libraries and options into the linker command line to intercept relevant I/O calls" [2] (section 5.1).
For MPI applications, you can also profile pre-compiled binaries by using the LD_PRELOAD environment variable to point to the Darshan shared library. This approach allows you to run uninstrumented binaries for which you don't have the source code (perhaps independent software vendor applications) or applications for which you don't want to rebuild the binary.
For non-MPI applications, you have to use the LD_PRELOAD environment variable and the Darshan shared library.
Deep Learning (DL) frameworks such as TensorFlow are becoming an increasingly big part of HPC workloads. Because one of the tenets of DL is using as much data as possible, understanding the I/O patterns of these applications is important. Terabyte datasets are quite common. In this article, I take Darshan, a tool based on HPC and MPI, and use it to examine the I/O pattern of TensorFlow on a small problem – one that I can run on my home workstation.
Installation
To build Darshan for non-MPI applications, you should set a few options when building the Darshan runtime (darshan-runtime). I used the autoconf command:
../configure --with-log-path=/home/laytonjb/darshan-logs --with-jobid-env=NONE --enable-mmap-logs --enable-group-readable-logs --without-mpi CC=gcc --prefix=[binary location]
Because I'm the only one using the system, I put the Darshan logs (the output from Darshan) in a directory in my home directory (/home/laytonjb/darshan-logs), and I installed the binaries into my /home directory; however, this is not a good idea on multiuser systems. The darshan-mk-log.pl script (run after installation, as described below) preps the environment and creates a directory hierarchy in the log directory (the one specified when configuring Darshan). The organization of the hierarchy is simple: the topmost directory is the year, below that the month, and below that the day.
After the usual make; make install, you should run the darshan-mk-log.pl command, which preps the log directory. For multiuser systems, you should read the documentation [2] (section 3.2). Next, I built the Darshan utilities (darshan-util) with the command:
./configure CC=gcc --prefix=[binary location]
Because I'm running these tests on an Ubuntu 20.04 system, I had to install some packages for the postprocessing (darshan-util) tools to work:
texlive-latex-extra libpod-latex-perl
Different distributions may require different packages. If you have trouble, the Darshan mailing list [3] is awesome. (You'll see my posts where I got some help when I was doing the postprocessing.)
Simple Example
Before jumping into Darshan with a DL example, I want to test a simple example, so I can get a feel for the postprocessing output. I grabbed an example from a previous article [4] (Listing 1). Although this example doesn't produce much I/O, I was curious to see whether Darshan could profile the I/O the application does create.
Listing 1
I/O Example Code
program ex1

   type rec
      integer :: x, y, z
      real :: value
   end type rec

   integer :: counter
   integer :: counter_limit
   integer :: ierr

   type(rec) :: my_record

   counter_limit = 2000

   ierr = -1
   open(unit=8, file="test.bin", status="replace", &
        action="readwrite", &
        iostat=ierr)
   if (ierr > 0) then
      write(*,*) "Error in opening file. Stopping."
      stop
   else
      do counter = 1, counter_limit
         my_record%x = counter
         my_record%y = counter + 1
         my_record%z = counter + 2
         my_record%value = counter * 10.0
         write(8,*) my_record
      end do
   end if

   close(8)

end program ex1
For this example, the command I used for Darshan to gather I/O statistics on the application was:
env LD_PRELOAD=/home/laytonjb/bin/darshan-3.3.1/lib/libdarshan.so ./ex1
I copied the Darshan file from the log location to a local directory and ran the Darshan utility darshan-job-summary.pl against the output file. The result is a PDF file that summarizes the I/O of the application.
Rather than include the entire PDF file in this article, I grabbed some of the plots and tables and present them here. Figure 1 shows a quick summary at the beginning of the output and says that it measured 0.1MiB of I/O. It also estimates the I/O rate at 544.77MiBps.
The left-hand chart presents the percentage of the run time for read, write, and metadata I/O and computation. For this case, it shows that the runtime is entirely dominated by computation. The read, write, and metadata bars are negligible and really can't even be seen.
The chart on the right presents counts for specific I/O operations: one open operation and 27 write operations. It also shows that all of the I/O is done by POSIX I/O functions [5].
The next snippet of the summary PDF is the histogram in Figure 2, a plot of the typical payloads for read and write function calls (i.e., how many calls transfer how much data per call). This chart shows that all of the write payloads are between 1 and 10KiB and that this example had no reads.
The job summary from Darshan also creates some very useful tables (Figure 3). The table on the left presents the most common payload sizes for POSIX I/O functions, with two access sizes, 4104 and 1296 bytes. The table on the right presents file-based I/O stats. Only one file was involved in this really simple example, and it was a write-only file. Note that the Darshan summary matches the source code; the one and only file was opened as write only. The table also shows that the average size of the file was 106KiB (the same as the maximum size).
Another group of useful tables is shown in Figure 4. The top table presents the cumulative time spent in reads and writes for both independent and shared operations. Don't forget that Darshan's origins are in MPI and HPC I/O, where shared files are common. It also presents information on how much I/O was performed for both reads and writes. Notice that the write time was really small (0.000127sec), and the I/O was also very small (0.103MiB).
Darshan also reports a very useful statistic: the amount of time spent on metadata. Focusing only on read and write performance is not quite enough to understand I/O. Metadata I/O can have a big effect, and separating it out from reads and writes is very useful.
The bottom table gives the total I/O for the various filesystems. This table is a bit more useful for HPC applications that use a scratch filesystem for I/O and a filesystem for storing the application binaries. Increasingly, DL applications are using this approach, so examining this table is useful.
The last snippet of the job summary I want to highlight is the number of read/write I/O operations (Figure 5). The figure shows only write operations, as expected. The total operations include all I/O functions.
Darshan with TensorFlow
Darshan has had a number of successes with MPI applications. For this article, I tried it on a TensorFlow workload, with Keras [6] loading the data, creating the model, training the model for 100 epochs, checkpointing after every epoch, and saving the final model.
The system I'll be using is my home workstation with a single Titan V card, a six-core AMD Ryzen CPU, and 32GB of memory. I plan to use the CIFAR-10 data [7] and use the training code from Jason Brownlee's Machine Learning Mastery website [8]. I'll start the training from the beginning (no pre-trained models) and run it for 100 epochs. I've updated the training script to checkpoint the model weights after every epoch to the same file (it just overwrites it).
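In Keras, this kind of per-epoch checkpointing is typically handled with a ModelCheckpoint callback. The snippet below is a minimal sketch rather than the author's exact script; the checkpoint.h5 filename, the batch size, and the function arguments are illustrative assumptions.

# Minimal sketch of per-epoch checkpointing with Keras (illustrative only).
from tensorflow.keras.callbacks import ModelCheckpoint

def train_with_checkpoints(model, trainX, trainY, testX, testY):
    # Overwrite the same weights file at the end of every epoch.
    ckpt = ModelCheckpoint("checkpoint.h5",
                           save_weights_only=True,
                           save_best_only=False,   # always overwrite
                           save_freq="epoch")
    return model.fit(trainX, trainY,
                     epochs=100, batch_size=64,
                     validation_data=(testX, testY),
                     callbacks=[ckpt])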
The code is written in Python and uses Keras as the interface to TensorFlow. Keras is great for defining models and training. I used the individual edition of Anaconda Python [9] for this training. The specific software versions I used were:
- Ubuntu 20.04
- Conda 4.10.3
- Python 3.8.10
- TensorFlow 2.4.1
- cudatoolkit 10.1.243
- System CUDA 11.3
- Nvidia driver 465.19.01
A summary of the model is shown in Table 1, with a total of six convolution layers, three max pooling layers, and a final flattening layer followed by a fully connected (dense) layer that connects to the final output layer for the 10 classes.
Table 1
Training Model
Layer (type) | Output Shape | No. of Parameters
---|---|---
conv2d (Conv2D) | (None, 32, 32, 32) | 896
conv2d_1 (Conv2D) | (None, 32, 32, 32) | 9,248
max_pooling2d (MaxPooling2D) | (None, 16, 16, 32) | 0
conv2d_2 (Conv2D) | (None, 16, 16, 64) | 18,496
conv2d_3 (Conv2D) | (None, 16, 16, 64) | 36,928
max_pooling2d_1 (MaxPooling2D) | (None, 8, 8, 64) | 0
conv2d_4 (Conv2D) | (None, 8, 8, 128) | 73,856
conv2d_5 (Conv2D) | (None, 8, 8, 128) | 147,584
max_pooling2d_2 (MaxPooling2D) | (None, 4, 4, 128) | 0
flatten (Flatten) | (None, 2048) | 0
dense (Dense) | (None, 128) | 262,272
dense_1 (Dense) | (None, 10) | 1,290
Total parameters | 550,570 |
Trainable parameters | 550,570 |
Non-trainable parameters | 0 |
The first two convolutional layers have 32 filters each, the next two have 64 filters each, and the final two have 128 filters each. A max pooling layer with a 2x2 filter follows every two convolution layers. The fully connected layer has 128 neurons. The total number of parameters is 550,570 (a very small model).
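For reference, a Keras definition along the following lines reproduces the layer stack and parameter counts in Table 1 (a sketch; the activation choices are assumptions, not necessarily those of the original training script).

# Sequential model matching the shapes and parameter counts in Table 1.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", padding="same",
           input_shape=(32, 32, 3)),
    Conv2D(32, (3, 3), activation="relu", padding="same"),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu", padding="same"),
    Conv2D(64, (3, 3), activation="relu", padding="same"),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation="relu", padding="same"),
    Conv2D(128, (3, 3), activation="relu", padding="same"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),
])
model.summary()   # prints the layer/shape/parameter table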
The command to run the training script with Darshan is shown in Listing 2. Running the training script takes several lines, so I just put them into a Bash script and run that.
Listing 2
Training Script
export DARSHAN_EXCLUDE_DIRS=/proc,/etc,/dev,/sys,/snap,/run,/user,/lib,/bin,/home/laytonjb/anaconda3/lib/python3.8,/home/laytonjb/bin,/tmp
export DARSHAN_MODMEM=20000
env LD_PRELOAD=/home/laytonjb/bin/darshan-3.3.1/lib/libdarshan.so python3 cifar10-4.checkpoint.py
When the training script is run, Python starts and loads its modules, which causes a large number of Python modules to be converted (compiled) to byte code (.pyc files). Darshan can currently only monitor 1,024 files during the application run, and running the training script exceeded this limit. Because most of the files being compiled were Python modules, the Python directory (/home/laytonjb/anaconda3/lib/python3.8) had to be excluded from the Darshan analysis. Other directories were excluded, too, because they don't contribute much to the overall I/O and could cause Darshan to exceed the 1,024-file limit. The first line of the script excludes these directories.
The second line in the script increases the amount of memory the Darshan instrumentation modules can collectively use at runtime. By default, the limit is 2MiB, but I allowed 20,000MiB (about 19.5GiB) to make sure I gathered all of the I/O data.
Remember that the Darshan runtime just collects the I/O information during the run. It does not calculate any statistics or create a summary. After it collects the information and the application is finished, the Darshan utilities can be run. I used the darshan-job-summary.pl tool to create a PDF summary of the analysis.
The top of the PDF file (Figure 6) gives you some quick highlights of the analysis. The very top line in the output says that one processor was used, and it took 1,404sec for the application to complete. You can also see that the POSIX interface (the POSIX I/O functions) transferred 770.9MiB of data at 416.94MiBps. The STDIO interface (STDIO I/O functions) transferred 0.0MiB at 33.46MiBps. The amount of data transferred through the STDIO interface is so small that it rounds to 0.0MiB at the one decimal place shown in the output.
The first snippet of the job summary is shown in Figure 7. The top left-hand chart shows that virtually all of the POSIX and STDIO runtime is for computation. You can see a little bit of I/O at the very bottom of the POSIX bar (the left bar in the chart).
The top right-hand chart shows very little STDIO (green bars). The rest is POSIX I/O, dominated by writes. Recall that Darshan was run with any I/O in the directory /home/laytonjb/anaconda3/lib/python3.8 (the location of Python) excluded.
When Python converts (compiles) the source, it reads the code and writes the byte code (PYC extension). This I/O is not captured by Darshan. One can argue whether this is appropriate or not because this conversion is part of the total training runtime. On the other hand, excluding that I/O focuses the Darshan I/O analysis on the training and not on Python.
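If you want to see this conversion in isolation, the standard library's py_compile module triggers exactly this read-the-source, write-the-byte-code I/O. The snippet below is purely illustrative and uses a made-up module name.

# Illustrative only: byte-compiling a module reads the .py source and
# writes a .pyc file, which is the kind of I/O excluded from this analysis.
import pathlib
import py_compile

pathlib.Path("demo_module.py").write_text("x = 1\n")   # tiny source file
pyc_path = py_compile.compile("demo_module.py")        # writes the .pyc
print("byte code written to", pyc_path)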
About 54,000 write operations occurred during training. (Recall that a checkpoint is written after every epoch.) About 6,000 read operations occurred during training, which seems counterintuitive because DL training involves repeatedly going through the dataset in a different order for each epoch. Two things explain this: (1) TensorFlow has an efficient data interface that minimizes read operations, and (2) all of the data fits into GPU memory. Therefore, you don't see as many read operations as you would expect.
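A quick check shows why so few reads are needed: the entire dataset is read from disk once and then lives in memory as NumPy arrays. The sketch below assumes the standard copy of CIFAR-10 shipped with Keras (the first call downloads the dataset if it is not already cached).

# Load CIFAR-10 once (the only significant read I/O) and confirm that the
# whole dataset easily fits in memory, so no per-epoch reads are needed.
from tensorflow.keras.datasets import cifar10

(trainX, trainY), (testX, testY) = cifar10.load_data()
total_bytes = sum(a.nbytes for a in (trainX, trainY, testX, testY))
print(f"CIFAR-10 resident size: {total_bytes / 2**20:.1f} MiB")  # roughly 176MiB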
A few metadata operations take place during the training. The top-right chart records a very small number of open and stat operations, as well as a number of lseek operations (perhaps 7,500). Although there doesn't appear to be a noticeable number of mmap or fsync operations, if Darshan puts them in the chart, that means a non-zero number of each occur.
The bottom chart in Figure 7 provides a breakdown of read and write operations for the POSIX I/O functions. The vast majority of write operations are in the range of 0-100 bytes. A much smaller number of write operations fall into the larger ranges: some in the 101 bytes to 1KiB and 1-10KiB ranges, and a very few in the 10-100KiB and 100KiB to 1MiB ranges.
A small number of read operations are captured in that chart as well. Most read operations appear to be in the 1-10KiB range, with a few in the 10-100KiB range.
Notice that for both reads and writes, the data per operation is small. A rule of thumb for good I/O performance is to use the largest possible read or write operations, preferably in the mebibyte range or greater. Most of the operations for this DL training are in the 1-100 byte range, which is very small.
Figure 8 presents two tables. The left-hand table presents the most common access sizes for POSIX operations; for reads and writes, this is the amount of data moved per I/O function call. The right-hand table presents data on the files used during the run.
The left-hand table shows that the vast majority of POSIX operations, predominantly writes with a small number of reads, are in the 84-86 byte range. These account for 46,881 of the roughly 50,000 read and write operations (~94%). This table reinforces the observations from the charts in Figure 7.
The right-hand table shows that 14 files were opened, of which 10 were read-only and three were write-only. The average size of the files was 25MiB.
The next two tables (Figure 9) present more information about the amount of I/O. The top table first shows the amount of time spent doing reads, writes, and metadata. For the DL training problem, 0.063sec was spent on reads, 0.44sec on writes, and 1.35sec on metadata (non-read and -write I/O operations), or a total of 1.8491sec spent on some type of I/O out of 1,404sec of runtime (0.13%), which illustrates that this problem is virtually 100 percent dominated by computation.
The last column of the top table also presents the total amount of I/O. The read I/O operations, although only a very small number, total 340MiB. The write operations, which dominated the reads, total only 430MiB. These results are very interesting considering the roughly nine write I/O operations for every read operation.
Finally, the DL training has no shared I/O, so you don't see any shared reads, writes, or metadata in the top table. Darshan's origins are in the HPC and MPI world, where shared I/O operations are common, so it will present this information in the summary.
The bottom table presents how much I/O was done for the various filesystems. A very small amount of I/O is attributed to UNKNOWN (0.887%), but /home (except for /home/laytonjb/anaconda3, which was excluded from the analysis) had 99.1 percent of the write I/O and 100 percent of the read I/O.
The final snippet of Darshan output is in Figure 10, where the I/O operation sequences are presented. The chart shows roughly 54,000 total write and perhaps 6,000 total read operations. For the write I/O, most were sequential (about 52,000), with about 47,000 consecutive operations.