Darshan I/O analysis for Deep Learning frameworks
Looking and Seeing
Summary
Relatively little work has been done to characterize or understand the I/O patterns of deep learning (DL) frameworks. In this article, Darshan, a widely used I/O characterization tool rooted in the HPC and MPI world, was used to examine the I/O pattern of TensorFlow training a simple model on the CIFAR-10 dataset.
Deep learning frameworks that use Python for training open a large number of files as part of Python and TensorFlow startup, and Darshan can currently only accommodate 1,024 files per process. As a result, the Python installation directory had to be excluded from the analysis. That exclusion could be a good thing, because it lets Darshan focus on the training itself, but it also means Darshan can't capture all of the I/O performed while running the training script.
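For reference, that exclusion is applied through Darshan's runtime environment variables. The minimal sketch below launches the training script from Python with the Darshan library preloaded, non-MPI instrumentation enabled, and the Anaconda installation excluded; the library path, Anaconda directory, and script name are placeholders that depend on your installation.

```python
import os
import subprocess

# Copy the current environment and add the Darshan runtime settings.
env = dict(os.environ)

# Path to the Darshan runtime library (placeholder -- adjust to your install).
env["LD_PRELOAD"] = "/opt/darshan/lib/libdarshan.so"

# Instrument a plain (non-MPI) Python process.
env["DARSHAN_ENABLE_NONMPI"] = "1"

# Skip files opened from the Python installation itself, which helps
# keep the run under Darshan's current 1,024-file limit.
env["DARSHAN_EXCLUDE_DIRS"] = "/home/user/anaconda3"

# Launch the CIFAR-10 training script (name is a placeholder).
subprocess.run(["python3", "cifar10_train.py"], env=env, check=True)
```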
With the simple CIFAR-10 training script, not much I/O took place overall. The dataset isn't large, so it fits in GPU memory, and the overall runtime was dominated by compute. The small amount of I/O that was performed was almost all writes, most likely the checkpoints written after every epoch.
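The article's training script isn't reproduced here, but a minimal Keras sketch shows where those per-epoch writes come from; the model, file names, and epoch count are stand-ins rather than the exact code used for the measurements.

```python
import os
import tensorflow as tf

# CIFAR-10 is small enough to hold in memory (and in GPU memory).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A deliberately tiny CNN as a stand-in for the article's model.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Writing a checkpoint at the end of every epoch accounts for most of
# the write I/O that Darshan records during training.
os.makedirs("checkpoints", exist_ok=True)
ckpt = tf.keras.callbacks.ModelCheckpoint(
    "checkpoints/epoch-{epoch:02d}.weights.h5",
    save_weights_only=True,
    save_freq="epoch")

model.fit(x_train, y_train, epochs=10,
          validation_data=(x_test, y_test),
          callbacks=[ckpt])
```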
I tried larger problems, but reading the data, even when it fit in GPU memory, exceeded the current 1,024-file limit. Still, the current version of Darshan has shown that it can be used for I/O characterization of DL frameworks, albeit for small problems.
The developers of Darshan are working on updates to break the 1,024-file limit. Python-based postprocessing of Darshan logs already exists, and the developers are rapidly improving that capability as well. Both developments will greatly help the DL community use Darshan.
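As an example of that Python postprocessing, the PyDarshan package can read a log directly. Because the interface is still changing quickly, treat the following as a sketch against a recent release, with the log file name as a placeholder.

```python
import darshan  # PyDarshan: pip install darshan

# Open a log produced by the instrumented run (file name is a placeholder).
report = darshan.DarshanReport("python_cifar10.darshan", read_all=True)

# Sum a few POSIX counters across all records to get a quick
# picture of how much reading and writing actually happened.
posix = report.records["POSIX"].to_df()
totals = posix["counters"][["POSIX_READS", "POSIX_WRITES",
                            "POSIX_BYTES_READ", "POSIX_BYTES_WRITTEN"]].sum()
print(totals)
```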
Infos
- Darshan: https://www.mcs.anl.gov/research/projects/darshan/
- Documentation for multiuser systems: https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_environment_preparation
- Darshan mailing list: https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
- "Understanding I/O Patterns with strace, Part II" by Jeff Layton, https://www.admin-magazine.com/HPC/Articles/Tuning-I-O-Patterns-in-Fortran-90
- POSIX I/O functions: https://www.mkompf.com/cplus/posixlist.html
- Keras: https://keras.io/
- CIFAR-10 data: https://www.cs.toronto.edu/~kriz/cifar.html
- "How to Develop a CNN From Scratch for CIFAR-10 Photo Classification" by Jason Brownlee, accessed July 15, 2021: https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-cifar-10-photo-classification/
- Anaconda Python: https://www.anaconda.com/products/individual