Simple HDF5 in Python and Fortran
Using the popular HDF5 I/O library with Python and Fortran.
HDF5 is one of the most popular I/O libraries in HPC. It organizes data in a familiar filesystem-like hierarchy; it is flexible, self-describing, and portable across operating systems and hardware; it can store text and binary data; it can be used by parallel (MPI) applications; it has interfaces for a large number of languages; and it is fairly easy to use.
In a previous article, I introduced HDF5, focusing on the concepts and strengths. In this article, I want to give a quick introduction to HDF5 through some simple code examples. The goal is not to dive deep into HDF5 but to illustrate the basics of using it. I'll start with Python because it is a widely used language and the h5py Python library is easy to use and easy to understand. I also want to illustrate how to use HDF5 from a compiled language, so I use Fortran to show how a compiled program works with HDF5.
h5py Python Library
H5py is the dominant Python interface to HDF5. It is included with many Python distributions and with most Linux distributions. For the examples here, I use the Anaconda Python distribution for Python 2.7.
The examples I use in this article are fairly simple and are derived from the Quick Start page on the h5py website. The first example simply illustrates a few concepts, such as:
- Opening an HDF5 file for writing
- Creating data sets
- Creating groups
The simple Python script in Listing 1 incorporates these concepts.
Listing 1: Starting Out with h5py
01 #!/home/laytonjb/anaconda2/bin/python
02 
03 import h5py
04 import numpy as np
05 
06 # ===================
07 # Main Python section
08 # ===================
09 #
10 if __name__ == '__main__':
11 
12    f = h5py.File("mytestfile.hdf5", "w")
13 
14    dset = f.create_dataset("mydataset", (100,), dtype='i')
15 
16    dset[...] = np.arange(100)
17 
18    print "dset.shape = ",dset.shape
19 
20    print "dset.dtype = ",dset.dtype
21 
22    print "dset.name = ",dset.name
23 
24    print "f.name = ",f.name
25 
26    grp = f.create_group("subgroup")
27 
28    dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
29    print "dset2.name = ",dset2.name
30 
31    dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')
32    print "dset3.name = ",dset3.name
33 
34 # end if
The first h5py command is on line 12, which opens a file for writing. If the file exists, it is overwritten (truncated); if it doesn't exist, it is created. Remember that HDF5 is really a container for data objects. When you create a file, the library creates a number of defaults, such as the root group (/). Therefore, the file will be non-zero in size even if no data or attributes have been written into it.
After the file is opened and created, a dataset of 100 integers is created (mydataset, line 14). At this point, only the dataset object (its dataspace) exists in the file; no data has been written yet. Line 16 puts data into the dataset from a NumPy array. Notice that you simply assign data to the object, and the h5py library takes care of updating the HDF5 file. You can modify the data in the file the same way.
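For example, a minimal sketch of reopening the file and changing part of the dataset in place might look like the following (it assumes mytestfile.hdf5 was already created by Listing 1):

#!/home/laytonjb/anaconda2/bin/python

import h5py
import numpy as np

# Open the existing file in read/write mode ("r+" fails if the file is missing)
f = h5py.File("mytestfile.hdf5", "r+")

dset = f["mydataset"]           # look up the existing dataset by name
dset[0:10] = np.arange(10, 20)  # overwrite the first ten elements in the file

print "first ten values = ", dset[0:10]

f.close()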
Recall that in Python, almost everything is an object, so it has properties. Lines 18, 20, 22, and 24 print out some of the properties of the HDF5 file (line 24) as well as the first data set (lines 18, 20, and 22). Because HDF5 is object based, it fits well with the object nature of Python.
On line 26, a subgroup of the root group, named subgroup, is created; then, on line 28, a new dataset of 50 floats is created inside this subgroup. Notice that a method of the group object, not of the file object, is used for this.
On line 31, another dataset is created, this time by giving a path that places it in a new subgroup named subgroup2. H5py automatically creates the subgroup if it doesn't exist.
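Because groups behave much like Python dictionaries, you can check that the intermediate group really was created and retrieve objects by path. A short sketch (again assuming the file from Listing 1):

import h5py

f = h5py.File("mytestfile.hdf5", "r")

print "subgroup2" in f                 # True: the group was created implicitly
print f["subgroup2/dataset_three"]     # look up the dataset by its full path
print f["subgroup2"]["dataset_three"]  # or by indexing the group first

f.close()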
The output from this example Python script is shown below:
[laytonjb@laytonjb-Lenovo-G50-45 PYTHON]$ ./test.py
dset.shape =  (100,)
dset.dtype =  int32
dset.name =  /mydataset
f.name =  /
dset2.name =  /subgroup/another_dataset
dset3.name =  /subgroup2/dataset_three
Notice the size of the integers: the dtype='i' specification maps to the 32-bit NumPy integer type (int32).
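If you want a different width, you can pass an explicit NumPy dtype when creating the dataset. A small sketch (the dataset name mydataset64 is just an example):

import h5py
import numpy as np

f = h5py.File("mytestfile.hdf5", "a")   # append mode: keep the existing contents

# Request 64-bit integers explicitly instead of the default 'i' (int32)
dset64 = f.create_dataset("mydataset64", (100,), dtype=np.int64)
dset64[...] = np.arange(100)

print "dset64.dtype = ", dset64.dtype   # prints int64

f.close()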
Another short Python script reads the HDF5 file and outputs the names of the objects it contains. This can be done fairly easily with the h5py visit function, which recursively walks the HDF5 file so you can discover the objects in it, including groups and datasets, and print their names. Listing 2 is a simple script that walks the HDF5 file and prints the object names.
Listing 2: Walking the HDF5 File
01 #!/home/laytonjb/anaconda2/bin/python
02 
03 import h5py
04 import numpy as np
05 
06 def printname(name):
07    print name
08 
09 # ===================
10 # Main Python section
11 # ===================
12 #
13 if __name__ == '__main__':
14 
15    f = h5py.File("mytestfile.hdf5", "r")
16 
17    for name in f:
18       print name
19    # end for
20 
21    f.visit(printname)
22 
23 # end if
The output from the script is below:
[laytonjb@laytonjb-Lenovo-G50-45 PYTHON]$ ./test2.py
mydataset
subgroup
subgroup2
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three
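The visit function passes only the object names to the callback. If you also want to know whether each name refers to a group or a dataset, the related visititems method passes the object itself; a minimal sketch:

import h5py

def printinfo(name, obj):
   # obj is the h5py object itself, so its type can be inspected
   if isinstance(obj, h5py.Dataset):
      print name, "-> dataset, shape", obj.shape, ", dtype", obj.dtype
   else:
      print name, "-> group"

f = h5py.File("mytestfile.hdf5", "r")
f.visititems(printinfo)
f.close()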
You can find more information in the HDF5 documentation. The Quick Start guide also has more examples of accessing HDF5 files from Python.
Fortran and HDF5
H5py is a Python-centric library that allows HDF5 to be used in a very flexible manner. Compiled languages are a little different: using HDF5 from them is not quite as easy as from Python, but it is not difficult. The HDF5 developers provide a set of functions and subroutines for manipulating data and objects in an HDF5 file that make programming fairly straightforward.
For this article, I used CentOS 7.3 with the default Fortran compiler (gfortran) and the HDF5 library that is part of the distribution. Building a Fortran executable against this library is not difficult; the generic command line below illustrates how,
$ gfortran code.f90 -fintrinsic-modules-path /usr/lib64/gfortran/modules \
  -lhdf5_fortran -o exe
where code.f90 is the source file and exe is the resultant binary.
The HDF Group has provided some sample Fortran 90 code to get started, as well as more complex examples. Starting from these examples, Listing 3 shows a Fortran 90 version of the first sample Python code.
Listing 3: Sample Fortran 90 Code
001 PROGRAM TEST
002 
003   USE HDF5   ! This module contains all necessary HDF5 modules
004 
005   IMPLICIT NONE
006 
007   ! Names (file and HDF5 objects)
008   CHARACTER(LEN=15), PARAMETER :: filename = "mytestfile.hdf5"   ! File name
009   CHARACTER(LEN=9), PARAMETER :: dsetname1 = "mydataset"         ! Dataset name
010   CHARACTER(LEN=8), PARAMETER :: groupname = "subgroup"          ! Sub-Group 1 name
011   CHARACTER(LEN=9), PARAMETER :: groupname3 = "subgroup2"        ! Sub-Group 3 name
012   ! Dataset 2 name
013   CHARACTER(LEN=24), PARAMETER :: dsetname2 = "subgroup/another_dataset"
014   ! Dataset 3 name
015   CHARACTER(LEN=23), PARAMETER :: dsetname3 = "subgroup2/dataset_three"
016 
017   ! Identifiers
018   INTEGER(HID_T) :: file_id      ! File identifier
019   INTEGER(HID_T) :: group_id     ! Group identifier
020   INTEGER(HID_T) :: group3_id    ! Group 3 identifier
021   INTEGER(HID_T) :: dset1_id     ! Dataset 1 identifier
022   INTEGER(HID_T) :: dset2_id     ! Dataset 2 identifier
023   INTEGER(HID_T) :: dset3_id     ! Dataset 3 identifier
024   INTEGER(HID_T) :: dspace1_id   ! Dataspace 1 identifier
025   INTEGER(HID_T) :: dspace2_id   ! Dataspace 2 identifier
026   INTEGER(HID_T) :: dspace3_id   ! Dataspace 3 identifier
027 
028   ! Integer array
029   INTEGER :: rank                                     ! Dataset rank
030   INTEGER(HSIZE_T), DIMENSION(1) :: dims1 = (/100/)   ! Dataset dimensions
031   INTEGER(HSIZE_T), DIMENSION(1) :: data_dims1
032   INTEGER, DIMENSION(100) :: dset_data1               ! Data buffers
033 
034   ! FP array
035   INTEGER(HSIZE_T), DIMENSION(1) :: dims2 = (/50/)
036   INTEGER(HSIZE_T), DIMENSION(1) :: data_dims2
037   REAL, DIMENSION(50) :: dset_data2
038 
039   ! Integer array for dataset_three
040   INTEGER(HSIZE_T), DIMENSION(1) :: dims3 = (/10/)    ! Dataset dimensions
041   INTEGER(HSIZE_T), DIMENSION(1) :: data_dims3
042   INTEGER, DIMENSION(10) :: dset_data3
043 
044   ! Misc variables (e.g., loop counters)
045   INTEGER :: error   ! Error flag
046   INTEGER :: i,j
047   ! =====================================================================
048 
049   ! Initialize the dset_data array
050   data_dims1(1) = 100
051   rank = 1
052   DO i = 1, 100
053      dset_data1(i) = i
054   END DO
055 
056   ! Initialize Fortran interface
057   CALL h5open_f(error)
058   ! Create a new file
059   CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
060 
061   ! Create dataspace 1 (the dataset is next); "dspace1_id" is returned
062   CALL h5screate_simple_f(rank, dims1, dspace1_id, error)
063   ! Create dataset 1 with default properties; "dset1_id" is returned
064   CALL h5dcreate_f(file_id, dsetname1, H5T_NATIVE_INTEGER, dspace1_id, &
065                    dset1_id, error)
066   ! Write dataset 1
067   CALL h5dwrite_f(dset1_id, H5T_NATIVE_INTEGER, dset_data1, data_dims1, &
068                   error)
069   ! Close access to dataset 1
070   CALL h5dclose_f(dset1_id, error)
071   ! Close access to data space 1
072   CALL h5sclose_f(dspace1_id, error)
073 
074   ! Create a group in the HDF5 file
075   CALL h5gcreate_f(file_id, groupname, group_id, error)
076   ! Close the group
077   CALL h5gclose_f(group_id, error)
078 
079   ! Initialize the data for dataset 2
080   data_dims2(1) = 50
081   DO i = 1, 50
082      dset_data2(i) = 1.0
083   END DO
084   ! Create dataspace 2
085   CALL h5screate_simple_f(rank, dims2, dspace2_id, error)
086   ! Create dataset 2 with default properties
087   CALL h5dcreate_f(file_id, dsetname2, H5T_NATIVE_REAL, dspace2_id, &
088                    dset2_id, error)
089   ! Write dataset 2
090   CALL h5dwrite_f(dset2_id, H5T_NATIVE_REAL, dset_data2, data_dims2, &
091                   error)
092   ! Close access to dataset 2
093   CALL h5dclose_f(dset2_id, error)
094   ! Close access to data space 2
095   CALL h5sclose_f(dspace2_id, error)
096 
097   ! Create a group in the HDF5 file
098   CALL h5gcreate_f(file_id, groupname3, group3_id, error)
099   ! Close the group
100   CALL h5gclose_f(group3_id, error)
101 
102   ! Initialize the data for dataset 3
103   data_dims3(1) = 10
104   DO i = 1, 10
105      dset_data3(i) = i + 3
106   END DO
107   ! Create dataspace 3
108   CALL h5screate_simple_f(rank, dims3, dspace3_id, error)
109   ! Create dataset 3 with default properties
110   CALL h5dcreate_f(file_id, dsetname3, H5T_NATIVE_INTEGER, &
111                    dspace3_id, dset3_id, error)
112   ! Write dataset 3
113   CALL h5dwrite_f(dset3_id, H5T_NATIVE_INTEGER, dset_data3, data_dims3, &
114                   error)
115   ! Close access to dataset 3
116   CALL h5dclose_f(dset3_id, error)
117   ! Close access to data space 3
118   CALL h5sclose_f(dspace3_id, error)
119 
120   ! Close the file
121   CALL h5fclose_f(file_id, error)
122   ! Close FORTRAN interface
123   CALL h5close_f(error)
124 END PROGRAM TEST
Notice that the code uses some predefined HDF5 kinds and constants (e.g., HID_T, H5F_ACC_TRUNC_F) that are necessary to use the library. Also note that this isn't “good” coding, in that the error variable is never checked after a subroutine call; the code is just an example, and I wanted to keep it short in the interest of space.
The basic process of using HDF5 in Fortran is pretty logical. To begin, you initialize or enable the Fortran interface (line 57); then, you create a file (line 59) and start creating objects.
The first object to be created is a dataset in the root (/) group (lines 61–72), but first, you have to create the dataspace (line 62) and then the dataset (lines 64–65). Lines 67–68 write the data to the dataset. To reverse the process, first close the dataset (line 70) and then the dataspace (line 72).
The general approach for writing a dataset to an HDF5 file using Fortran is the following:
- Create a dataspace
- Create a dataset in the dataspace
- Write the data to the dataset
- Close the dataset
- Close the dataspace
You could easily write a function in Fortran 90 for all these steps if you desired.
In the rest of the code, the other datasets are written to the file. One interesting thing to note is that when using these subroutines, you have to give the full path to the dataset, including the group in which it resides. With the h5py Python module, you can instead write to a group by using a method of that specific group object.
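To see the contrast in h5py terms, both styles from Listing 1 produce the same layout in the file; a quick sketch (the file name twostyles.hdf5 is just for this example):

import h5py

f = h5py.File("twostyles.hdf5", "w")

# Style 1: create the group, then use the group object's own method
grp = f.create_group("subgroup")
dset_a = grp.create_dataset("another_dataset", (50,), dtype='f')

# Style 2: give the full path; intermediate groups are created automatically
dset_b = f.create_dataset("subgroup2/dataset_three", (10,), dtype='i')

print dset_a.name   # /subgroup/another_dataset
print dset_b.name   # /subgroup2/dataset_three

f.close()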
After running the Fortran code, which prints nothing itself, a quick experiment is to run the test2.py script from the Python section against the file the Fortran program created:
[laytonjb@laytonjb-Lenovo-G50-45 FORTRAN]$ ./test2.py
mydataset
subgroup
subgroup2
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three
If you compare this to the output from the Python code, you will see that they are the same.
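As a further check, you can read back the values the Fortran program wrote from Python, because the file is the same regardless of which language produced it. A short sketch (assuming the Fortran executable wrote mytestfile.hdf5 in the current directory):

import h5py

f = h5py.File("mytestfile.hdf5", "r")

# The Fortran loop stored the values 1..100 in mydataset
print "first five values = ", f["mydataset"][0:5]

# dataset_three was filled with i + 3 for i = 1..10
print "dataset_three     = ", f["subgroup2/dataset_three"][...]

f.close()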
Summary
HDF5 has many features that make it probably the most used standard file format in HPC today. It's flexible, it's multiplatform, it has a large number of language interfaces, and it's easy to use.
In this article, I showed how easy HDF5 is to use with a couple of languages – Python and Fortran. Python is not a language the HDF Group directly supports with their distribution, so I used a third-party interface, h5py, a Pythonic interface to HDF5 that is very simple to use, as demonstrated in the quick example.
I used Fortran to represent compiled languages. Fortran is one of the languages that the HDF Group supports with their distribution. Writing data to and reading data from an HDF5 file in Fortran is not difficult, although it takes a bit more work than in Python.
The hierarchical nature of the HDF5 format makes writing and reading data very natural in both Python and Fortran. Although the examples here are simple, it is possible to manipulate complicated data structures in either language and perform I/O to and from an HDF5 file very efficiently.
One of the key attributes of HDF5 that I didn't mention is that Parallel HDF5 is included in the source code and is available through a configure option. That is, several processes, either on the same node or on different nodes, can write to the same file at the same time. This capability can reduce the time an application spends on I/O, because all of the processes are performing a portion of the I/O. In the next article, I want to talk about using HDF5 with parallel applications.