Simple HDF5 in Python and Fortran

Using the popular HDF5 I/O library with Python and Fortran.

HDF5 is one of the most popular I/O libraries in HPC. It uses a familiar filesystem-like hierarchy; is flexible, self-describing, and portable across operating systems and hardware; can store text and binary data; can be used by parallel applications (MPI); has a large number of language interfaces; and is fairly easy to use.

In a previous article, I introduced HDF5, focusing on the concepts and strengths. In this article, I want to give a quick introduction to using HDF5 through some simple code examples. The goal is not to dive deep into HDF5 but to illustrate the basics of using it. I'll start with Python because it is widely used and the HDF5 Python library, h5py, is easy to use and understand. I also want to illustrate how a compiled program works with HDF5, and for that I use Fortran.

h5py Python Library

H5py is the dominant Python interface to HDF5. It is included with many Python distributions and with most Linux distributions. For the examples here, I use the Anaconda Python distribution for Python 2.7.

The examples I use in this article are fairly simple and are derived from the Quick Start page on the h5py website. The first example simply illustrates a few concepts, such as:

  • Opening an HDF5 file for writing
  • Creating data sets
  • Creating groups

The simple Python script in Listing 1 incorporates these concepts.

Listing 1: Starting Out with h5py

01   #!/home/laytonjb/anaconda2/bin/python
02 
03   import h5py
04   import numpy as np
05 
06   # ===================
07   # Main Python section
08   # ===================
09   #
10  if __name__ == '__main__':
11 
12      f = h5py.File("mytestfile.hdf5", "w")
13  
14      dset = f.create_dataset("mydataset", (100,), dtype='i')
15    
16      dset[...] = np.arange(100)
17 
18      print "dset.shape = ",dset.shape
19 
20      print "dset.dtype = ",dset.dtype
21 
22      print "dset.name = ",dset.name
23   
24      print "f.name = ",f.name
25   
26      grp = f.create_group("subgroup")
27    
28      dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
29      print "dset2.name = ",dset2.name
30    
31      dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')
32      print "dset3.name = ",dset3.name
33   
34  # end if

The first h5py command is line 12, which opens a file for writing. If the file exists, it is overwritten; if it doesn't exist, it is created. Remember that HDF5 is really a container for data objects. When you create a file, the library creates a number of defaults, such as the root group (/). Therefore the file will be non-zero in size, even if no data or attributes are written into it.
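To see that a freshly created file is not empty, a minimal sketch can create a file, close it without writing anything, and check its size on disk. The sketch is written in Python 3 syntax, and the temporary directory and the file name empty.hdf5 are my own choices for illustration, not part of the original example:

```python
import os
import tempfile

import h5py

# Work in a scratch directory so nothing in the current directory is touched.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "empty.hdf5")   # hypothetical file name

# "w" creates the file, truncating it if it already exists.
f = h5py.File(path, "w")
root_name = f.name      # the root group is created automatically
n_members = len(f)      # no datasets or groups have been added yet
f.close()

size_bytes = os.path.getsize(path)
print("root group =", root_name)
print("members    =", n_members)
print("file size  =", size_bytes, "bytes")
```

Besides "w", h5py accepts "r" (read only), "r+" (read/write, file must exist), "a" (read/write, create if missing), and "w-" or "x" (create, fail if the file exists), so "w" is the mode to avoid when existing data must be kept.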

After the file is opened and created, a data set with 100 integers is created (mydataset in line 14). At this point, only the object for the dataset is created in the file (dataspace). Line 16 puts data into the data object using NumPy. Notice that you only assign data to the object, and the h5py library takes care of updating the HDF5 file. Data already in the file can be modified the same way.
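As a rough illustration of modifying data in place, the following sketch (Python 3 syntax; the file name modify.hdf5 and the temporary directory are assumptions made for the example) fills a dataset, overwrites a slice of it, and then reopens the file read-only to confirm that the change reached the file:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "modify.hdf5")  # hypothetical file name

with h5py.File(path, "w") as f:
    dset = f.create_dataset("mydataset", (100,), dtype="i")
    dset[...] = np.arange(100)   # fill the whole dataset with 0..99
    dset[0:5] = -1               # overwrite only the first five elements

# Reopen read-only to confirm the changes are in the file.
with h5py.File(path, "r") as f:
    first_ten = f["mydataset"][0:10]         # slicing returns a NumPy array
    total = int(f["mydataset"][...].sum())   # read everything and sum it

print(first_ten)
print(total)
```

Slicing a dataset reads or writes only the selected elements, so a large dataset does not have to be loaded into memory to change a small part of it.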

Recall that in Python, almost everything is an object, so it has properties. Lines 18, 20, 22, and 24 print out some of the properties of the HDF5 file (line 24) as well as the first data set (lines 18, 20, and 22). Because HDF5 is object based, it fits well with the object nature of Python.

On line 26, a subgroup to the root group (subgroup) is created; then, on line 28, a new data set that resides in this subgroup is created using a float data type that starts with 50 elements. Notice that a method of the group object is used for this.

On line 31, a new dataset is created. What is unique is that the dataset is created in a new subgroup named subgroup2. H5py will automatically create the subgroup if it doesn't exist.
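The two ways of placing a dataset in a group can be compared side by side. In this sketch (Python 3 syntax; the file name groups.hdf5 and the dataset name second_dataset are hypothetical), one dataset is created through the group object, one through a full path from the file object, and one in a group that does not yet exist:

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "groups.hdf5")  # hypothetical file name

with h5py.File(path, "w") as f:
    grp = f.create_group("subgroup")

    # Same group, two routes: the group's method or a path from the root.
    a = grp.create_dataset("another_dataset", (50,), dtype="f")
    b = f.create_dataset("subgroup/second_dataset", (50,), dtype="f")

    # "subgroup2" does not exist yet; h5py creates it on the way.
    c = f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")

    # Collect the results while the file is still open.
    names = (a.name, b.name, c.name)
    auto_created = isinstance(f["subgroup2"], h5py.Group)
    parent_of_b = b.parent.name

print(names)
print(auto_created, parent_of_b)
```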

The output from this example Python script is shown below:

[laytonjb@laytonjb-Lenovo-G50-45 PYTHON]$ ./test.py
dset.shape =  (100,)
dset.dtype =  int32
dset.name =  /mydataset
f.name =  /
dset2.name =  /subgroup/another_dataset
dset3.name =  /subgroup2/dataset_three

Notice the size of the integers. The 'i' type code requested in line 14 maps to a 32-bit integer (int32).
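If a different size is needed, the dtype argument can name it explicitly. The sketch below (Python 3 syntax; the file and dataset names are made up for the example) creates datasets with a few common type codes and records the dtype that h5py reports for each:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "dtypes.hdf5")  # hypothetical file name

with h5py.File(path, "w") as f:
    # Map each requested type to the dtype reported by the new dataset.
    dtypes = {
        "i": str(f.create_dataset("c_int", (100,), dtype="i").dtype),
        "i8": str(f.create_dataset("int64", (100,), dtype="i8").dtype),
        "f": str(f.create_dataset("float32", (50,), dtype="f").dtype),
        "float64": str(f.create_dataset("float64", (50,), dtype=np.float64).dtype),
    }

for requested, stored in dtypes.items():
    print(requested, "->", stored)
```

Either NumPy type codes (such as 'i8') or NumPy types (such as np.float64) can be passed, which makes the stored precision a deliberate choice rather than a default.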

Another short Python script reads the HDF5 file and prints the names of the objects in it. This can be done fairly easily using the h5py method visit. This method recursively walks the HDF5 file so you can discover the objects in the file, including groups and data sets. Listing 2 is a simple script for walking the HDF5 file and printing the names of the objects.

Listing 2: Walking the HDF5 File

01   #!/home/laytonjb/anaconda2/bin/python
02 
03   import h5py
04   import numpy as np
05 
06   def printname(name):
07       print name
08 
09   # ===================
10  # Main Python section
11  # ===================
12  #
13  if __name__ == '__main__':
14 
15      f = h5py.File("mytestfile.hdf5", "r")
16 
17      for name in f:
18          print name
19      # end for
20 
21      f.visit(printname)
22 
23  # end if

The output from the script is below:

[laytonjb@laytonjb-Lenovo-G50-45 PYTHON]$ ./test2.py
mydataset
subgroup
subgroup2
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three

You can find more information in the HDF5 documentation. The Quick Start guide also has more examples of accessing HDF5 files from Python.
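The visit method reports names only. Its companion, visititems, also passes the object itself to the callback, so the walk can tell groups from datasets and report shapes and types. The following sketch (Python 3 syntax) first rebuilds a small file with the same layout as Listing 1 so that it runs on its own; the file name walk.hdf5 and the callback name describe are assumptions for illustration:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "walk.hdf5")  # hypothetical file name

# Rebuild the layout from Listing 1 so the sketch is self-contained.
with h5py.File(path, "w") as f:
    f.create_dataset("mydataset", data=np.arange(100, dtype="i"))
    f.create_dataset("subgroup/another_dataset", (50,), dtype="f")
    f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")

found = []

def describe(name, obj):
    # visititems passes the path (relative to the starting group) and the object.
    if isinstance(obj, h5py.Dataset):
        found.append((name, "dataset", obj.shape, str(obj.dtype)))
    else:
        found.append((name, "group", None, None))
    # Returning None tells h5py to keep walking.

with h5py.File(path, "r") as f:
    f.visititems(describe)

for entry in found:
    print(entry)
```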

Fortran and HDF5

H5py is a very Python-centric library allowing HDF5 to be used in a very flexible manner. Compiled languages are a little different. Using HDF5 with compiled languages is not quite as easy as with Python, but it is not difficult. The developers of HDF5 have created a number of functions and subroutines to be used for manipulating data and objects in an HDF5 file that make programming straightforward.

For this article, a CentOS 7.3 OS was used with the default Fortran compiler (gfortran) and the HDF5 library that is part of the distribution. It's not difficult to build a Fortran executable with gfortran and the HDF5 library that comes with the distribution. The generic command line below illustrates how to accomplish this,

$ gfortran code.f90 -fintrinsic-modules-path /usr/lib64/gfortran/modules \
   -lhdf5_fortran -o exe

where code.f90 is the source file and exe is the resultant binary.

The HDF Group has provided some sample Fortran 90 code to get started, as well as more complex examples. With the use of these examples, Listing 3 shows a Fortran 90 version of the first sample Python code.

Listing 3: Sample Fortran 90 Code

001   PROGRAM TEST
002  
003       USE HDF5 ! This module contains all necessary HDF5 modules
004  
005       IMPLICIT NONE
006
007       ! Names (file and HDF5 objects)
008       CHARACTER(LEN=15), PARAMETER :: filename = "mytestfile.hdf5" ! File name
009       CHARACTER(LEN=9), PARAMETER :: dsetname1 = "mydataset" ! Dataset name
010      CHARACTER(LEN=8), PARAMETER :: groupname = "subgroup" ! Sub-Group 1 name
011      CHARACTER(LEN=9), PARAMETER :: groupname3 = "subgroup2" ! Sub-Group 3 name
012      ! Dataset 2 name
013      CHARACTER(LEN=24), PARAMETER :: dsetname2 = "subgroup/another_dataset"
014      ! Dataset 3 name
015      CHARACTER(LEN=23), PARAMETER :: dsetname3 = "subgroup2/dataset_three"
016      
017      ! Identifiers
018      INTEGER(HID_T) :: file_id       ! File identifier
019      INTEGER(HID_T) :: group_id      ! Group identifier
020      INTEGER(HID_T) :: group3_id     ! Group 3 identifier
021      INTEGER(HID_T) :: dset1_id      ! Dataset 1 identifier
022      INTEGER(HID_T) :: dset2_id      ! Dataset 2 identifier
023      INTEGER(HID_T) :: dset3_id      ! Dataset 3 identifier
024      INTEGER(HID_T) :: dspace1_id    ! Dataspace 1 identifier
025      INTEGER(HID_T) :: dspace2_id    ! Dataspace 2 identifier
026      INTEGER(HID_T) :: dspace3_id    ! Dataspace 3 identifier
027    
028      ! Integer array
029      INTEGER :: rank                 ! Dataset rank
030      INTEGER(HSIZE_T), DIMENSION(1) :: dims1 = (/100/) ! Dataset dimensions
031      INTEGER(HSIZE_T), DIMENSION(1) :: data_dims1
032      INTEGER, DIMENSION(100) :: dset_data1   ! Data buffers
033 
034      ! FP array
035      INTEGER(HSIZE_T), DIMENSION(1) :: dims2 = (/50/)
036      INTEGER(HSIZE_T), DIMENSION(1) :: data_dims2
037      REAL, DIMENSION(50) :: dset_data2
038    
039      ! Integer array for dataset_three
040      INTEGER(HSIZE_T), DIMENSION(1) :: dims3 = (/10/) ! Dataset dimensions
041      INTEGER(HSIZE_T), DIMENSION(1) :: data_dims3     ! Data buffer dimensions
042      INTEGER, DIMENSION(10) :: dset_data3
043    
044      ! Misc variables (e.g. loop counters)
045      INTEGER :: error ! Error flag
046      INTEGER :: i,j
047  ! =====================================================================
048 
049      ! Initialize the dset_data array 
050      data_dims1(1) = 100
051      rank = 1
052      DO i = 1, 100
053          dset_data1(i) = i
054      END DO
055    
056      ! Initialize Fortran interface
057      CALL h5open_f(error)   
058      ! Create a new file
059      CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
060 
061      ! Create dataspace 1 (the dataset is next) "dspace1_id" is returned
062      CALL h5screate_simple_f(rank, dims1, dspace1_id, error)
063      ! Create dataset 1 with default properties "dset1_id" is returned
064      CALL h5dcreate_f(file_id, dsetname1, H5T_NATIVE_INTEGER, dspace1_id, &
065                       dset1_id, error)
066      ! Write dataset 1
067      CALL h5dwrite_f(dset1_id, H5T_NATIVE_INTEGER, dset_data1, data_dims1, &
068                      error)
069      ! Close access to dataset 1
070      CALL h5dclose_f(dset1_id, error)
071      ! Close access to data space 1
072      CALL h5sclose_f(dspace1_id, error)
073    
074      ! Create a group in the HDF5 file
075      CALL h5gcreate_f(file_id, groupname, group_id, error)
076      ! Close the group
077      CALL h5gclose_f(group_id, error)
078 
079      ! Create dataspace 2 (the dataset is next)
080      data_dims2(1) = 50
081      DO i = 1, 50
082          dset_data2(i) = 1.0
083      END DO
084      ! Create dataspace 2
085      CALL h5screate_simple_f(rank, dims2, dspace2_id, error)
086      ! Create dataset 2 with default properties
087      CALL h5dcreate_f(file_id, dsetname2, H5T_NATIVE_REAL, dspace2_id, &
088                       dset2_id, error)
089      ! Write dataset 2
090      CALL h5dwrite_f(dset2_id, H5T_NATIVE_REAL, dset_data2, data_dims2, &
091                      error)
092      ! Close access to dataset 2
093      CALL h5dclose_f(dset2_id, error)
094      ! Close access to data space 2
095      CALL h5sclose_f(dspace2_id, error)
096    
097      ! Create a group in the HDF5 file
098      CALL h5gcreate_f(file_id, groupname3, group3_id, error)
099      ! Close the group
100     CALL h5gclose_f(group3_id, error)
101    
102     ! Create dataspace 3
103     data_dims3(1) = 10
104     DO i = 1, 10
105         dset_data3(i) = i + 3
106     END DO
107     ! Create dataspace 3
108     CALL h5screate_simple_f(rank, dims3, dspace3_id, error)
109     ! Create dataset 3 with default properties
110     CALL h5dcreate_f(file_id, dsetname3, H5T_NATIVE_INTEGER,  &
111                      dspace3_id, dset3_id, error)
112     ! Write dataset 3
113     CALL h5dwrite_f(dset3_id, H5T_NATIVE_INTEGER, dset_data3, data_dims3, &
114                     error)
115     ! Close access to dataset 3
116     CALL h5dclose_f(dset3_id, error)
117     ! Close access to data space 3
118     CALL h5sclose_f(dspace3_id, error)
119    
120     ! Close the file
121     CALL h5fclose_f(file_id, error)
122     ! Close FORTRAN interface
123     CALL h5close_f(error)
124  END PROGRAM TEST

Notice that the code uses some predefined HDF5 variables that are necessary to use the library. Also note that this isn't “good” coding, in that the error variable is not checked when returning from a subroutine call. This code is just an example, and I wanted to keep it short in the interest of space.

The basic process of using HDF5 in Fortran is pretty logical. To begin, you initialize the Fortran interface (line 57); then, you create a file (line 59) and start creating objects.

The first object to be created is a dataset in the root (/) group (lines 61–72), but first, you have to create the dataspace (line 62) and then the dataset (lines 64–65). Lines 67–68 write the data to the dataset. To reverse the process, first close the dataset (line 70) and then the dataspace (line 72).

The general approach for writing a dataset to an HDF5 file using Fortran is the following:

  • Create a dataspace
  • Create a dataset that uses the dataspace
  • Write the data to the dataset
  • Close the dataset
  • Close the dataspace

You could easily write a function in Fortran 90 for all these steps if you desired.

In the rest of the code, the other datasets are written to the file. One interesting thing to note is that Listing 3 passes the file identifier to these subroutines, so each dataset name has to be the full path to the group where the dataset is written. With the h5py Python module, you can write to a group by using the method associated with the specific group.

After running the Fortran code, which has no output, a quick experiment is to run the test2.py script from the Python section against the Fortran output:

[laytonjb@laytonjb-Lenovo-G50-45 FORTRAN]$ ./test2.py
mydataset
subgroup
subgroup2
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three

If you compare this to the output from the Python code, you will see that they are the same.
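The same comparison can be made programmatically. The sketch below (Python 3 syntax) is only an illustration: because it has to run on its own, it writes two stand-in files from Python, one holding 0..99 as in Listing 1 and one holding 1..100 as in Listing 3. The helper names write_stand_in and layout and the file names are hypothetical. With real output, the two paths would instead point at the mytestfile.hdf5 files written by the Python and Fortran programs.

```python
import os
import tempfile

import h5py
import numpy as np


def write_stand_in(path, first_value):
    # Hypothetical stand-in writer with the same layout as the listings.
    with h5py.File(path, "w") as f:
        data = np.arange(first_value, first_value + 100, dtype="i")
        f.create_dataset("mydataset", data=data)
        f.create_dataset("subgroup/another_dataset", (50,), dtype="f")
        f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")


def layout(path):
    # Map every object name to "group" or to (shape, dtype) for datasets.
    found = {}

    def collect(name, obj):
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))
        else:
            found[name] = "group"

    with h5py.File(path, "r") as f:
        f.visititems(collect)
    return found


tmpdir = tempfile.mkdtemp()
py_path = os.path.join(tmpdir, "from_python.hdf5")     # stand-in for Listing 1 output
f90_path = os.path.join(tmpdir, "from_fortran.hdf5")   # stand-in for Listing 3 output
write_stand_in(py_path, 0)
write_stand_in(f90_path, 1)

same_layout = layout(py_path) == layout(f90_path)
with h5py.File(py_path, "r") as a, h5py.File(f90_path, "r") as b:
    same_data = bool(np.array_equal(a["mydataset"][...], b["mydataset"][...]))

print("same layout:", same_layout)
print("same data:  ", same_data)
```

Matching names show that the two files have the same structure; the stored values can still differ, as they do between the two listings.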

Summary

HDF5 has many features that make it probably the most used standard file format in HPC today. It's flexible, it's multiplatform, it has a large number of language interfaces, and it's easy to use.

In this article, I showed how easy HDF5 is to use with a couple of languages – Python and Fortran. Python is not a language the HDF Group directly supports with their distribution, so I used a third-party interface, h5py, a Pythonic interface to HDF5 that is very simple to use, as demonstrated in the quick example.

I used Fortran to represent compiled languages. Fortran is one of the languages that the HDF Group supports with their distribution. Writing and reading data to an HDF5 file from Fortran is not difficult, although it takes a bit more work than Python.

In both cases, the hierarchical nature of the HDF5 format makes writing and reading data very natural. Although the examples here are simple, it is possible to manipulate complicated data structures with either language and perform I/O very efficiently to and from an HDF5 file.

One of the key capabilities of HDF5 that I didn't cover here is Parallel HDF5, which is included in the source code and is available through a configure option. With it, several processes, either on the same node or on different nodes, can write to the same file at the same time. This capability can reduce the time an application spends on I/O, because all of the processes are performing a portion of the I/O. In the next article, I want to talk about using HDF5 with parallel applications.

Tags: Fortran, h5py, HDF5, Python