Multiprocessing in Python with Fortran and OpenMP

Creating Python modules with Fortran OpenMP code makes all available cores accessible to Python functions.

Python is arguably one of the most popular languages in use today and is being used more and more in HPC. However, Python code by itself is limited to a single thread because of the almighty global interpreter lock (GIL), “a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.” Given servers with 32 cores per socket, it seems somewhat wasteful to use just one core when programming in Python.

A number of tools and modules allow you to write multiprocessing or multithreaded code, including the multiprocessing module that comes in the standard library, Parallel Python, variations on queuing systems such as 0MQ (zeromq), and the mpi4py bindings of the Message Passing Interface (MPI) standard for writing MPI code in Python.

Another cool aspect of Python is that it’s possible to write extensions in other languages that can be loaded as modules using “wrapper” tools. For example, you can write extensions in Fortran and C as Python modules with tools such as SWIG, Pyfort, and F2PY. Writing parallel functions in Python is very difficult, but it’s fairly straightforward in C and Fortran with the use of a variety of abstractions, including OpenMP, which provides a path for Python functions to utilize all of the available cores.

In this article, I take a look at what’s possible by writing Python modules in Fortran that use all of the cores on a node or a subset of nodes (user controllable). To achieve this, I use F2PY to build the module.

F2PY

The tool named f2py, Fortran-to-Python, creates a connection between the two languages that calls Fortran 77/90/95 code, Fortran 90/95 modules, and C functions from Python. F2PY accesses Fortran module data from Python, allowing Python functions to be called from Fortran or C (callbacks) and automatically handling the difference between multidimensional Fortran and Python array data storage order.

You can find some good documentation on the NumPy and SciPy documentation page and some good basic introductions online.

As a first example, the simple “hello world” hw.f90 OpenMP code is used as a starting point for creating a Python module. The modifications to the initial code are simple. The only change is to convert it from a “program” to a “subroutine.” No arguments are passed to the subroutine in this case, and all output (write) statements go to standard output (stdout). The code for the example can be found on the Florida State University Department of Scientific Computing site under the name hello_openmp.f90 (GNU LGPL license). Listing 1 shows a modified subroutine ready to be built into a Python module.

Listing 1: OpenMP Hello World (hw.f90)

subroutine hello()
 
!*****************************************************************************80
!
!! MAIN is the main program for HELLO.
!
! Discussion:
!
! HELLO is a "Hello, World" program for OpenMP.
!
! Licensing:
!
! This code is distributed under the GNU LGPL license. 
!
! Modified:
!
! 23 June 2010
!
! Author:
!
! John Burkardt
!
 use omp_lib
 
 implicit none
 
 integer ( kind = 4 ) id
 real ( kind = 8 ) wtime
 
 write ( *, '(a)' ) ' '
 write ( *, '(a)' ) 'HELLO_OPENMP'
 write ( *, '(a)' ) ' FORTRAN90/OpenMP version'
 
 wtime = omp_get_wtime ( )
 
 write ( *, '(a,i8)' ) &
 ' The number of processors available = ', omp_get_num_procs ( )
 write ( *, '(a,i8)' ) &
 ' The number of threads available = ', omp_get_max_threads ( )
!
! OUTSIDE THE PARALLEL REGION, have each thread say hello (there's only 1).
!
 id = omp_get_thread_num ( )
 
 write ( *, '(a)' ) ' '
 write ( *, '(a)' ) ' OUTSIDE the parallel region.'
 write ( *, '(a)' ) ' '
 
 write ( *, '(a,i8,a,i8)' ) ' HELLO from process ', id
 
 write ( *, '(a)' ) ' '
 write ( *, '(a)' ) ' Going INSIDE the parallel region:'
 write ( *, '(a)' ) ' '
!
! INSIDE THE PARALLEL REGION, have each thread say hello.
!
!$omp parallel &
!$omp private ( id )
 id = omp_get_thread_num ( )
 
 write ( *, '(a,i8,a,i8)' ) ' HELLO from process ', id
 
!$omp end parallel
!
! Finish up by measuring the elapsed time.
!
 wtime = omp_get_wtime ( ) - wtime
 
 write ( *, '(a)' ) ' '
 write ( *, '(a)' ) ' Back OUTSIDE the parallel region.'
 write ( *, '(a)' ) ' '
 write ( *, '(a,g14.6)' ) ' Elapsed wall clock time = ', wtime
!
! Terminate.
!
 write ( *, '(a)' ) ' '
 write ( *, '(a)' ) 'HELLO_OPENMP'
 write ( *, '(a)' ) ' Normal end of execution.'
 
 !stop
end

For testing, I used f2py to build the module, using the GFortran compiler version 7.1, and Conda version 4.3.21 with Python 2.7.13; f2py is installed using conda.

Building the module is fairly easy by using “the quick and smart way” (Listing 2). The output from F2PY is somewhat verbose even run without “verbose” turned on."

Listing 2: Building Hello OpenMP

f2py --f90flags=-fopenmp -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ -lgomp -c -m hw hw.f90
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building extension "hw" sources
f2py options: []
f2py:> /tmp/tmpKa8a4p/src.linux-x86_64-2.7/hwmodule.c
creating /tmp/tmpKa8a4p/src.linux-x86_64-2.7
Reading fortran codes...
 Reading file 'hw.f90' (format:free)
Post-processing...
 Block: hw
 Block: hello
In: :hw:hw.f90:hello
get_useparameters: no module omp_lib info used by hello
Post-processing (stage 2)...
Building modules...
 Building module "hw"...
 Constructing wrapper function "hello"...
 hello()
 Wrote C/API module "hw" to file "/tmp/tmpKa8a4p/src.linux-x86_64-2.7/hwmodule.c"
 adding '/tmp/tmpKa8a4p/src.linux-x86_64-2.7/fortranobject.c' to sources.
 adding '/tmp/tmpKa8a4p/src.linux-x86_64-2.7' to include_dirs.
 
...
 
compiling Fortran sources
Fortran f77 compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fPIC -O3 -funroll-loops
Fortran f90 compiler: /usr/bin/gfortran -fopenmp -fPIC -O3 -funroll-loops
Fortran fix compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fopenmp -fPIC -O3 -funroll-loops
compile options: '-I/tmp/tmpKa8a4p/src.linux-x86_64-2.7 \
                  -I/home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/core/include \
                  -I/home/laytonjb/anaconda2/include/python2.7 -c'
gfortran:f90: hw.f90
/usr/bin/gfortran -Wall -g -Wall -g -shared \
   /tmp/tmpKa8a4p/tmp/tmpKa8a4p/src.linux-x86_64-2.7/hwmodule.o \
   /tmp/tmpKa8a4p/tmp/tmpKa8a4p/src.linux-x86_64-2.7/fortranobject.o \
   /tmp/tmpKa8a4p/hw.o -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ \
   -L/home/laytonjb/anaconda2/lib -lgomp -lpython2.7 -lgfortran -o ./hw.so
Removing build directory /tmp/tmpKa8a4p

The build creates a file named hw.so (shareable object or shareable library), which is in a form that can be “imported” into Python. The Python module can now be tested. Listing 3 shows the output from Python when the module is used. Notice that the Python module is named hw and the function is hw.hello (hello is a member function of the object hw). Also, notice that all four processors responded, so the OpenMP portion of the code is working.

Listing 3: Output of hw Python Module

[laytonjb@laytonjb-Lenovo-G50-45 F2PY]$ python
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import hw
>>> print hw.hello.__doc__
hello()
 
Wrapper for ``hello``.
 
 
>>> print hw.hello()
 
HELLO_OPENMP
 FORTRAN90/OpenMP version
 The number of processors available = 4
 The number of threads available = 4
  
 OUTSIDE the parallel region.
  
 HELLO from process 0
  
 Going INSIDE the parallel region:
  
 HELLO from process 0
 HELLO from process 2
 HELLO from process 3
 HELLO from process 1
 
 Back OUTSIDE the parallel region.
 
 Elapsed wall clock time = 0.606714E-03
 
HELLO_OPENMP
 Normal end of execution.
None
>>>

In the next, more practical example, the Fortran code finds the minimum of a list (array) by breaking it into parts and assigning each part to a thread using OpenMP (Listing 4). It is converted to a function that returns the minimum. For inputs, you need to pass the list (array) along with the length of the array, with some help to F2PY by giving it a little more information about the inputs to the function. In this case, it helps to explain that the input array is only used as an input and not an output; also, the function defines the return value.

Listing 4: Finding a List (Array) Minimum (test.f90)

REAL*8 FUNCTION OMPMIN(X, NMAX)
! 
! X is array, N is length of array
!
 INTEGER NMAX
 REAL*8 X(NMAX)
!f2py real(8), intent(in), dimension(nmax) :: x
 
 INTEGER :: NUM_THREADS
 INTEGER :: I, N
 INTEGER :: ITHREAD, ISTART, ISTOP
  
 REAL*8 MY_MIN(10)
 REAL*8 RMIN
 
 INTEGER, EXTERNAL :: OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS 
 
 
!$OMP PARALLEL PRIVATE(I, ITHREAD, ISTART, ISTOP, NUM_THREADS, N)
 
 NUM_THREADS = OMP_GET_NUM_THREADS()
 N = NMAX/NUM_THREADS
 
 ITHREAD = OMP_GET_THREAD_NUM()
 
 IF ( ITHREAD == 0 ) THEN
 PRINT *, "num_threads = ", NUM_THREADS
 PRINT *, "n = ", N
 ENDIF
 
 ! ----------------------------------
 ! Find my own starting index
 ! ----------------------------------
 ISTART = ITHREAD * N + 1 ! Array start at 1
 
 ! ----------------------------------
 ! Find my own stopping index
 ! ----------------------------------
 
 ISTOP = NMAX
 IF (ITHREAD == (NUM_THREADS-1) ) THEN
 ISTOP = ISTART + N
 ENDIF
 
 ! ----------------------------------
 ! Find my own min
 ! ----------------------------------
 MY_MIN(ITHREAD+1) = X(ISTART)
 
 DO I = ISTART+1, ISTOP
 IF ( X(I) < MY_MIN(ITHREAD+1) ) THEN
 MY_MIN(ITHREAD+1) = X(I)
 END IF 
 END DO
 
!$OMP END PARALLEL
 
 RMIN = MY_MIN(1)
 
 DO I = 2, NUM_THREADS
 IF ( RMIN < MY_MIN(I) ) THEN
 RMIN = MY_MIN(I)
 END IF 
 END DO
 
 
 PRINT *, "min = ", RMIN
 OMPMIN = RMIN
 
 RETURN
 
 END

The PRINT statements were left in the code to make sure it was performing correctly (i.e., for debugging). The F2PY command to build the Python module and its output, in part, are shown in Listing 5. F2PY creates a file named test.so (shareable object or shareable library). The simple Python script in Listing 6 tests this module.

Listing 5: Building test.f90

$ f2py --f90flags=-fopenmp -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ -lgomp -c -m test test.f90
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building extension "test" sources
f2py options: []
f2py:> /tmp/tmpTtj5vP/src.linux-x86_64-2.7/testmodule.c
creating /tmp/tmpTtj5vP/src.linux-x86_64-2.7
Reading fortran codes...
 Reading file 'test.f90' (format:free)
Post-processing...
 Block: test
 Block: ompmin
Post-processing (stage 2)...
Building modules...
 Building module "test"...
 Creating wrapper for Fortran function "ompmin"("ompmin")...
 Constructing wrapper function "ompmin"...
 ompmin = ompmin(x,[nmax])
 Wrote C/API module "test" to file "/tmp/tmpTtj5vP/src.linux-x86_64-2.7/testmodule.c"
 Fortran 77 wrappers are saved to "/tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.f"
 adding '/tmp/tmpTtj5vP/src.linux-x86_64-2.7/fortranobject.c' to sources.
 adding '/tmp/tmpTtj5vP/src.linux-x86_64-2.7' to include_dirs.
copying /home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/f2py/src/fortranobject.c -> \
        /tmp/tmpTtj5vP/src.linux-x86_64-2.7
copying /home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/f2py/src/fortranobject.h -> \
        /tmp/tmpTtj5vP/src.linux-x86_64-2.7
 adding '/tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.f' to sources.
 
...
 
compiling Fortran sources
Fortran f77 compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fPIC -O3 -funroll-loops
Fortran f90 compiler: /usr/bin/gfortran -fopenmp -fPIC -O3 -funroll-loops
Fortran fix compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fopenmp -fPIC -O3 -funroll-loops
compile options: '-I/tmp/tmpTtj5vP/src.linux-x86_64-2.7 \
                  -I/home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/core/include \
                  -I/home/laytonjb/anaconda2/include/python2.7 -c'
gfortran:f90: test.f90
gfortran:f77: /tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.f
/usr/bin/gfortran -Wall -g -Wall -g -shared \
   /tmp/tmpTtj5vP/tmp/tmpTtj5vP/src.linux-x86_64-2.7/testmodule.o \
   /tmp/tmpTtj5vP/tmp/tmpTtj5vP/src.linux-x86_64-2.7/fortranobject.o \
   /tmp/tmpTtj5vP/test.o \
   /tmp/tmpTtj5vP/tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.o \
   -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ -L/home/laytonjb/anaconda2/lib \
   -lgomp -lpython2.7 -lgfortran -o ./test.so
Removing build directory /tmp/tmpTtj5vP

Listing 6: Running test.so

#!/home/laytonjb/anaconda2/bin/python
import test
 
def frange(x, y, jump):
 while x < y:
 yield x
 x += jump
 
if __name__ == '__main__':
 
 A=list(frange(1.0,100000000.0,1.0))
 A.append(0.1)
 print test.ompmin(A)
 
# end

The correct minimum is 0.1 (the last element of the array). The output from running the script is:

$ ./mintest.py 
 num_threads = 4
 n = 25000000
 min = 0.10000000000000001 
0.1

Bingo! The correct answer using OpenMP!

Summary

Python is somewhat limited in its ability to use all of the cores that are available on today’s systems, and most ways around the limitation seem somewhat “anti-Pythonic.” However, one of the really cool things about Python is that it’s fairly easy to write extensions. Using tools such as F2PY, you can simply take Fortran or C code and create a Python module.

Python extensions can be written to take advantage of multiple cores using a variety of programming models. A fairly straightforward programming model is OpenMP. In this article, I illustrated two simple examples of creating Python modules with Fortran OpenMP code. With the help of F2PY, it is not difficult to take multicore code in another language and make it accessible from Python.