Multiprocessing in Python with Fortran and OpenMP
Den of Snakes
Python is arguably one of the most popular languages in use today and is being used more and more in HPC. However, Python code by itself is limited to a single thread because of the almighty global interpreter lock (GIL) [1], "a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once." Given servers with 32 cores per socket, it seems somewhat wasteful to use just one core when programming in Python.
A number of tools and modules allow you to write multiprocessing or multithreaded code, including the multiprocessing [2] module that comes in the standard library, Parallel Python [3], variations on queuing systems such as 0MQ [4] (zeromq ), and the mpi4py [5] bindings of the Message Passing Interface (MPI) standard for writing MPI code in Python.
Another cool aspect of Python is that it's possible to write extensions in other languages that can be loaded as modules using "wrapper" tools. For example, you can write extensions in Fortran and C as Python modules with tools such as SWIG [6], Pyfort [7], and F2PY [8]. Writing parallel functions in Python is very difficult, but it's fairly straightforward in C and Fortran with the use of a variety of abstractions, including OpenMP, which provides a path for Python functions to utilize all of the available cores.
In this article, I take a look at what's possible by writing Python modules in Fortran that use all of the cores on a node or a subset of nodes (user controllable). To achieve this, I use F2PY to build the module.
F2PY
The tool named F2PY [8], Fortran-to-Python, creates a connection between the two languages that calls Fortran 77/90/95 code, Fortran 90/95 modules, and C functions from Python. F2PY accesses Fortran module data from Python, allowing Python functions to be called from Fortran or C (callbacks) and automatically handling the difference between multidimensional Fortran and Python array data storage order.
You can find some good documentation on the NumPy and SciPy documentation page [8] and some good basic introductions online [9].
As a first example, the simple "hello world" hw.f90
OpenMP code is used as a starting point for creating a Python module. The modifications to the initial code are simple. The only change is to convert it from a "program" to a "subroutine." No arguments are passed to the subroutine in this case, and all output (write
) statements go to standard output (stdout). The code for the example can be found on the Florida State University Department of Scientific Computing site [10] under the name hello_openmp.f90
(GNU LGPL license) [11]. Listing 1 shows a modified subroutine ready to be built into a Python module.
Listing 1
OpenMP Hello World (hw.f90)
subroutine hello() !*****************************************************************************80 ! !! MAIN is the main program for HELLO. ! ! Discussion: ! ! HELLO is a "Hello, World" program for OpenMP. ! ! Licensing: ! ! This code is distributed under the GNU LGPL license. ! ! Modified: ! ! 23 June 2010 ! ! Author: ! ! John Burkardt ! use omp_lib implicit none integer ( kind = 4 ) id real ( kind = 8 ) wtime write ( *, '(a)' ) ' ' write ( *, '(a)' ) 'HELLO_OPENMP' write ( *, '(a)' ) ' FORTRAN90/OpenMP version' wtime = omp_get_wtime ( ) write ( *, '(a,i8)' ) & ' The number of processors available = ', omp_get_num_procs ( ) write ( *, '(a,i8)' ) & ' The number of threads available = ', omp_get_max_threads ( ) ! ! OUTSIDE THE PARALLEL REGION, have each thread say hello (there's only 1). ! id = omp_get_thread_num ( ) write ( *, '(a)' ) ' ' write ( *, '(a)' ) ' OUTSIDE the parallel region.' write ( *, '(a)' ) ' ' write ( *, '(a,i8,a,i8)' ) ' HELLO from process ', id write ( *, '(a)' ) ' ' write ( *, '(a)' ) ' Going INSIDE the parallel region:' write ( *, '(a)' ) ' ' ! ! INSIDE THE PARALLEL REGION, have each thread say hello. ! !$omp parallel & !$omp private ( id ) id = omp_get_thread_num ( ) write ( *, '(a,i8,a,i8)' ) ' HELLO from process ', id !$omp end parallel ! ! Finish up by measuring the elapsed time. ! wtime = omp_get_wtime ( ) - wtime write ( *, '(a)' ) ' ' write ( *, '(a)' ) ' Back OUTSIDE the parallel region.' write ( *, '(a)' ) ' ' write ( *, '(a,g14.6)' ) ' Elapsed wall clock time = ', wtime ! ! Terminate. ! write ( *, '(a)' ) ' ' write ( *, '(a)' ) 'HELLO_OPENMP' write ( *, '(a)' ) ' Normal end of execution.' !stop end
For testing, I used f2py
to build the module, using the GFortran compiler version 7.1, and Conda [12] version 4.3.21 with Python 2.7.13; f2py
is installed using conda
. Building the module is fairly easy by using "the quick and smart way" [13](Listing 2). The output from f2py
is somewhat verbose even run without "verbose" turned on."
Listing 2
Building Hello OpenMP
f2py --f90flags=-fopenmp -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ -lgomp -c -m hw hw.f90 running build running config_cc unifing config_cc, config, build_clib, build_ext, build commands --compiler options running config_fc unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options running build_src build_src building extension "hw" sources f2py options: [] f2py:> /tmp/tmpKa8a4p/src.linux-x86_64-2.7/hwmodule.c creating /tmp/tmpKa8a4p/src.linux-x86_64-2.7 Reading fortran codes... Reading file 'hw.f90' (format:free) Post-processing... Block: hw Block: hello In: :hw:hw.f90:hello get_useparameters: no module omp_lib info used by hello Post-processing (stage 2)... Building modules... Building module "hw"... Constructing wrapper function "hello"... hello() Wrote C/API module "hw" to file "/tmp/tmpKa8a4p/src.linux-x86_64-2.7/hwmodule.c" adding '/tmp/tmpKa8a4p/src.linux-x86_64-2.7/fortranobject.c' to sources. adding '/tmp/tmpKa8a4p/src.linux-x86_64-2.7' to include_dirs. ... compiling Fortran sources Fortran f77 compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fPIC -O3 -funroll-loops Fortran f90 compiler: /usr/bin/gfortran -fopenmp -fPIC -O3 -funroll-loops Fortran fix compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fopenmp -fPIC -O3 -funroll-loops compile options: '-I/tmp/tmpKa8a4p/src.linux-x86_64-2.7 \ -I/home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/core/include \ -I/home/laytonjb/anaconda2/include/python2.7 -c' gfortran:f90: hw.f90 /usr/bin/gfortran -Wall -g -Wall -g -shared \ /tmp/tmpKa8a4p/tmp/tmpKa8a4p/src.linux-x86_64-2.7/hwmodule.o \ /tmp/tmpKa8a4p/tmp/tmpKa8a4p/src.linux-x86_64-2.7/fortranobject.o \ /tmp/tmpKa8a4p/hw.o -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ \ -L/home/laytonjb/anaconda2/lib -lgomp -lpython2.7 -lgfortran -o ./hw.so Removing build directory /tmp/tmpKa8a4p
The build creates a file named hw.so
(shareable object or shareable library), which is in a form that can be "imported" into Python. The Python module can now be tested. Listing 3 shows the output from Python when the module is used. Notice that the Python module is named hw
and the function is hw.hello
(hello
is a member function of the object hw
). Also, notice that all four processors responded, so the OpenMP portion of the code is working.
Listing 3
Output of hw Python Module
[laytonjb@laytonjb-Lenovo-G50-45 F2PY]$ python Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Anaconda is brought to you by Continuum Analytics. Please check out: http://continuum.io/thanks and https://anaconda.org >>> import hw >>> print hw.hello.__doc__ hello() Wrapper for ``hello``. >>> print hw.hello() HELLO_OPENMP FORTRAN90/OpenMP version The number of processors available = 4 The number of threads available = 4 OUTSIDE the parallel region. HELLO from process 0 Going INSIDE the parallel region: HELLO from process 0 HELLO from process 2 HELLO from process 3 HELLO from process 1 Back OUTSIDE the parallel region. Elapsed wall clock time = 0.606714E-03 HELLO_OPENMP Normal end of execution. None >>>
In the next, more practical example [14], the Fortran code finds the minimum of a list (array) by breaking it into parts and assigning each part to a thread using OpenMP (Listing 4). It is converted to a function that returns the minimum. For inputs, you need to pass the list (array) along with the length of the array, with some help to F2PY by giving it a little more information about the inputs to the function. In this case, it helps to explain that the input array is only used as an input and not an output; also, the function defines the return value.
Listing 4
Finding a List (Array) Minimum (test.f90)
REAL*8 FUNCTION OMPMIN(X, NMAX) ! ! X is array, N is length of array ! INTEGER NMAX REAL*8 X(NMAX) !f2py real(8), intent(in), dimension(nmax) :: x INTEGER :: NUM_THREADS INTEGER :: I, N INTEGER :: ITHREAD, ISTART, ISTOP REAL*8 MY_MIN(10) REAL*8 RMIN INTEGER, EXTERNAL :: OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS !$OMP PARALLEL PRIVATE(I, ITHREAD, ISTART, ISTOP, NUM_THREADS, N) NUM_THREADS = OMP_GET_NUM_THREADS() N = NMAX/NUM_THREADS ITHREAD = OMP_GET_THREAD_NUM() IF ( ITHREAD == 0 ) THEN PRINT *, "num_threads = ", NUM_THREADS PRINT *, "n = ", N ENDIF ! ---------------------------------- ! Find my own starting index ! ---------------------------------- ISTART = ITHREAD * N + 1 ! Array start at 1 ! ---------------------------------- ! Find my own stopping index ! ---------------------------------- ISTOP = NMAX IF (ITHREAD == (NUM_THREADS-1) ) THEN ISTOP = ISTART + N ENDIF ! ---------------------------------- ! Find my own min ! ---------------------------------- MY_MIN(ITHREAD+1) = X(ISTART) DO I = ISTART+1, ISTOP IF ( X(I) < MY_MIN(ITHREAD+1) ) THEN MY_MIN(ITHREAD+1) = X(I) END IF END DO !$OMP END PARALLEL RMIN = MY_MIN(1) DO I = 2, NUM_THREADS IF ( RMIN < MY_MIN(I) ) THEN RMIN = MY_MIN(I) END IF END DO PRINT *, "min = ", RMIN OMPMIN = RMIN RETURN END
The PRINT
statements were left in the code to make sure it was performing correctly (i.e., for debugging). The f2py
command to build the Python module and its output, in part, are shown in Listing 5. F2PY creates a file named test.so
(shareable object or shareable library). The simple Python script in Listing 6 tests this module.
Listing 5
Building test.f90
$ f2py --f90flags=-fopenmp -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ -lgomp -c -m test test.f90 running build running config_cc unifing config_cc, config, build_clib, build_ext, build commands --compiler options running config_fc unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options running build_src build_src building extension "test" sources f2py options: [] f2py:> /tmp/tmpTtj5vP/src.linux-x86_64-2.7/testmodule.c creating /tmp/tmpTtj5vP/src.linux-x86_64-2.7 Reading fortran codes... Reading file 'test.f90' (format:free) Post-processing... Block: test Block: ompmin Post-processing (stage 2)... Building modules... Building module "test"... Creating wrapper for Fortran function "ompmin"("ompmin")... Constructing wrapper function "ompmin"... ompmin = ompmin(x,[nmax]) Wrote C/API module "test" to file "/tmp/tmpTtj5vP/src.linux-x86_64-2.7/testmodule.c" Fortran 77 wrappers are saved to "/tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.f" adding '/tmp/tmpTtj5vP/src.linux-x86_64-2.7/fortranobject.c' to sources. adding '/tmp/tmpTtj5vP/src.linux-x86_64-2.7' to include_dirs. copying /home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/f2py/src/fortranobject.c -> \ /tmp/tmpTtj5vP/src.linux-x86_64-2.7 copying /home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/f2py/src/fortranobject.h -> \ /tmp/tmpTtj5vP/src.linux-x86_64-2.7 adding '/tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.f' to sources. ... compiling Fortran sources Fortran f77 compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fPIC -O3 -funroll-loops Fortran f90 compiler: /usr/bin/gfortran -fopenmp -fPIC -O3 -funroll-loops Fortran fix compiler: /usr/bin/gfortran -Wall -g -ffixed-form -fno-second-underscore -fopenmp -fPIC -O3 -funroll-loops compile options: '-I/tmp/tmpTtj5vP/src.linux-x86_64-2.7 \ -I/home/laytonjb/anaconda2/lib/python2.7/site-packages/numpy/core/include \ -I/home/laytonjb/anaconda2/include/python2.7 -c' gfortran:f90: test.f90 gfortran:f77: /tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.f /usr/bin/gfortran -Wall -g -Wall -g -shared \ /tmp/tmpTtj5vP/tmp/tmpTtj5vP/src.linux-x86_64-2.7/testmodule.o \ /tmp/tmpTtj5vP/tmp/tmpTtj5vP/src.linux-x86_64-2.7/fortranobject.o \ /tmp/tmpTtj5vP/test.o \ /tmp/tmpTtj5vP/tmp/tmpTtj5vP/src.linux-x86_64-2.7/test-f2pywrappers.o \ -L/usr/lib/gcc/x86_64-redhat-linux/4.8.2/ -L/home/laytonjb/anaconda2/lib \ -lgomp -lpython2.7 -lgfortran -o ./test.so Removing build directory /tmp/tmpTtj5vP
Listing 6
Running test.so
#!/home/laytonjb/anaconda2/bin/python import test def frange(x, y, jump): while x < y: yield x x += jump if __name__ == '__main__': A=list(frange(1.0,100000000.0,1.0)) A.append(0.1) print test.ompmin(A) # end@KE:
The correct minimum is 0.1 (the last element of the array). The output from running the script is:
$ ./mintest.py num_threads = 4 n = 25000000 min = 0.10000000000000001 0.1
Bingo! The correct answer using OpenMP!
Summary
Python is somewhat limited in its ability to use all of the cores that are available on today's systems, and most ways around the limitation seem somewhat "anti-Pythonic." However, one of the really cool things about Python is that it's fairly easy to write extensions. Using tools such as F2PY, you can simply take Fortran or C code and create a Python module.
Python extensions can be written to take advantage of multiple cores using a variety of programming models. A fairly straightforward programming model is OpenMP [15]. In this article, I illustrated two simple examples of creating Python modules with Fortran OpenMP code. With the help of F2PY, it is not difficult to take multicore code in another language and make it accessible from Python.
Infos
- GIL: https://wiki.python.org/moin/GlobalInterpreterLock
- Multiprocessing: https://pymotw.com/2/multiprocessing/
- Parallel Python: http://www.parallelpython.com/
- 0MQ: http://zeromq.org/bindings:python
- mpi4py: http://mpi4py.readthedocs.io/en/stable/index.html?highlight=bindings
- SWIG: http://www.swig.org/
- Pyfort: http://pyfortran.sourceforge.net/pyfort/pyfort_reference.htm
- F2PY Users Guide and Reference Manual: https://docs.scipy.org/doc/numpy-dev/f2py/
- Basic introductions online: http://www.aosc.umd.edu/~dkleist/docs/pythonTraining/Slides/F2Py_SSSO.pdf
- Florida State University Department of Scientific Computing site: https://people.sc.fsu.edu/~jburkardt/f_src/hello_openmp/hello_openmp.html
- GNU LGPL license: https://www.gnu.org/licenses/lgpl-3.0.txt
- Conda: https://conda.io/docs/index.html
- The quick and smart way: https://docs.scipy.org/doc/numpy-dev/f2py/getting-started.html#the-quick-and-smart-way
- More practical example: http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/91-pthreads/openMP-F90.html
- OpenMP: http://www.openmp.org/
Buy this article as PDF
(incl. VAT)