Parallelizing and memoizing Python programs with Joblib
A Library for Many Jobs
In recent years, new programming concepts have enriched the computer world. Instead of increasing processor speed, the number of processors has grown in many data centers. Parallel processing supports the handling of large amounts of data but also often requires a delicate transition from traditional, sequential procedures to specially adapted methods. The Joblib Python library [1] saves you much error-prone programming in typical procedures such as caching and parallelization.
A number of complex tasks come to mind when you think about parallel processing. Large data sets in which each entry is independent of the others are an excellent choice for simultaneous processing by many CPUs (Figure 1). Tasks like this are referred to as "embarrassingly" parallel. Where exactly the term comes from is unclear, but it does suggest that converting an algorithm to a parallelized version should not take long.
Experienced developers know, however, that problems can occur in everyday programming practice in any new implementation and that you can quickly get bogged down in the implementation details. The Joblib module, an easy solution for embarrassingly parallel tasks, offers a Parallel class, which requires an arbitrary function that takes exactly one argument.
Parallel Decoration
To help Parallel cooperate with the function in question (I'll call it f(x)), Joblib comes with a delayed() method, which acts as a decorator. Listing 1 shows a simple example of an implementation of f(x) that returns x unchanged. The for loop shown in Listing 1 iterates over a list l and passes the individual values to f(x); each list item in l results in a separate job.
Listing 1
Joblib: Embarrassingly Parallel
from joblib import Parallel, delayed

def f(x):
    return x

l = range(5)
results = Parallel(n_jobs=-1)(delayed(f)(i) for i in l)
The most interesting part of the work is handled by the anonymous Parallel object, which is generated on the fly. It distributes the calls to f(x) across the computer's various CPUs or processor cores. The n_jobs argument determines how many it uses. By default, this is set to 1, so that Parallel only starts a single subprocess. Setting it to -1 uses all the available cores, -2 leaves one core unused, -3 leaves two unused, and so on. Alternatively, n_jobs takes a positive integer that directly defines the number of processes to use.

The value of n_jobs can also be higher than the number of available physical cores; the Parallel class simply starts the number of Python processes defined by n_jobs, and the operating system lets them run side by side. Incidentally, this also means that the exchange of global variables between the individual jobs is impossible, because different operating system processes cannot directly communicate with one another. Parallel bypasses this limitation by serializing and caching the necessary objects.
The optimum number of processes depends primarily on the type of tasks to be performed. If your bottleneck is reading and writing data to the local hard disk or across the network, rather than processor power, the number of processes can be higher. As a rule of thumb, you can go for the number of available processor cores times 1.5; however, if each process fully loads a CPU permanently, you will not want to exceed the number of available physical processors.
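As a quick illustration of the n_jobs settings described above, the following sketch (my own example, not from the Joblib documentation) runs the same trivial job with one worker, with all cores, and with all cores but one; os.cpu_count() reports how many cores are available:

```python
import os
from joblib import Parallel, delayed

def square(x):
    # A trivial, self-contained job: each call is independent of the others.
    return x * x

data = range(8)

# n_jobs=1: a single subprocess (the default); -1: all cores; -2: all but one.
for n_jobs in (1, -1, -2):
    results = Parallel(n_jobs=n_jobs)(delayed(square)(i) for i in data)
    print(n_jobs, results)

print("cores available:", os.cpu_count())
```

Regardless of the n_jobs setting, Parallel returns the results in the original input order, so all three runs print the same list.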
See How They Run
Additionally, the Parallel class offers an optional verbose argument with regular output of status messages that illustrate the overall progress. The messages show the number of processed and remaining jobs and, if possible, the estimated remaining and elapsed time.

The verbose option is set to 0 by default; you can set it to an arbitrary positive number to increase the output frequency. The higher the value of verbose, the more intermediate steps Joblib outputs. Listing 2 shows typical output.
Listing 2
Parallel with Status Reports
Parallel(n_jobs=2, verbose=5)(delayed(f)(i) for i in l)
[Parallel(n_jobs=2)]: Done    1 out of  181 | elapsed: 0.0s remaining: 4.5s
[Parallel(n_jobs=2)]: Done  198 out of 1000 | elapsed: 1.2s remaining: 4.8s
[Parallel(n_jobs=2)]: Done  399 out of 1000 | elapsed: 2.3s remaining: 3.5s
[Parallel(n_jobs=2)]: Done  600 out of 1000 | elapsed: 3.4s remaining: 2.3s
[Parallel(n_jobs=2)]: Done  801 out of 1000 | elapsed: 4.5s remaining: 1.1s
[Parallel(n_jobs=2)]: Done 1000 out of 1000 | elapsed: 5.5s finished
The exact number of interim reports varies. At the beginning of execution, it is often still unclear how many jobs are pending in total, so this number is only an approximation. If you set verbose to a value above 10, Parallel outputs the current status after each iteration. Additionally, the argument offers the option of redirecting the output: If you set verbose to a value of more than 50, Parallel writes status reports to standard output. For lower values, Parallel uses stderr – that is, the error channel of the active shell.
A third, optional argument that Parallel takes is pre_dispatch, which defines how many of the jobs the class should queue up for immediate processing. By default, pre_dispatch is set to 'all', and Parallel directly loads all the list items into memory. However, if processing consumes a large amount of memory, a lower value provides an opportunity to save RAM. To do this, you can enter a positive integer.
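The following sketch (my own, not from the article's listings) feeds Parallel from a generator rather than a list and limits pre_dispatch, so only a small batch of input items is materialized at any one time; the exact memory savings naturally depend on how large each item is:

```python
from joblib import Parallel, delayed

def double(x):
    return 2 * x

# A generator: input items are produced on demand, not held in a list.
items = (i for i in range(1000))

# Queue up only a few jobs at a time instead of pre-loading all 1,000.
results = Parallel(n_jobs=2, pre_dispatch=4)(
    delayed(double)(i) for i in items
)
```

The results still come back as one complete, ordered list; pre_dispatch only throttles how far ahead of the workers the input generator is consumed.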
Convenient Multiprocessing Module
With its Parallel class, Joblib essentially provides a convenient interface for the Python multiprocessing module. It supports the same functionality, but the combination of Parallel and delayed() reduces the implementation overhead of simple parallelization tasks to a one-liner. Additionally, status outputs and configuration options are available, each controlled by a single argument.
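For comparison, here is a sketch of roughly the same job as Listing 1 written directly against the standard library's multiprocessing module; the run_pool() helper is my own naming, not part of either API:

```python
from multiprocessing import Pool

def f(x):
    # The same trivial job as in Listing 1.
    return x

def run_pool():
    # Pool.map distributes the calls to f across worker processes,
    # much like Parallel(n_jobs=2)(delayed(f)(i) for i in l) does.
    with Pool(processes=2) as pool:
        return pool.map(f, range(5))

if __name__ == '__main__':
    print(run_pool())  # [0, 1, 2, 3, 4]
```

Even in this minimal form, you must manage the pool yourself and keep the worker function importable at module level; progress reporting and options like pre_dispatch would require additional hand-written code.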