Parallelizing and memoizing Python programs with Joblib
A Library for Many Jobs
In Memory
The previous example made use of a small and practically useless f(x) function. Some functions, however, regardless of whether they are parallelized, perform very time- and resource-consuming computations. If the input values are not known before the program starts, such a function might process the same arguments multiple times; this overhead is unnecessary.
For more complicated functions, therefore, it makes sense to save the results (memoization). If the function is invoked again with the same argument, it just grabs the existing result rather than computing it again. Here, too, programmers can look to Joblib for help – this time in the form of the Memory class.
Joblib provides a cache() method that serves as a decorator for arbitrary functions with one or more arguments. The results of the decorated function are then saved on disk by the Memory object. On each subsequent call, the function checks whether the same argument, or the same combination of arguments, has already been processed and, if so, returns the stored result directly (Figure 2). Listing 3 shows an implementation, again with a primitive sample function f(x).
Listing 3
Store Function Results
from joblib import Memory

memory = Memory(cachedir='/tmp/example/')

@memory.cache
def f(x):
    return x
The computed results are stored on disk in the joblib directory below the directory defined by the cachedir parameter. Each memoized function has its own subdirectory there, which, among other things, contains the original Python source code of the function in the func_code.py file.
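To see this layout in practice, a small sketch like the following walks the cache directory after a single call (it assumes joblib is installed; the temporary directory and verbose=0 are just choices for a tidy demonstration):

```python
# Sketch: inspect the cache layout that Memory creates on disk.
import os
import tempfile
from joblib import Memory

cachedir = tempfile.mkdtemp()
memory = Memory(cachedir, verbose=0)  # positional argument works across joblib versions

@memory.cache
def f(x):
    return x

f(3)  # first call computes the result and writes it to the cache

# Each memoized function gets its own subdirectory, including func_code.py
# with the function's source code.
cached_files = [name for _, _, files in os.walk(cachedir) for name in files]
print(cached_files)
```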
Memory for Names
A separate subdirectory also exists for each different argument – or, depending on the function, each different combination of several arguments. It is named for a hash value of the arguments passed in and contains two files: input_args.json and output.pkl. The first file shows the input arguments in the human-readable JSON format, and the second is the corresponding result in the binary pickle format used by Python to serialize and store objects.
This structure makes accessing the cached results of the memoized function pleasantly transparent. The Python pickle module, for example, parses the results in the Python interpreter (note that pickle files must be opened in binary mode):

import pickle
result = pickle.load(open("output.pkl", "rb"))
Memory does not clean up on its own at the end of the program, which means the stored results are still available the next time the program starts. However, this also means you need to free the disk space yourself, if necessary, either by calling the clear() method of the Memory object or simply by deleting the folder.
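Emptying the cache programmatically can be sketched as follows (this assumes joblib is installed; the temporary directory merely stands in for a real cache location):

```python
# Sketch: populate the on-disk cache, then empty it with clear().
import os
import tempfile
from joblib import Memory

cachedir = tempfile.mkdtemp()
memory = Memory(cachedir, verbose=0)

@memory.cache
def f(x):
    return x

f(42)                      # populates the on-disk cache
memory.clear(warn=False)   # removes all stored results
```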
Additionally, you should note that Memory bases its reading of the stored results exclusively on the function name. If you change the implementation, Memory might erroneously return the results previously generated by the old version of the function the next time you start it. Furthermore, Memory does not work with lambda functions – that is, nameless functions that are defined directly in the call.
As a general rule, use of the Memory class is recommended for functions whose results are so large they stress your computer's RAM. If a frequently called function only generates small results, however, creating a dictionary-based in-memory cache makes more sense. A sample implementation is shown on the ActiveState website [2].
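A minimal dictionary-based cache can be sketched as a decorator (this is an illustrative stand-in, not the ActiveState recipe; Python's built-in functools.lru_cache implements the same idea with an eviction policy):

```python
import functools

def memoize(func):
    """Cache results of func in an in-memory dictionary, keyed by arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:   # compute only on a cache miss
            cache[args] = func(*args)
        return cache[args]

    return wrapper

calls = 0

@memoize
def square(x):
    global calls
    calls += 1
    return x * x

print(square(4), square(4), calls)  # → 16 16 1: second call hits the cache
```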
Fast and Frugal
On request, the Memory class uses a procedure that saves a lot of time for large stored objects: memory mapping. The core idea behind this approach is to write a file as a bit-for-bit copy of an object in memory to the hard disk. When the software opens the object again, the operating system maps the relevant part of the file into contiguous memory so that the objects it contains are directly available. This approach keeps the system from having to allocate memory, which can involve many system calls.
Joblib uses the memory mapping implementation provided by the NumPy [3] Python module. The constructor of the Memory class uses the optional mmap_mode parameter to accept the same arguments as the numpy.memmap class: r+, r, w+, and c.
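As a sketch of this option (assuming joblib and NumPy are installed): with mmap_mode='r', a cache hit is typically returned as a read-only memory-mapped array rather than a fully loaded copy:

```python
# Sketch: cache a NumPy array and reload it via memory mapping.
import tempfile
import numpy as np
from joblib import Memory

cachedir = tempfile.mkdtemp()
# mmap_mode is passed through to numpy.memmap when cached arrays are reloaded
memory = Memory(cachedir, mmap_mode='r', verbose=0)

@memory.cache
def make_data(n):
    return np.arange(n)

first = make_data(1000)   # computed normally and written to the cache
second = make_data(1000)  # reloaded from disk, typically as a memory map
print(second[:5])
```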