Easy Peasy Parallel: Julia Distributed Arrays

In this final article on the Julia language, we look at a few other useful features, including distributed arrays and the potential for GPU programming.

In parts 1 and 2 of this series, I introduced the Julia language and highlighted its parallel computing aspects. In this final article, which by no means exhausts the subject, I introduce a few more aspects of Julia. In case you missed the first two installments, Julia is, among other things, designed for high-performance computing: open source, fast, parallel, scalable, easy to learn, and extensible. It fills a void in HPC by letting users “tinker” with hardware and software.

To assist your exploration of Julia, extremely good documentation is available online. If you want to build Julia, it is best to pull the latest release from the web. Currently, Julia builds are supported on the following platforms:

  • GNU/Linux: x86/64 (64-bit); x86 (32-bit).
  • Darwin/OS X: x86/64 (64-bit); x86 (32-bit).
  • FreeBSD: x86/64 (64-bit); x86 (32-bit).

The examples used in this column were built and run on a Limulus personal cluster running Scientific Linux 6.2 on an Intel i5-2400S with 4GB of memory. If you don't want to bother building Julia but would like to play with it over the web, check out the online Julia version (do not enter Your Name or Your Session; just click the Try Julia Now bar). The previous column had instructions on how to build Julia locally.

Why Glue is Important

Before we return to some of the HPC or parallel features of Julia, a few utility features are worth noting. These features are popular in other interpreted languages and allow integration with existing programs and libraries. The key advantage of this approach is that you do not reinvent the wheel; you leverage what has already been done. Consider the following two examples:

julia> run(`echo hello`)
hello

julia> run(`date`)
Tue Jul 31 15:13:25 EDT 2012

Julia treats shell commands and external programs differently from most languages. Instead of “shelling out” and immediately running the command, back-ticks create a command object to represent the command. You can use this object to connect the command to others via pipes, run it, and read or write to it. Additionally, when the command is run, Julia does not capture its output unless you specify otherwise; instead, the output goes to stdout. Interestingly, the command is never run with a shell. Julia parses the command syntax directly, interpolating variables and splitting on words as the shell would, then runs the command as an immediate child process (using fork and exec calls). Check Running External Programs in the documentation for more detailed information.
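
To make the deferred execution concrete, here is a minimal sketch (the file name is hypothetical) showing that back-ticks only build a command object, with variables interpolated by Julia itself rather than by a shell; nothing runs until the object is handed to run:

julia> file = "results.txt";    # hypothetical file name

julia> cmd = `wc -l $file`;     # builds a command object; $file is interpolated by Julia, not a shell

julia> run(cmd);                # only now does the command execute; its output goes to stdout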

As a simple example, if you wanted to capture stdout and use it in your Julia program, you could simply do the following:

julia> d=readall(`date`)
"Tue Jul 31 15:51:58 EDT 2012\n"

julia> print(d)
Tue Jul 31 15:51:58 EDT 2012

In addition to efficient shell integration, Julia also has a very nice C and Fortran interface. Typically, every new language has some form of interface to C that allows the program to “call out” to C libraries. Fortran “call-outs” are not so common, but Julia supports these as well. Traditionally, such call-out hooks can be quite cumbersome; Julia's method is refreshingly simple. First, using dlopen, a shared library can be opened (an opaque handle is returned). For example, you load libc as follows:

julia> libc = dlopen("/lib64/libc.so.6")
Ptr{Void} @0x0000000003416b70

Next, the dlsym command can be used to look up a symbol from a shared library handle and return a pointer to the function:

julia> libc_clock = dlsym(libc, :clock)
Ptr{Void} @0x0000003ff689aab0

Finally, the ccall function can be used to call a function from the library:

julia> t = ccall(dlsym(libc, :clock), Int32, ())

julia> print(t)
2310000
julia>

This interface makes the reuse of existing code extremely easy, provided it was written in C or Fortran (C++ support is limited). Also, when calling a Fortran function, all inputs must be passed by reference. Consult the Calling C and Fortran Code documentation for more information about the ccall example above.
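
The same pattern extends to functions that take arguments: the third argument to ccall is a tuple of argument types, followed by the argument values themselves. As a minimal sketch, assuming libm lives at /lib64/libm.so.6 on your system:

julia> libm = dlopen("/lib64/libm.so.6");

julia> f = ccall(dlsym(libm, :floor), Float64, (Float64,), 4.7)
4.0

Note the trailing comma in (Float64,), which makes it a one-element tuple of argument types.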

Back Into The Deep End

In a previous article, I mentioned some of the ways Julia uses multiple cores and distributed resources, as well as how to submit jobs to the Sun Grid Engine from within Julia. Another helpful feature is distributed arrays.

Large computations are often organized around large arrays of data. In these cases, a particularly natural way to obtain parallelism is to distribute arrays among several processors. This combines the memory resources of multiple machines, allowing the creation and use of arrays that would be too large to fit on one machine. Each processor operates on the part of the array it owns, providing a ready answer to the question of how a program should be divided among machines.

A distributed array is logically a single array, but pieces of it are stored on different processors. This means whole-array operations such as matrix multiply, scalar-array multiplication, and so on use the same syntax as with local arrays, and the parallelism is invisible. In some cases, it is possible to obtain useful parallelism just by changing a local array to a distributed array.
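
For example, assuming d is a distributed matrix (such as the DArray j created below), the familiar operators apply unchanged; a minimal sketch:

julia> s = d + d;    # elementwise add; each processor works on its own piece

julia> m = 2*d;      # scalar-array multiply; the result is again distributed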

Julia distributed arrays are implemented by the DArray type. A DArray has an element type and dimensions just like a Julia array, but it also needs an additional property: the dimension along which data is distributed. Data can be distributed among processors in many ways, but at this time Julia keeps things simple and only allows distribution along a single dimension. For example, if a two-dimensional DArray is distributed in dimension 1, each processor holds a certain range of rows; if it is distributed in dimension 2, each processor holds a certain range of columns.

The following example should help illustrate the use of distributed arrays. First, I start Julia with four processors (cores) and create an 8x8 array with random data.

$ julia -q -p 4

julia> a = rand(8,8);

julia> print(a);
8x8 Float64 Array:
 0.763627    0.419707   0.166449  0.183866   0.716457  0.426178  0.383575  0.954777
 0.578849    0.611247   0.299826  0.280822   0.61021   0.621362  0.120129  0.504368
 0.5186      0.0922815  0.662646  0.325087   0.231755  0.980836  0.652299  0.354965
 0.00255302  0.0975773  0.205617  0.0159737  0.700618  0.931636  0.933376  0.927181
 0.590559    0.968848   0.337702  0.479313   0.395335  0.537262  0.979579  0.850812
 0.257384    0.91292    0.446213  0.540407   0.489686  0.541117  0.731585  0.373413
 0.885526    0.277182   0.897286  0.0168283  0.913601  0.928691  0.50249   0.587306
 0.558348    0.674816   0.446026  0.63746    0.502827  0.223377  0.651495  0.0971838

Next, I’ll convert this array to a distributed array, distributing it over columns with the distribute(array, dim) function, where dim is the distributed dimension. The localize(array) function returns the local portion (here, 8 rows by 2 columns) of my now-distributed array.

julia> j=distribute(a,2);

julia> print(localize(j));
8x2 Float64 Array:
 0.763627    0.419707
 0.578849    0.611247
 0.5186      0.0922815
 0.00255302  0.0975773
 0.590559    0.968848
 0.257384    0.91292
 0.885526    0.277182
 0.558348    0.674816

Likewise, I can distribute the array over rows (dimension 1); each processor then holds a local piece of 2 rows by 8 columns:

julia> j=distribute(a,1);

julia> print(localize(j));
2x8 Float64 Array:
 0.763627  0.419707  0.166449  0.183866  0.716457  0.426178  0.383575  0.954777
 0.578849  0.611247  0.299826  0.280822  0.61021   0.621362  0.120129  0.504368

To get the ID of the processor holding index i in the distributed dimension, I can use the owner(array,index) function. For instance, in the above distributed array, j, which is distributed by rows (two rows per processor), I can find out what processor holds which row of array j (recall I started Julia with four cores):

julia> print(owner(j,1));
1
julia> print(owner(j,2));
1
julia> print(owner(j,5));
3
julia> print(owner(j,7));
4

Similarly, a processor can ask: “What is the range of my indexes?”

julia> print(myindexes(j));
(1:2,1:8)
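
Each processor sees only its own range. To print every processor's slice, you can spawn the question everywhere; a quick sketch using @spawnat (the output lines can arrive in any order):

julia> for p = 1:nprocs()
         @spawnat p println(myindexes(j))
       end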

Because the array is a global object, any processor can see and change the array, although it might not be efficient to act on a distributed array from a processor that does not “own” the data. For instance, elements of the distributed array j (created above) can be changed from any processor, even outside its local piece (in this case, rows 1-2 and columns 1-8):

julia>  j[2,2]=0;

julia>  j[8,8]=0;

julia> print(j);
8x8 Float64 DArray:
 0.763627    0.419707   0.166449  0.183866   0.716457  0.426178  0.383575  0.954777
 0.578849    0.0        0.299826  0.280822   0.61021   0.621362  0.120129  0.504368
 0.5186      0.0922815  0.662646  0.325087   0.231755  0.980836  0.652299  0.354965
 0.00255302  0.0975773  0.205617  0.0159737  0.700618  0.931636  0.933376  0.927181
 0.590559    0.968848   0.337702  0.479313   0.395335  0.537262  0.979579  0.850812
 0.257384    0.91292    0.446213  0.540407   0.489686  0.541117  0.731585  0.373413
 0.885526    0.277182   0.897286  0.0168283  0.913601  0.928691  0.50249   0.587306
 0.558348    0.674816   0.446026  0.63746    0.502827  0.223377  0.651495  0.0

Keep in mind that this can invalidate data and create race conditions if not used carefully (i.e., only use local indexes in parallel portions of your program). Finally, a distributed array can be converted to a local array as follows:

n = convert(Array, j);

Because distributed arrays are global objects, they can be used by spawning jobs on other processors (as demonstrated in the last article). In particular, with the use of @spawnat and owner(), work can be directed to the processor that holds a particular slice of an array. Also, distributed arrays can be declared directly without the need to convert them from basic arrays, as sketched below. You can find more details in the parallel computing section of the documentation; however, the simplicity of distributed arrays should be obvious from the above examples.
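
As a rough sketch of both ideas, the snippet below declares a distributed array directly and then directs work to the processor owning a particular row; it assumes a drand constructor for creating random distributed arrays directly (consult the documentation for the available constructors):

julia> d = drand(8,8);                            # declare a distributed array directly (assumed constructor)

julia> r = @spawnat owner(d,5) sum(localize(d));  # run on the processor that owns row 5

julia> fetch(r)                                   # retrieve the partial sum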

And There Is More!

Recently, NVidia announced that it is contributing its CUDA compiler technology to the LLVM project as open source. Basically, this means languages like Julia that use the LLVM compiler can take advantage of GPU acceleration using the CUDA model. Not long after this announcement, a thread appeared on the Julia Developers List discussing the possibility that, in the not too distant future, users will be able to program NVidia GPUs from within Julia. Normally, such a desirable feature would take quite a bit of time to implement, but the inclusion of CUDA support in LLVM should make it arrive much sooner.

Even as a young language, Julia is making a big impression on HPC users. Its programming simplicity, coupled with performance and easy integration, has long been missing from the HPC toolbox. Having the ability to tinker with multiple cores, cluster nodes, and possibly NVidia GPUs, all in one easy-to-use language, is a real breakthrough in HPC. Julia promises to open up whole new vistas of HPC software applications and development. Finally!