Tool Your HPC Systems for Data Analytics
I hesitated to use the phrase ``Big Data'' in the title because it's somewhat ill-defined (and some HPC people would have given me no end of grief), so I chose the more generic ``data analytics.'' I prefer this term because its basic definition refers to the process, not to the size of the data or to the ``three Vs'' [1] of Big Data: velocity, volume, and variety.
The definition I tend to favor is from TechTarget [2]: ``Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information.'' It doesn't mention the amount of data, although the implication is that there has to be enough to be meaningful. It says nothing about velocity, variety, or volume. It simply communicates the high-level process.
Another way to think of data analytics is as the combination of two concepts: ``data analysis'' and ``analytics.'' Data analysis [3] is very similar to data analytics, in that it is the process of massaging data with the goal of discovering useful information that can be used to suggest conclusions and support decision making. Analytics [4], on the other hand, is the discovery and communication of meaningful patterns in data. Even though one could argue that analytics is really a subset of data analysis, I prefer to combine the two terms so that the result covers everything from collecting the data in raw form to examining the data with algorithms or mathematics (typically implying computations) to look for possible information. I'm sure some people will disagree with me, and that's perfectly fine. We're blind men trying to define something that we can't easily see and that isn't easy to define even when you can see it. (Think of defining ``art,'' and you get the idea.)
Data analytics is the coming storm across the Oklahoma plains. You can see it miles away, and you had better get ready for it, or it will land on you with a pretty hard thump. The potential of data analytics (DA) has been fairly well documented. An easy example is PayPal, which uses DA and HPC for real-time fraud detection by continually adapting its algorithms and throwing a great deal of computational horsepower at the problem. I don't want to dive into the mathematics, statistics, or machine learning of DA tools; instead, I want to take a different approach and discuss some aspects of data analytics that affect one of the audiences of this magazine -- HPC people.
Specifically, I want to discuss some of the characteristics or tendencies of DA applications, methods, and tools, because these workloads are finding their way into HPC systems. I know one director of a large HPC center who gets at least three or four requests a week from users who want to perform data analytics on the HPC systems. Digging a little deeper, the HPC staff finds that these users are mostly not ``classic'' HPC users and that they bring their own tools and methods. Integrating their needs into existing HPC systems has proven to be more difficult than the staff expected. As a result, it might be a good idea to present some of the characteristics or tendencies of these tools and users so you can be prepared when the users start knocking on your door and sending you email. By the way, these tools might be running on your systems already without your knowing it.
Workload Characteristics
Before jumping in with both feet and listing all of the things that are needed in DA workloads, I think it's far better first to describe or characterize the bulk of DA workloads, which might reveal some indicators for what is needed. With these characteristics, I'm aiming for the ``center of mass.'' I'm sure many people can come up with counterexamples, as can I, but I'm trying to develop some generalizations that can be used as a starting point.
In the subsequent sections, I'll describe some major workload characteristics, and I'll begin with the languages used in data analytics.
New Languages
The classic languages of HPC, Fortran and C/C++, are used somewhat in data analytics, but a whole host of new languages and tools are used as well. A wide variety of languages show up, particularly because Big Data is so hyped, which means everyone is trying to jump in with their particular favorite. However, a few have risen to the top:
- R [5]
- Python [6]
- Julia (up and coming) [7]
- Java [8]
- Matlab [9] and Matlab-compatible tools (Octave [10], Scilab [11], etc.)
Java is the lingua franca of MapReduce [12] and Hadoop [13]. Many toolkits for data analytics are written in Java and can be interfaced with other languages.
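For readers who have not met the MapReduce paradigm, a rough single-process sketch might help. It is written in Python for brevity and does not use Hadoop or its Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to illustrate the three stages.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across nodes; here it is one dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of strings."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input records, one string per "line" of raw data.
    lines = ["big data is big", "data analytics is a process"]
    print(word_count(lines))
```

In a real deployment, the map and reduce functions run on many nodes and the framework handles the shuffle, but the division of labor is the same.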
Because data analytics is, for the most part, about statistical methods, R, the language of statistics, is very popular. If you know a little Python or some Perl or some Matlab, then learning R is not too difficult. It has a tremendous number of built-in functions, primarily for statistics, and great graphics for making charts. Several of its libraries also are appropriate for data analytics (Table 1).
Table 1: R Libraries and Helpers | ||
---|---|---|
Software | Description | Source |
Analytics libraries | ||
R/parallel | Add-on package that extends R with parallel computing capabilities | [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557021/] |
Rmpi | Wrapper to MPI | [http://www.stats.uwo.ca/faculty/yu/Rmpi/] |
HPC tools | R with BLAS, LAPACK, and MPI in Linux | [http://lostingeospace.blogspot.com/2012/06/r-and-hpc-blas-mpi-in-linux-environment.html] |
RHadoop [14] | R packages to manage and analyze data with Hadoop | [https://github.com/RevolutionAnalytics/RHadoop/wiki] |
Database tools [15] | ||
RSQLite [16] | R driver for SQLite | [http://cran.r-project.org/web/packages/RSQLite/index.html] |
rhbase [17] | Connectivity to HBASE | [https://github.com/RevolutionAnalytics/rhbase] |
graph | Package to handle graph data structures | [http://www.bioconductor.org/packages/devel/bioc/html/graph.html] |
neuralnet [18] | Training neural networks | [http://cran.r-project.org/web/packages/neuralnet/] |
Python is becoming one of the most popular programming languages. The language is well suited for numerical analysis and general programming. Although it comes with a great deal of capability, lots of add-ons extend Python in the DA kingdom (Table 2).
Table 2: Python Add-Ons | ||
---|---|---|
Software | Description | Source |
Pandas | Data analysis library | [http://pandas.pydata.org] |
scikit-learn | Machine learning tools | [http://scikit-learn.org/stable/] |
SciPy | Open source software for mathematics, science, and engineering | [http://www.scipy.org] |
NumPy | A library for array objects including tools for integrating C/C++ and Fortran code, linear algebra computations, Fourier transforms, and random number capabilities | [http://www.numpy.org] |
matplotlib | Plotting library | [http://matplotlib.org] |
Database tools | ||
sqlite3 [19] | SQLite database interface | [https://docs.python.org/2/library/sqlite3.html] |
PostgreSQL [20] | Drivers for PostgreSQL | [https://wiki.postgresql.org/wiki/Python] |
MySQL-Python [21] | MySQL interface | [http://mysql-python.sourceforge.net] |
HappyBase | Library to interact with Apache HBase | [http://happybase.readthedocs.org/en/latest/] |
NoSQL | List of NoSQL packages | [http://nosql-database.org] |
PyBrain | Modular machine learning library | [http://pybrain.org] |
ffnet | Feed-forward neural network | [http://ffnet.sourceforge.net] |
Disco | Framework for distributed computing based on the MapReduce paradigm | [http://discoproject.org] |
Hadoopy [22] | Wrapper for Hadoop using Cython | [http://www.hadoopy.com/en/latest/] |
Graph libraries | ||
NetworkX | Package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks | [http://networkx.github.io] |
igraph | Network analysis | [http://igraph.org] |
python-graph | Library for working with graphs | [https://code.google.com/p/python-graph/] |
pydot | Interface to Graphviz's Dot language [23] | [https://code.google.com/p/pydot/] |
graph-tool | Manipulation and statistical analysis of graphs (networks) | [http://graph-tool.skewed.de] |
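To give a feel for how these pieces fit together, here is a minimal sketch that uses only the `sqlite3` interface from Table 2 plus the standard library's `statistics` module, so it runs without installing anything. The table name `jobs`, the sample rows, and the helper functions are all hypothetical; in practice, a library such as Pandas or NumPy would likely take over once the data is loaded.

```python
import sqlite3
import statistics

# Hypothetical raw data: (user, job runtime in seconds).
RAW_ROWS = [
    ("alice", 120.0),
    ("alice", 180.0),
    ("bob", 60.0),
    ("bob", 90.0),
    ("bob", 150.0),
]


def load(rows):
    """Store the raw rows in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (user TEXT, runtime REAL)")
    conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
    conn.commit()
    return conn


def mean_runtime_by_user(conn):
    """Let the database do the grouping: average runtime per user."""
    cur = conn.execute(
        "SELECT user, AVG(runtime) FROM jobs GROUP BY user ORDER BY user"
    )
    return dict(cur.fetchall())


def runtime_spread(conn):
    """Pull the raw values into Python and compute mean and sample std. dev."""
    values = [row[0] for row in conn.execute("SELECT runtime FROM jobs")]
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    conn = load(RAW_ROWS)
    print(mean_runtime_by_user(conn))
    print(runtime_spread(conn))
```

The split is deliberate: aggregation that the database can do stays in SQL, and only the values needed for further statistics are pulled into Python.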
Julia is an up-and-coming language for HPC, but it is also drawing in DA researchers and practitioners. Julia is still a very young language; nonetheless, it has some very useful packages for data analytics (Table 3).
Table 3: Julia Packages | ||
---|---|---|
Software | Description | Source |
MLBase.jl | Functions to support the development of machine learning algorithms | [https://github.com/JuliaStats/MLBase.jl] |
StatsBase.jl | Basic statistics | [https://github.com/JuliaStats/StatsBase.jl] |
Distributions.jl | Probability distributions and associated functions | [https://github.com/JuliaStats/Distributions.jl] |
Optim.jl | Optimization functions | [https://github.com/JuliaOpt/Optim.jl] |
DataFrames.jl | Library for working with tabular data | [https://github.com/JuliaStats/DataFrames.jl] |
Gadfly.jl | Crafty statistical graphics | [https://github.com/dcjones/Gadfly.jl] |
PyPlot.jl | Interface to matplotlib [24] | [https://github.com/stevengj/PyPlot.jl] |
Matlab is a popular language in science and engineering, so it's natural for people to use it for data analytics. In general, a great deal of code is available for Matlab and Matlab-like applications (e.g., Octave and Scilab). Matlab and similar tools have data and matrix manipulation tools already built in, as well as graphics tools for plotting the results. You can write code in the respective languages of the different tools to create new functions and capabilities. The languages are reasonably close to each other, making portability easier than you might think, with the exception of graphical interfaces. Table 4 lists a few Matlab toolboxes from MathWorks and an open source toolbox for running parallel Matlab jobs. Octave and Scilab have similar functionality, but it might be spread across multiple toolboxes or come with the tool itself.
Table 4: Matlab Toolboxes | ||
---|---|---|
Software | Description | Source |
Statistics | Analyze and model data using statistics and machine learning | [http://www.mathworks.com/products/statistics/?s_cid=sol_des_sub2_relprod3_statistics_toolbox] |
Data Acquisition | Connect to data acquisition cards, devices, and modules | [http://www.mathworks.com/products/daq/] |
Image Processing | Image processing, analysis, and algorithm development | [http://www.mathworks.com/products/image/] |
Econometrics | Model and analyze financial and economic systems using statistical methods | [http://www.mathworks.com/products/econometrics/?s_cid=HP_FP_ML_EconometricsToolbox] |
System Identification | Linear and nonlinear dynamic system models from measured input-output data | [http://www.mathworks.com/products/sysid/] |
Database | Exchange data with relational databases | [http://www.mathworks.com/products/database/] |
Clustering and Data Analysis | Clustering and data analysis algorithms | [http://www.mathworks.com/matlabcentral/fileexchange/7473-clustering-and-data-analysis-toolbox] |
pMatlab [25] | Parallel Matlab toolbox | [http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html] |
These are just a few of the links to DA libraries, modules, add-ons, toolboxes, or what have you for languages that are increasingly popular in the DA world.