Is Hadoop the New HPC?
Apache Hadoop has been generating a lot of headlines lately. For those who are not aware, Hadoop is an open source project that provides a distributed filesystem and MapReduce framework for massive amounts of data. The primary hardware used for Hadoop is clusters of commodity servers. File sizes can easily be in the petabyte range and use hundreds or thousands of compute servers.
Hadoop also has many components that live on top of the core Hadoop filesystem (HDFS) and MapReduce mechanism. Interestingly, HPC and Hadoop clusters share some features, but how much crossover you will see between the two disciplines depends on the application. Hadoop strengths lie in the sheer size of data it can process and its high redundancy and toleration of node failures without halting user jobs.
Who Uses Hadoop
Many organizations use Hadoop on a daily basis, including Yahoo, Facebook, American Airlines, eBay, and others. Hadoop is designed to allow users to manipulate large unstructured or unrelated data sets. It is not intended to be a replacement for a RDMS. For example, Hadoop can be used to scan weblogs, on-line transaction data, or web content, all of which are growing each year.
Map Reduce
To many HPC users, MapReduce is a methodology used by Google to process large amounts of web data. Indeed, the now famous Google MapReduce paper was the inspiration for Hadoop.
The MapReduce idea is quite simple and, when used in parallel, can provide extremely powerful search and compute capabilities. Two major steps constitute the MapReduce process. If you have not figured it out, they are the “Map” step followed by a “Reduce” step. Some are surprised to learn that mapping is done all the time in the *nix world. For instance consider:
grep "the" file.txt
In this simple example, I am “mapping” all the occurrences, by line, of the word “the” in a text file. Although the task seems somewhat trivial, suppose the file was 1TB. How could I speed up the mapping step. The answer is also simple: Break the file into chunks and put a different chunk on a separate computer. The results can be combined when the job is finished because the map step has no dependencies. The popular mpiBLAST tool takes the same approach by breaking the human genome file into chunks and performing “BLAST” mapping on separate cluster nodes.
Suppose you want to calculate the total number of lines containing ‘the’. The simple answer is to pipe the results into wc (word count):
grep "the" file.txt | wc -l
You have just introduced a “Reduce” step. For the large-file parallel mode, each computer would perform the above step (grep and wc) and send the count to the master node. That, in a nutshell, is how MapReduce works – with, of course, a few more details, like key-value pairs and “the shuffle” – but for the purposes of this discussion, MapReduce can be that simple.
With Hadoop, large files are placed in HDFS, which automatically breaks the file into chunks and spreads them across the cluster (usually in a redundant fashion). In this way, parallelizing the Map process is trivial; all that needs to happen is to place a separate Map process on each node with the file chunk. The results are then sent to Reduce processes, which also run on the cluster. As you can imagine, large files produce large amounts of intermediate data: thus, multiple reducers help keep things moving. Several aspects to the MapReduce process worth noting:
- MapReduce can be transparently scalable. The user does not need to manage data placement or the number of nodes used for their job. The underlying hardware has no dependencies.
- Data flow is highly defined and in one direction from the Map to the Reduce, with no communication between independent mapper or reducer processes.
- Because processing is independent, failover is trivial. A failed process can be restarted, provided the underlying filesystem is redundant like HDFS.
- MapReduce, while powerful, does not fit all problem types.
To understand the difference between Hadoop and a typical HPC cluster, I’ll compare several aspects of both systems.
Hardware
Many modern HPC clusters and Hadoop clusters use commodity hardware, comprising primarily x86-based servers. Hadoop clusters usually include a large amount of local disk space (used for HDFS nodes), whereas many HPC clusters rely on NFS or a parallel filesystem for cluster-wide storage. HPC uses diskless and diskful nodes, but in terms of data storage, a separate group of hardware is often used for global file storage. HDFS daemons run on all nodes and store data chunks locally. It does not support the POSIX standard. Hadoop is designed to move the computation to the data; thus, HDFS needs to be distributed throughout the cluster.
In terms of networking, Hadoop clusters almost exclusively use gigabit Ethernet (GigE). As the price continues to fall, newer systems are starting to adopt 10GigE. Although, there are many GigE and 10GigE HPC clusters, InfiniBand is often the preferred network.
Many new HPC clusters are using some form of acceleration hardware on the nodes. These additions are primarily from NVidia (Kepler) and Intel (Phi). They require additional programming (in some cases) and can provide substantial speed-up for certain applications.
Resource Scheduling
One of the biggest differences between Hadoop and HPC systems is resource management. HPC requires fine-grained control of what resources (cores, accelerators, memory, time, etc.) are given to users. These resources are scheduled with tools like Grid Engine, Moab, LoadLeveler, and so on. Hadoop has an integrated scheduler that consists of a master Job Tracker, which communicates with Task Trackers on the nodes. All MapReduce work is supervised by the Job Tracker. No other job types are supported in Hadoop (Version 1).
One interesting difference between an HPC resource scheduler and the Hadoop Task Tracker is fault tolerance. HPC schedulers can detect down nodes and reschedule jobs (as an option), but if the job has not been checkpointing, it must start from the beginning. Hadoop, because of the nature of the MapReduce algorithm, can manage failure through the Job Tracker. Because the Task Tracker is aware of job placement and data location, a failed node (or even a rack of nodes) can be managed at run time. Thus, when an HDFS node fails, the Job Tracker can reassign a task to a node where a redundant copy of the data exists. Similarly, if a Map or Reduce process fails, the job can be restarted on a new node.
The next-generation scheduler for Hadoop is called YARN (Yet Another Resource Negotiator) and offers better scalability and more fine-grained control over job scheduling. Users can request “containers” for MapReduce and other jobs (possibly MPI), which are managed by individual per-job Application Masters. With YARN, the Hadoop scheduler starts to look like other resource managers; however, it will be backward compatible with many higher level Hadoop tools.
Programming
One of the big differences between Hadoop and HPC is programming models. Most HPC applications are written in Fortran, C, or C++, with the aid of MPI libraries, as well as CUDA-based applications and those optimized for Intel Phi. The responsibility of the users is actually quite large. Application authors must manage communications, synchronization, I/O, debugging, and possibly checkpointing/restart operations. These tasks often are not easy to get right and can take significant time to implement correctly and efficiently.
Hadoop, by offering the MapReduce paradigm, only requires that the user create a Map step and Reduce step (and possibly some others, i.e., a combiner). These tasks are devoid of all the minutia of HPC programming. Users only need concern themselves with these two tasks, which can be debugged and tested easily using small files on a single system. Hadoop also presents a single-namespace parallel file system (HDFS) to the user. Hadoop was written in Java and has a low-level interface to write and run MapReduce applications, but it also supports an interface (Streams) that allows mappers and reducers to be written in any language. Above these language interfaces sit many high-level tools, such as Apache Pig, a scripting language for Hadoop MapReduce, and Apache Hive, a SQL-like interface to Hadoop MapReduce. Many users operate using these and other higher level tools and might never actually write mappers and reducers. This situation is analogous to application users in HPC that never write MPI code.
Parallel Computing Model
MapReduce can be classified as a SIMD (single-instruction, multiple-data) problem. Indeed, the map step is highly scalable because the same instructions are carried out over all data. Parallelism arises by breaking the data into independent parts with no forward or backward dependencies (side effects) within a Map step; that is, the Map step may not change any data (even its own). The reducer step is similar, in that it applies the same reduction process to a different set of data (the results of the Map step).
In general, the MapReduce model provides a functional, rather than procedural, programing model. Similar to a functional language, MapReduce cannot change the input data as part of the mapper or reducer process, which is usually a large file. Such restrictions can at first be seen as inefficient; however, the lack of side effects allows for easy scalability and redundancy.
An HPC cluster, on the other hand, can run SIMD and MIMD (multiple-instruction, multiple-data) jobs. The programmer determines how to execute the parallel algorithm. As noted above, this added flexibility comes with addition responsibilities. Users, however, are not restricted when creating their own MapReduce application within the framework of a typical HPC cluster.
Big Data Needs Big Solutions
Without a doubt, Hadoop is useful when analyzing very large data files. HPC has no shortage of “big data” files, and Hadoop has seen crossover into some technical computing areas: BioPig extends Apache Pig with sequence analysis capability, and MR-MSPolygraph is a MapReduce implementation of a hybrid spectral library–database search method for large-scale peptide identification. In the case of MR-MSPolygraph, results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours when processing tens of thousands of experimental spectra. Other applications include Protein sequencing and linear algebra.
Provided your problem fits into the MapReduce framework, Hadoop is a powerful way to operate on staggeringly large data sets. Because both the Map and Reduce steps are user defined, highly complex operations can be encapsulated in these steps. Indeed, you encounter no hard requirements for a reducer step if all your work can be done in the Map step.
The growth of Hadoop and the hardware on which it runs has been increasing. Certainly it can be seen as a subset of HPC, offering a single yet powerful algorithm that has been optimized for a large number of commodity servers, with some crossover even into technical computing that could see further growth as things like YARN begin to give existing Hadoop clusters more HPC capabilities. Many companies are finding Hadoop to be the new Corporate HPC for big data.