Is Hadoop the new HPC?
Where Worlds Collide
Apache Hadoop [1] has been generating a lot of headlines lately. For those who are not aware, Hadoop is an open source project that provides a distributed filesystem and MapReduce framework for massive amounts of data. The primary hardware used for Hadoop comprises clusters of commodity servers. File sizes can easily be in the petabyte range and use hundreds or thousands of compute servers.
Hadoop also has many components that live on top of the core Hadoop filesystem (HDFS) and MapReduce mechanism. Interestingly, high-performance computing (HPC) and Hadoop clusters share some features, but how much crossover you will see between the two disciplines depends on the application. Hadoop's strengths lie in the sheer size of data it can process and its high redundancy and toleration of node failures without halting user jobs.
Many organizations use Hadoop on a daily basis, including Yahoo!, Facebook, American Airlines, eBay, and others. Hadoop is designed to allow users to manipulate large unstructured or unrelated data sets. It is not intended to be a replacement for a relational database management system (RDMS). For example, Hadoop can be used to scan weblogs, online transaction data, or web content, all of which are growing each year.
MapReduce
To many HPC users, MapReduce is a methodology used by Google to process large amounts of web data. Indeed, the now famous Google MapReduce paper [2] was the inspiration for Hadoop.
The MapReduce idea is quite simple and, when used in parallel, can provide extremely powerful search and compute capabilities. Two major steps constitute the MapReduce process. If you have not figured it out, they are the "Map" step followed by a "Reduce" step. Some people are surprised to learn that mapping is done all the time in the *nix world. For example, consider:
grep "the" file.txt
In this simple example, I am "mapping" all the occurrences, by line, of the word "the" in a text file. Although the task seems somewhat trivial, suppose the file was 1TB. How could I speed up the mapping step? The answer is also simple: Break the file into chunks and put a different chunk on a separate computer. The results can be combined when the job is finished because the map step has no dependencies. The popular mpiBLAST tool takes the same approach by breaking the human genome file into chunks and performing "BLAST" mapping on separate cluster nodes.
Suppose you want to calculate the total number of lines containing "the"; a simple approach is to pipe the results into wc
(word count):
grep "the" file.txt | wc -l
You have just introduced a "Reduce" step. For the large-file parallel mode, each computer would perform this step (grep
and wc
) and send the count to the master node. That, in a nutshell, is how MapReduce works, with, of course, a few more details, like key-value pairs and "the shuffle." But, for the purposes of this discussion, MapReduce can be that simple.
With Hadoop, large files are placed in HDFS, which automatically breaks the file into chunks and spreads them across the cluster (usually in a redundant fashion). In this way, parallelizing the Map process is trivial; all that needs to happen is to place a separate Map process on each node with the file chunk. The results are then sent to Reduce processes, which also run on the cluster. As you can imagine, large files produce large amounts of intermediate data; thus, multiple reducers help keep things moving. Several aspects to the MapReduce process are worth noting:
- MapReduce can be transparently scalable. The user does not need to manage data placement or the number of nodes used for their job. The underlying hardware has no dependencies.
- Data flow is highly defined and in one direction from the Map to the Reduce, with no communication between independent mapper or reducer processes.
- Because processing is independent, failover is trivial. A failed process can be restarted, provided that the underlying filesystem is redundant – like HDFS.
MapReduce, while powerful, does not fit all problem types. To understand the difference between Hadoop and a typical HPC cluster, I'll compare several aspects of both systems.
Hardware
Many modern HPC clusters and Hadoop clusters use commodity hardware, comprising primarily x86-based servers. Hadoop clusters usually include a large amount of local disk space (used for HDFS nodes), whereas many HPC clusters rely on NFS or a parallel filesystem for cluster-wide storage.
HPC uses diskless and diskful nodes, but in terms of data storage, a separate group of hardware is often used for global file storage. HDFS daemons run on all nodes and store data chunks locally. It does not support the POSIX standard. Hadoop is designed to move the computation to the data; thus, HDFS must be distributed throughout the cluster.
In terms of networking, Hadoop clusters almost exclusively use gigabit Ethernet (GigE). As the price continues to fall, newer systems are starting to adopt 10GigE. Although, there are many GigE and 10GigE HPC clusters, InfiniBand is often the preferred network.
Many new HPC clusters are using some form of acceleration hardware on the nodes. These additions are primarily from NVidia (Kepler) and Intel (Phi). They require additional programming (in some cases) and can provide substantial speed increases for certain applications.
Resource Scheduling
One of the biggest differences between Hadoop and HPC systems is resource management. HPC requires fine-grained control of which resources (cores, accelerators, memory, time, etc.) are given to users. These resources are scheduled with tools like Grid Engine, Moab, LoadLeveler, and the like. Hadoop has an integrated scheduler consisting of a master Job Tracker, which communicates with Task Trackers on the nodes. All MapReduce work is supervised by the Job Tracker. No other job types are supported in Hadoop (Version 1).
One interesting difference between an HPC resource scheduler and the Hadoop Task Tracker is fault tolerance. HPC schedulers can detect down nodes and reschedule jobs (as an option), but if the job has not been checkpointing, it must start from the beginning. Hadoop, because of the nature of the MapReduce algorithm, can manage failure through the Job Tracker.
Because the Task Tracker is aware of job placement and data location, a failed node (or even a rack of nodes) can be managed at run time. Thus, when an HDFS node fails, the Job Tracker can reassign a task to a node where a redundant copy of the data exists. Similarly, if a Map or Reduce process fails, the job can be restarted on a new node.
The next-generation scheduler for Hadoop is called YARN (Yet Another Resource Negotiator) and offers better scalability and more fine-grained control over job scheduling. Users can request "containers" for MapReduce and other jobs (possibly MPI), which are managed by individual per-job Application Masters. With YARN, the Hadoop scheduler starts to look like other resource managers; however, it will be backward compatible with many higher level Hadoop tools.
Buy ADMIN Magazine
Subscribe to our ADMIN Newsletters
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Most Popular
Support Our Work
ADMIN content is made possible with support from readers like you. Please consider contributing when you've found an article to be beneficial.