Is Hadoop the new HPC?

Where Worlds Collide

Apache Hadoop [1] has been generating a lot of headlines lately. For those who are not aware, Hadoop is an open source project that provides a distributed filesystem and MapReduce framework for massive amounts of data. The primary hardware used for Hadoop comprises clusters of commodity servers. File sizes can easily be in the petabyte range and use hundreds or thousands of compute servers.

Hadoop also has many components that live on top of the core Hadoop filesystem (HDFS) and MapReduce mechanism. Interestingly, high-performance computing (HPC) and Hadoop clusters share some features, but how much crossover you will see between the two disciplines depends on the application. Hadoop's strengths lie in the sheer size of data it can process and its high redundancy and toleration of node failures without halting user jobs.

Many organizations use Hadoop on a daily basis, including Yahoo!, Facebook, American Airlines, eBay, and others. Hadoop is designed to allow users to manipulate large unstructured or unrelated data sets. It is not intended to be a replacement for a relational database management system (RDMS). For example, Hadoop can be used to scan weblogs, online transaction data, or web content, all of which are growing each year.

MapReduce

To many HPC users, MapReduce is a methodology used by Google to process large amounts of web data. Indeed, the now famous Google MapReduce paper [2] was the inspiration for Hadoop.

The MapReduce idea is quite simple and, when used in parallel, can provide extremely powerful search and compute capabilities. Two major steps constitute the MapReduce process. If you have not figured it out, they are the "Map" step followed by a "Reduce" step. Some people are surprised to learn that mapping is done all

...

Use Express-Checkout link below to read the full article (PDF).