Is Hadoop the new HPC?

Where Worlds Collide

Programming

One of the big differences between Hadoop and HPC involves programming models. Most HPC applications are written in Fortran, C, or C++ and use MPI libraries for communication; some also rely on CUDA or are optimized for the Intel Xeon Phi. The burden on the programmer is considerable.

Application authors must manage communications, I/O, synchronization, debugging, and possibly checkpointing/restart operations. These tasks can be difficult to get right and can take significant time to implement correctly and efficiently.

Hadoop, by offering the MapReduce paradigm, only requires the user to create a Map step and a Reduce step (and possibly some others, e.g., a combiner). These tasks are devoid of all the minutiae of HPC programming. Users need only concern themselves with these two tasks, which can be debugged and tested easily using small files on a single system.
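The division of labor described above can be sketched in a few lines of Python and tried on a handful of strings on a single machine. This is only an illustration: the names map_step, combine_step, reduce_step, and run_local are hypothetical and are not part of any Hadoop API, and run_local merely stands in for the grouping that the framework would perform between the two steps.

```python
from collections import defaultdict

def map_step(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def combine_step(pairs):
    """Optional combiner: pre-sum counts locally before grouping."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reduce_step(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)

def run_local(lines):
    """Hypothetical stand-in for the framework: map, combine, group, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in combine_step(map_step(line)):
            grouped[word].append(count)
    return dict(reduce_step(word, counts) for word, counts in grouped.items())

# A tiny in-memory "file" is enough to exercise the logic.
sample = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = run_local(sample)
print(word_counts)
```

Only map_step and reduce_step (and, optionally, combine_step) carry application logic; everything else is plumbing the user would not normally write.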

Hadoop also presents a single-namespace parallel filesystem (HDFS) to the user. Hadoop was written in Java and provides a low-level Java interface for writing and running MapReduce applications, but it also supports an interface (Hadoop Streaming) that allows mappers and reducers to be written in any language. Above these language interfaces sit many high-level tools, such as Apache Pig, a scripting language for Hadoop MapReduce, and Apache Hive, a SQL-like interface to Hadoop MapReduce. Many users work with these and other higher-level tools and might never actually write mappers and reducers. This situation is analogous to application users in HPC who never write MPI code.
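With the language-neutral interface, Hadoop Streaming, mappers and reducers exchange records as lines of text on standard input and output, by default with a tab separating key and value. The following Python sketch mimics that contract with plain generators so it can be tried without a cluster. The function names are illustrative, and in a real job each function would live in its own script reading from sys.stdin.

```python
from itertools import groupby

def mapper(lines):
    """Turn raw text lines into tab-separated "word<TAB>1" records."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_records):
    """Sum counts per key; assumes records arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorted() plays the role of the sort between the two phases.
text = ["big data big clusters", "big jobs"]
streaming_output = list(reducer(sorted(mapper(text))))
print(streaming_output)
```

Because the reducer relies only on sorted input, the same logic can be checked locally by piping a small file through the mapper, a sort, and the reducer.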

Parallel Computing Model

MapReduce can be classified as a single-instruction, multiple-data (SIMD) problem. Indeed, the map step is highly scalable because the same instructions are carried out over all data. Parallelism arises by breaking the data into independent parts with no forward or backward dependencies (side effects) within a Map step; that is, the Map step may not change any data (even its own). The reducer step is similar, in that it applies the same reduction process to a different set of data (the results of the Map step).
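A small Python sketch can illustrate why independent splits parallelize so easily. It assumes a thread pool as a stand-in for cluster nodes (Python threads illustrate the independence here, not real speedup), and the names map_split and reduce_results are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_split(split):
    """Same instructions for every split: return new values, never mutate input."""
    return [len(record) for record in split]

def reduce_results(left, right):
    """Associative reduction: add two partial sums."""
    return left + right

# Immutable tuples make it explicit that the map step cannot change its data.
splits = (("alpha", "beta"), ("gamma",), ("delta", "epsilon", "zeta"))

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each split is processed independently, so the order of execution is irrelevant.
    mapped = list(pool.map(map_split, splits))

partial_sums = [sum(lengths) for lengths in mapped]
total = reduce(reduce_results, partial_sums, 0)
print(total)
```

Because no split depends on another, a failed or slow split could simply be rerun elsewhere without affecting the rest of the job.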

In general, the MapReduce model provides a functional, rather than procedural, programming model. As in a functional language, the mapper and reducer processes cannot change their input data, which is usually a large file. Such restrictions might at first seem inefficient; however, the lack of side effects allows for easy scalability and redundancy.

An HPC cluster, on the other hand, can run SIMD and MIMD (multiple-instruction, multiple-data) jobs. The programmer determines how to execute the parallel algorithm. As noted above, this added flexibility comes with additional responsibilities. Nothing, however, prevents users from creating their own MapReduce applications within the framework of a typical HPC cluster.

Big Data Needs Big Solutions

Without a doubt, Hadoop is useful when analyzing very large data files. HPC has no shortage of "big data" files, and Hadoop has seen crossover into some technical computing areas: BioPig [3] extends Apache Pig with sequence analysis capability, and MR-MSPolygraph [4] is a MapReduce implementation of a hybrid spectral library/database search method for large-scale peptide identification. Results show that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours when processing tens of thousands of experimental spectra. Other applications include protein sequencing and linear algebra.

Provided your problem fits into the MapReduce framework, Hadoop is a powerful way to operate on staggeringly large data sets. Because both the Map and Reduce steps are user defined, highly complex operations can be encapsulated in these steps. Indeed, a reducer step is not strictly required if all your work can be done in the map step.
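As a minimal illustration of a job that needs no reducer, consider filtering records: every input line is either emitted or dropped, and nothing has to be aggregated afterward. The function name map_only and the "ERROR" keyword in this Python sketch are assumptions made for the example.

```python
def map_only(lines, keyword="ERROR"):
    """Emit only the lines containing the keyword; no reduce step follows."""
    for line in lines:
        if keyword in line:
            yield line.rstrip("\n")

log = ["INFO start", "ERROR disk full", "INFO retry", "ERROR timeout"]
matches = list(map_only(log))
print(matches)
```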

Hadoop and the hardware on which it runs continue to grow. Certainly it can be seen as a subset of HPC, offering a single yet powerful algorithm that has been optimized for a large number of commodity servers, with some crossover even into technical computing that could see further growth. Many companies are finding Hadoop to be the new corporate HPC for big data.
