Is Hadoop the new HPC?
Where Worlds Collide
Programming
One of the big differences between Hadoop and HPC involves programming models. Most HPC applications are written in Fortran, C, or C++ with the aid of MPI libraries; others are built on CUDA or optimized for the Intel Xeon Phi. In every case, a large share of the responsibility falls on the user.
Application authors must manage communications, I/O, synchronization, debugging, and possibly checkpointing/restart operations. These tasks can be difficult to get right and can take significant time to implement correctly and efficiently.
Hadoop, by offering the MapReduce paradigm, requires the user to create only a Map step and a Reduce step (and possibly some others, e.g., a combiner). These steps are free of the minutiae of HPC programming. Users need only concern themselves with these two tasks, which can be debugged and tested easily using small files on a single system.
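To make the two-step idea concrete, the following is a minimal single-process sketch of a word count, not Hadoop's own API. The function names (`map_step`, `shuffle`, `reduce_step`, `run_job`) are hypothetical, and the `shuffle` helper only stands in for the grouping that the framework would perform between the Map and Reduce steps.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key (a stand-in for what the framework does)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(word, counts):
    """Reduce: collapse all counts for one word into a single total."""
    return word, sum(counts)


def run_job(lines):
    """Run Map, shuffle, and Reduce in one process on a small input."""
    pairs = (pair for line in lines for pair in map_step(line))
    return dict(reduce_step(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

Because the user-written pieces are only `map_step` and `reduce_step`, they can be exercised on a few lines of text like this before any cluster is involved.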
Hadoop also presents a single-namespace parallel filesystem (HDFS) to the user. Hadoop is written in Java and has a low-level interface for writing and running MapReduce applications, but it also supports an interface (Hadoop Streaming) that allows mappers and reducers to be written in any language. Above these language interfaces sit many high-level tools, such as Apache Pig, a scripting language for Hadoop MapReduce, and Apache Hive, a SQL-like interface to Hadoop MapReduce. Many users work with these and other higher-level tools and might never actually write mappers and reducers. This situation is analogous to application users in HPC who never write MPI code.
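As an illustration of the any-language interface, the sketch below follows the Hadoop Streaming convention: the mapper reads lines of text and writes tab-separated key/value lines, and the reducer reads those lines back sorted by key. The function names and the `simulate` helper are hypothetical; `simulate` merely imitates the sort that the framework performs between the two steps so the pair can be tried locally.

```python
import io
from itertools import groupby


def streaming_mapper(src, dst):
    """Read text lines from src; write one 'word<TAB>1' line per word."""
    for line in src:
        for word in line.split():
            dst.write(f"{word.lower()}\t1\n")


def streaming_reducer(src, dst):
    """Read key-sorted 'word<TAB>count' lines; write one total per word."""
    rows = (line.rstrip("\n").split("\t", 1) for line in src if line.strip())
    for word, group in groupby(rows, key=lambda row: row[0]):
        total = sum(int(count) for _, count in group)
        dst.write(f"{word}\t{total}\n")


def simulate(text):
    """Local stand-in for a streaming job: map, sort by key, then reduce."""
    mapped = io.StringIO()
    streaming_mapper(io.StringIO(text), mapped)
    sorted_lines = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    streaming_reducer(iter(sorted_lines), reduced)
    return reduced.getvalue()


if __name__ == "__main__":
    print(simulate("Big data\nbig cluster\n"), end="")
```

In an actual streaming job, each function would live in its own script and be called with `sys.stdin` and `sys.stdout` instead of in-memory buffers.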
Parallel Computing Model
MapReduce can be classified as a single-instruction, multiple-data (SIMD) problem. Indeed, the Map step is highly scalable because the same instructions are carried out over all data. Parallelism arises by breaking the data into independent parts with no forward or backward dependencies (side effects) within a Map step; that is, the Map step may not change any data (even its own). The Reduce step is similar, in that it applies the same reduction process to a different set of data (the results of the Map step).
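The data-parallel pattern described above can be sketched as follows. This is an illustration only: a thread pool stands in for cluster nodes, the sum-of-squares workload is an arbitrary example, and all names are hypothetical. The point is that every chunk receives the same instructions and no chunk depends on another.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Break the input into independent, contiguous parts."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Same instructions for every chunk; reads only its own data."""
    return [x * x for x in chunk]


def reduce_results(partials):
    """Apply one reduction to the collected Map output."""
    return sum(sum(part) for part in partials)


def parallel_sum_of_squares(records, n_workers=4):
    """Map chunks concurrently, then reduce the partial results."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_results(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

Because `map_chunk` never touches shared state, the chunks could be processed in any order, or on different machines, without changing the result.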
In general, the MapReduce model provides a functional, rather than procedural, programming model. As in a functional language, the mapper and reducer processes cannot change their input data, which is usually a large file. Such restrictions can at first seem inefficient; however, the lack of side effects allows for easy scalability and redundancy.
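One way to see why the lack of side effects helps with redundancy is the small sketch below. It is a simulation under stated assumptions, not how Hadoop schedules tasks: `run_with_retry` and its `fail_first` parameter are hypothetical and simply pretend that a number of attempts were lost before one succeeds.

```python
def map_task(chunk):
    """Pure function: builds new output and leaves its input untouched."""
    return tuple(value.upper() for value in chunk)


def run_with_retry(task, chunk, fail_first=0):
    """Run a task, pretending the first `fail_first` attempts were lost.

    Returns the task output and the number of attempts used.
    """
    attempts = 0
    while True:
        attempts += 1
        if attempts <= fail_first:
            # Simulated node failure: the input was never modified,
            # so there is nothing to roll back before trying again.
            continue
        return task(chunk), attempts


if __name__ == "__main__":
    data = ("alpha", "beta")
    print(run_with_retry(map_task, data, fail_first=2))
    print(data)  # the input is unchanged after the retries
```

Since the task reads its input without modifying it, a rerun on the same chunk produces the same output, so a lost task can simply be scheduled again elsewhere.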
An HPC cluster, on the other hand, can run both SIMD and MIMD (multiple-instruction, multiple-data) jobs. The programmer determines how to execute the parallel algorithm. As noted above, this added flexibility comes with additional responsibilities. Nothing, however, prevents users from creating their own MapReduce application within the framework of a typical HPC cluster.
Big Data Needs Big Solutions
Without a doubt, Hadoop is useful when analyzing very large data files. HPC has no shortage of "big data" files, and Hadoop has seen crossover into some technical computing areas: BioPig [3] extends Apache Pig with sequence analysis capability, and MR-MSPolygraph [4] is a MapReduce implementation of a hybrid spectral library/database search method for large-scale peptide identification. Results show that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours when processing tens of thousands of experimental spectra. Other applications include protein sequencing and linear algebra.
Provided your problem fits into the MapReduce framework, Hadoop is a powerful way to operate on staggeringly large data sets. Because both the Map and Reduce steps are user defined, highly complex operations can be encapsulated in these steps. Indeed, a Reduce step is not strictly required if all of your work can be done in the Map step.
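A job with no Reduce step can be as simple as a filter. The sketch below is a hypothetical example (the function name and the log lines are made up for illustration): each record is examined independently, and the Map output is already the final result.

```python
def map_only_filter(lines, keyword):
    """Map-only job: keep the lines that contain `keyword`.

    Each line is judged on its own, so no grouping or reduction
    is needed afterward.
    """
    for line in lines:
        if keyword in line:
            yield line


if __name__ == "__main__":
    log = ["ERROR disk full", "INFO job started", "ERROR network down"]
    for hit in map_only_filter(log, "ERROR"):
        print(hit)
```

In Hadoop itself, this kind of job is typically expressed by setting the number of reduce tasks to zero, so the Map output is written out directly.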
Hadoop, and the hardware on which it runs, continues to grow. Certainly it can be seen as a subset of HPC, offering a single yet powerful algorithm that has been optimized for a large number of commodity servers, with some crossover into technical computing that could see further growth. Many companies are finding Hadoop to be the new corporate HPC for big data.
Infos
- [1] Apache Hadoop: http://hadoop.apache.org/
- [2] MapReduce: Simplified Data Processing on Large Clusters: http://research.google.com/archive/mapreduce.html
- [3] BioPig: http://www.osti.gov/bridge/product.biblio.jsp?osti_id=1050659
- [4] MR-MSPolygraph: http://compbio.eecs.wsu.edu/MR-MSPolygraph/