OpenStack Sahara brings Hadoop as a Service

Computing Machine

Hardware for Sahara

Anyone who wants to offer Hadoop as a Service needs powerful CPUs, a generous helping of RAM, and, ideally, fast 10Gb network cards. However, this alone is still not enough; Hadoop is only really fast when it can use fast local storage.

As a reminder, the default configuration of OpenStack places persistent VMs on storage that is connected in the background via iSCSI. This approach is not very elegant from a technical point of view and, worse, it is very slow. Alternatives such as Ceph offer high throughput. What most of the alternatives have in common, however, is fairly high latency, because the packets always have to traverse the network.
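Why high throughput and high latency can coexist is easiest to see with a small calculation. The following sketch is a simplified model, not a benchmark: It assumes a single stream of synchronous requests, where each request pays a fixed latency before its data is transferred. The function name and the example values are illustrative assumptions, not measurements.

```python
def effective_throughput(block_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second achieved by one stream of synchronous requests.

    Simplified model: each request costs a fixed latency plus the time
    needed to move the block at the full link or disk bandwidth.
    """
    if block_bytes <= 0 or bandwidth_bytes_per_s <= 0 or latency_s < 0:
        raise ValueError("block size and bandwidth must be positive, latency non-negative")
    time_per_request = latency_s + block_bytes / bandwidth_bytes_per_s
    return block_bytes / time_per_request


if __name__ == "__main__":
    # Assumed values for illustration only: the same raw bandwidth,
    # but a longer round trip for the storage reached over the network.
    bandwidth = 1.25e9  # 10 Gbit/s expressed in bytes per second
    for label, latency in (("network", 1e-3), ("local", 1e-4)):
        rate = effective_throughput(4096, latency, bandwidth)
        print(f"{label}: {rate / 1e6:.1f} MB/s for 4 KiB synchronous requests")
```

In this model, small requests are dominated by latency, so the raw bandwidth of the link matters far less than the length of the round trip.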

Local storage helps. If the VM uses the local storage of the host on which it runs, the detour via the network is eliminated. Before the most recent OpenStack release (Kilo), however, OpenStack could not take the location of a VM into account when creating storage with Cinder. Administrators therefore had to choose between running a VM on persistent network storage and running it on the local storage of an individual hypervisor, which was not persistent.

In Kilo, the developers retrofitted a long-desired function from which Sahara also benefits: It is now possible to specify that Cinder should create a volume on the host on which the virtual machine runs. The storage operator might then have to take care of high availability separately, but that should be possible with a detour via DRBD9 [5], for example.
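The placement decision described here can be pictured as a filter step in a scheduler. The following sketch is a toy model under stated assumptions and is not Cinder's code: The `Backend` class, the backend names, and the functions `local_backends` and `pick_backend` are hypothetical, and nothing in it talks to OpenStack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical description of one storage backend."""
    name: str     # made-up label for the backend
    host: str     # hypervisor host that holds the backend's disks
    free_gb: int  # free capacity in gigabytes


def local_backends(backends, instance_host, size_gb):
    """Keep only backends on the VM's host that can hold the volume."""
    return [b for b in backends
            if b.host == instance_host and b.free_gb >= size_gb]


def pick_backend(backends, instance_host, size_gb):
    """Choose the local backend with the most free space."""
    candidates = local_backends(backends, instance_host, size_gb)
    if not candidates:
        raise LookupError("no backend with enough space on " + instance_host)
    return max(candidates, key=lambda b: b.free_gb)


if __name__ == "__main__":
    pool = [
        Backend("lvm-a", "node1", 500),
        Backend("lvm-b", "node2", 900),
        Backend("lvm-c", "node1", 200),
    ]
    # The VM runs on node1, so only backends on node1 are considered,
    # even though node2 has more free space.
    print(pick_backend(pool, "node1", 100).name)
```

The point of the model is the order of the checks: Locality is a hard requirement, and capacity only decides among the backends that already satisfy it.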

Conclusions

Cloud and Big Data are a natural fit. After all, it was precisely the large HPC setups that first heralded the triumph of cloud computing. Having providers make large amounts of resources available so that customers can operate Hadoop dynamically and flexibly is certainly a coherent approach. The cloud particularly offers the advantage that the customer can tap into a production environment immediately, instead of first having to deploy a hardware zoo in their racks.

Fortunately, Sahara developers have solved many problems from earlier times. That a VM can now be started on the very host where the volume provisioned by Cinder is located makes Hadoop practical in the first place, because Hadoop only works well with fast storage.

However, one big drawback remains: Currently, only a few providers operate publicly accessible OpenStack clouds, and those that do rarely support Hadoop and do not offer Sahara. Apart from a DIY cloud, you have virtually no option for using Hadoop's functionality in everyday life, even though that functionality is very useful. This is a shame, because a provider that added Sahara to its portfolio could probably count on a multitude of customers rushing to sign up.

The Author

Martin Gerhard Loschwitz works as a cloud architect at SysEleven. He works with OpenStack, distributed storage, and Puppet. He also maintains Pacemaker for Debian in his spare time.


