Elephants in the Clouds

Over the past one year, there have been a lot of new product / project announcements related to running Hadoop in Cloud environments. While Amazon’s Elastic MapReduce continued with enhancements over its base platform, players like Qubole, Mirantis, VMWare, Rackspace all announced product or service offerings marrying the elephant with the cloud. Projects like Apache Whirr, Netflix’s Genie started getting noticed. To me, it has become evident that Hadoop in the cloud is a trending topic. This post explores some of the reasons why this association makes sense, or why customers are seeing increased value in this model.

Why Hadoop on the Cloud makes sense

Usual Suspects

Running Hadoop on the cloud makes sense for similar reasons as running any other software offering on the cloud. For companies still testing waters with Hadoop, the low capacity investment in the cloud is a no-brainer. The cloud also makes sense for a quick, one time use case involving BigData computation. As early as in 2007, New York Times used the power of Amazon EC2 instances and Hadoop for just one day to do a one time conversion of TIFF documents to PDFs in a digitisation effort. Procuring scalable compute resources on demand is appealing as well.

Beating The Red Tape

The last point above of quick resource procurement needs some elaboration. Hadoop and platforms it was inspired from made the vision of linear storage and compute using commodity hardware a reality. Internet giants like Google, who always operated at web-scale, knew that there would be a need for running on more and more hardware resources. They invested in building this hardware themselves.

In the enterprises though, this was not necessarily an option. Hadoop adoption in some enterprises grew organically from a few departments running tens of nodes to a consolidated medium sized or large cluster with a few hundreds of nodes. Usually such clusters started getting managed by a ‘data platform’ team different from the Infrastructure team, the latter being responsible for procurement and management of the hardware in the data centre. As the analytics demand within enterprises grew, the need to expand the capacity of the Hadoop clusters also grew. The data platform teams started hitting a bottleneck of a different kind. While the software itself had proven its capability of handling linear scale, the time it took for hardware to materialise in the cluster due to IT policies varied from several weeks to several months, stifling innovation and growth. It became evident that ‘throwing more hardware at Hadoop’ wasn’t as easy or fast as it should be.

The cloud, with its promise of instant access to hardware resources, is very appealing to leaders who wish to provide a platform that scales fast to meet growing business needs.

Handling Irregular Workloads

Being a batch oriented system, typical usage patterns of Hadoop involve scheduled jobs processing new incoming data on a fixed, temporal  basis. The load on compute resources of a Hadoop cluster varies based on the timings of these scheduled runs or rate of incoming data.  A fixed capacity Hadoop cluster built on physical machines is always on whether it is used or not – consuming power, leased space, etc. and incurring cost.

The cloud, with its pay as you use model, is more efficient to handle such irregular workloads. Given predictability in the usage patterns, one can optimise even further by having clusters of suitable sizes available at the right time for jobs to run.

Handling Variable Resource Requirements

Not all Hadoop jobs are created equal. While some of them require more compute resources, some require more memory, and some others require a lot of I/O bandwidth. Usually, a physical Hadoop cluster is built of homogenous machines, usually large enough to handle the largest job.

The default Hadoop scheduler has no solution for managing this diversity in Hadoop jobs, causing sub-optimal results. For example, a job whose tasks require more memory than average could affect tasks of other jobs that run on the same slave node due to a drain on system resources. Advanced Hadoop schedulers like the Capacity Scheduler and Fairshare Scheduler have tried to address the case of managing heterogenous workloads on homogenous resources using sophisticated scheduling techniques. For e.g. the Capacity Scheduler supported the notion of ‘High RAM’ jobs – jobs that require more memory on average. Such support is becoming more prevalent with Apache YARN, where the notion of a resource is being more comprehensively defined and handled. However, these solutions are still not as widely adopted as the stable Hadoop 1.0 solutions that do not have this level of support.

Cloud solutions meanwhile already offer a choice to the end user to provision clusters with different types of machines for different types of workloads. Intuitively, this seems like a much easier solution for the problem of handling variable resource requirements.

Simplifying Multi-tenancy Requirements

As cluster consolidation happens in the enterprise, one thing that gets lost out is the isolation of resources for different sets of users. As all user jobs get bunched up in a shared cluster, administrators of the cluster start to deal with multi-tenancy issues like user jobs interfering with one another, varied security constraints etc.

The typical solution to this problem has been to enforce very restrictive cluster level policies / limits that prevent users from doing anything harmful to other users jobs. The problem with this approach is that valid use cases of users are also not solved. For instance, it is common for administrators to lockdown the amount of memory Hadoop tasks can run with. If a user genuinely requires more memory, he / she has no support from the system.

Using the cloud, one can provision different types of clusters with different characteristics and configurations, each suitable for a particular set of jobs. While this frees administrators from having to manage complicated policies for a multi-tenant environment, it enables users to use the right configuration for their jobs.

Running Closer to the Data

As businesses move their services to the cloud, it follows that data starts living on the cloud. And as analytics thrives on data, and typically large volumes of it, it makes no sense for analytical platforms to exist outside the cloud leading to inefficient, time consuming migration of this data from source to the analytics clusters.

Running Hadoop clusters in the same cloud environment is an obvious solution to this problem. This is, in a way, applying Hadoop’s principle of data locality at the macro level.

These are some of the reasons why Hadoop clusters on the cloud make sense, and how they open up new possibilities for platform providers, administrators and end users.

The other side of the coin

While the above are some good reasons for considering the cloud, it is not so that one can just flip over and start using the new environment. There are some specific issues that make running Hadoop jobs on the cloud more challenging than other common workloads. I hope to cover these in an upcoming blog post.

Tagged with: ,
Posted in Hadoop, Technical

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: