An overview of Impala

As enterprises move to Hadoop-based data solutions, a key pattern seen is that ‘big data’ processing happens in Hadoop land and the resultant derived datasets (such as fine-grained aggregates) are migrated into a traditional data warehouse for further consumption. The reason this pattern exists

A couple of package management tricks

A few weeks back, I spent some time trying out Cloudera Impala. The target cluster was a 15-node cluster of CentOS VMs running CDH 4.5. The idea of the trial was to gain experience installing and running Impala on a dataset for a

Comparing Apache Tez and Microsoft Dryad

Hortonworks has been blogging about Tez, a general-purpose data processing framework. Reading through the posts, I was reminded of a similar framework called Dryad that came from Microsoft Research a while back. This blog post is an

Using native Hadoop shell and UI on Amazon EMR

Amazon’s Elastic MapReduce (EMR) is a popular Hadoop-in-the-cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on it. EMR defines an abstraction called the ‘jobflow’ to submit jobs to

Elephants in the Clouds

Over the past year, there have been a lot of new product and project announcements related to running Hadoop in cloud environments. While Amazon’s Elastic MapReduce continued with enhancements over its base platform, players like Qubole, Mirantis, VMware, Rackspace

Taking memory dumps of Hadoop tasks

On the Hadoop users mailing list, I was recently working with a user on a problem that Hadoop users face again and again. The user’s tasks were running out of memory and he wanted to know why. The typical approach to

Hadoop filesystem shell subcommand variants

At a client location where I am working as a Hadoop consultant, an upgrade is being planned from a Hadoop 0.20-based version to a Hadoop 2.0-based version. Through these versions, there have been some changes to the hadoop

