Blog Archives

An overview of Impala

As enterprises move to Hadoop based data solutions, a key pattern seen is that ‘bigdata’ processing happens in Hadoop land and the resultant derived datasets (such as fine grained aggregates) are migrated into a traditional data warehouse for further consumption. The reason this pattern exists

Tagged with: , , , ,
Posted in Hadoop, Technical

A couple of package management tricks

Few weeks back, I spent some time trying out Cloudera Impala. The target cluster was a 15 node cluster of CentOS VMs running CDH 4.5. The idea of the trial was to gain experience installing and running Impala on a dataset for a

Tagged with: , ,
Posted in Technical

Comparing Apache Tez and Microsoft Dryad

Hortonworks has been blogging about a framework called Tez, a general purpose data processing framework. Reading through the posts, I was reminded of a similar framework that had come from Microsoft Research a while back called Dryad. This blog post is an

Tagged with: , , , , ,
Posted in Hadoop, Technical

Using native Hadoop shell and UI on Amazon EMR

Amazon’s Elastic MapReduce (EMR) is a popular Hadoop on the cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on them. EMR defines an abstraction called the ‘jobflow’ to submit jobs to

Tagged with: ,
Posted in Cloud, Hadoop, Uncategorized

Elephants in the Clouds

Over the past one year, there have been a lot of new product / project announcements related to running Hadoop in Cloud environments. While Amazon’s Elastic MapReduce continued with enhancements over its base platform, players like Qubole, Mirantis, VMWare, Rackspace

Tagged with: ,
Posted in Hadoop, Technical

Taking memory dumps of Hadoop tasks

On the Hadoop users mailing list, I was recently working with a user on a recurring problem faced by users of Hadoop. The user’s tasks were running out of memory and he wanted to know why. The typical approach to

Tagged with: ,
Posted in Hadoop, Technical

Hadoop filesystem shell subcommand variants

At a client location where I am working as a Hadoop consultant, an upgrade is being planned from a Hadoop 0.20 based version to a Hadoop 2.0 based version. Through these versions, there have been some changes to the hadoop

Posted in api, Hadoop, Technical