Blog Archives

An overview of Impala

As enterprises move to Hadoop-based data solutions, a key pattern is that ‘big data’ processing happens in Hadoop land and the resultant derived datasets (such as fine-grained aggregates) are migrated into a traditional data warehouse for further consumption. The reason this pattern exists…
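As a quick illustration of the idea (a sketch of my own, with a made-up table and column names), such a fine-grained aggregate can be computed in place with Impala’s shell rather than exported to a warehouse first:

    # 'page_views' and its columns are hypothetical; impala-shell accepts a single query via -q
    impala-shell -q "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page"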

Posted in Hadoop, Technical

Comparing Apache Tez and Microsoft Dryad

Hortonworks has been blogging about Tez, a general-purpose data processing framework. Reading through the posts, I was reminded of a similar framework that came out of Microsoft Research a while back, called Dryad. This blog post is an…

Posted in Hadoop, Technical

Using native Hadoop shell and UI on Amazon EMR

Amazon’s Elastic MapReduce (EMR) is a popular Hadoop-on-the-cloud service. Using EMR, users can provision Hadoop clusters on Amazon AWS resources and run jobs on them. EMR defines an abstraction called the ‘jobflow’ to submit jobs to…
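As a rough sketch of what ‘native’ access looks like (the key pair and master DNS name below are placeholders), one can SSH to the cluster’s master node and use the regular Hadoop shell directly:

    # SSH to the EMR master node as the 'hadoop' user (key file and hostname are placeholders)
    ssh -i ~/mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

    # once on the master, the standard Hadoop shell commands are available
    hadoop fs -ls /
    hadoop job -list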

Posted in Cloud, Hadoop, Uncategorized

Elephants in the Clouds

Over the past year, there have been a lot of new product and project announcements related to running Hadoop in cloud environments. While Amazon’s Elastic MapReduce continued with enhancements over its base platform, players like Qubole, Mirantis, VMware, Rackspace…

Posted in Hadoop, Technical

Taking memory dumps of Hadoop tasks

On the Hadoop users mailing list, I was recently working with a user on a problem that comes up often: his tasks were running out of memory and he wanted to know why. The typical approach to…
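A common first step (a sketch, not the post’s full walkthrough; the jar and class names are placeholders, and it assumes the job driver uses ToolRunner so -D options are honoured) is to ask the task JVMs to write a heap dump when they hit an OutOfMemoryError, then inspect the dump offline:

    # request a heap dump on OOM for task JVMs; the property is mapred.child.java.opts on
    # Hadoop 0.20/1.x, while Hadoop 2.x splits it into mapreduce.map.java.opts / mapreduce.reduce.java.opts
    hadoop jar myjob.jar MyJob \
      -Dmapred.child.java.opts="-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heap.hprof" \
      input output

    # copy the dump off the task node and analyse it with jhat or a similar heap analyser
    jhat heap.hprof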

Posted in Hadoop, Technical

Hadoop filesystem shell subcommand variants

At a client site where I am working as a Hadoop consultant, an upgrade is being planned from a Hadoop 0.20-based version to a Hadoop 2.0-based version. Across these versions, there have been some changes to the hadoop…
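To give a flavour of the variants involved (the examples and paths are mine, not the post’s):

    # the 0.20-era form, still supported in Hadoop 2
    hadoop fs -ls /user/alice

    # 'hadoop dfs' is deprecated in Hadoop 2 and prints a warning suggesting 'hdfs dfs' instead
    hadoop dfs -ls /user/alice

    # the Hadoop 2 form for HDFS-specific usage
    hdfs dfs -ls /user/alice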

Posted in api, Hadoop, Technical

Setting up a single node Hadoop cluster based on trunk

Apache Hadoop is an open-source project that provides the capability to store and process petabytes of data. The project releases tested software that can be downloaded and used on clusters of varying sizes. However, this post is a compilation…
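The broad shape of such a setup (a sketch only; the repository URL and prerequisites such as Maven and protobuf are my assumptions, not the post’s exact steps) is to build a distribution from trunk and bring up HDFS and YARN on one machine:

    # check out trunk and build a binary distribution (requires a JDK, Maven and protobuf)
    git clone https://github.com/apache/hadoop.git
    cd hadoop
    mvn clean package -Pdist -DskipTests -Dtar

    # unpack the tarball produced under hadoop-dist/target, point fs.defaultFS at localhost in
    # core-site.xml, set dfs.replication to 1 in hdfs-site.xml, then:
    bin/hdfs namenode -format
    sbin/start-dfs.sh
    sbin/start-yarn.sh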

Posted in Hadoop, Technical