Taking memory dumps of Hadoop tasks

On the Hadoop users mailing list, I was recently helping a user with a problem that Hadoop users face frequently: his tasks were running out of memory, and he wanted to know why. The typical way to debug such a problem is to use a diagnostic tool like jmap, or to hook a profiler up to the running task. For most Hadoop users, though, this is not feasible. In real deployments, tasks run on cluster nodes to which users have no login access, so they cannot debug a task as it runs. I was aware of a JVM option, -XX:+HeapDumpOnOutOfMemoryError, that can be passed on the task’s Java command line and makes the JVM write a heap dump when the task runs out of memory. However, the dump is saved by default to the task’s current working directory, which in Hadoop’s case is a temporary scratch space managed by the framework, and it becomes inaccessible once the task completes.

At the point where we had pretty much given up on options, Koji Noguchi, an ex-colleague of mine and a brilliant systems engineer and Hadoop debugger, responded with a way out. You can read about it here. The few lines he posted feel almost like magic, so I thought I would write up an explanation of how it works, for anyone who is interested.

The requirement is to save the dump to a location accessible to the user running the task. In Hadoop’s case, such a location is … HDFS. So what we want is to save the dump to HDFS when the task runs out of memory.

To get there, the first observation is that the JVM offers hooks, passed on the task’s command line, that let us act on an OutOfMemory scenario. The options Koji used are:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof -XX:OnOutOfMemoryError=./copy_dump.sh

These options instruct the JVM to take a heap dump on OutOfMemory, save it under a name of our choice, and, importantly, run a custom script. It should be clear now that the custom script can copy the generated dump to HDFS, using regular DFS commands.

There are a couple of gaps to plug, though, as the devil is in the details. How does the script (copy_dump.sh in the example above) get to the cluster nodes? For that, we use a Hadoop feature: the Distributed Cache. This feature packages arbitrary files and distributes them to the cluster nodes where tasks need them, using HDFS as an intermediate store. The cached files are normally available to tasks via a Java API. However, since we need the script outside the task’s code, we use a powerful option of the Distributed Cache: symlink creation. This option not only makes the file available to the task via the API, but also creates a symbolic link to it in the task’s current working directory. Hence, when we refer to the script on the task’s command line, we can refer to it relative to the current working directory, i.e. ‘.’.

The specific options to set up all of the above are as follows:

conf.set("mapred.create.symlink", "yes");
conf.set("mapred.cache.files", "hdfs:///user/hemanty/scripts/copy_dump.sh#copy_dump.sh");
conf.set("mapred.child.java.opts","-Xmx200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof -XX:OnOutOfMemoryError=./copy_dump.sh");
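For ad-hoc debugging, the same settings can also be passed on the command line rather than set in the driver code. This is a sketch, assuming the job’s main class goes through ToolRunner/GenericOptionsParser (which accepts -D key=value pairs); myjob.jar and MyJob are placeholder names:

```shell
hadoop jar myjob.jar MyJob \
  -Dmapred.create.symlink=yes \
  -Dmapred.cache.files=hdfs:///user/hemanty/scripts/copy_dump.sh#copy_dump.sh \
  -Dmapred.child.java.opts="-Xmx200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof -XX:OnOutOfMemoryError=./copy_dump.sh" \
  <job arguments>
```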

The last detail in the whole solution is how to save the dump on HDFS under a unique name. Realize that multiple tasks run at the same time, and more than one of them could run out of memory. Koji’s scripting brilliance for solving this problem was to use the following script:

hadoop fs -put heapdump.hprof /tmp/heapdump_hemanty/${PWD//\//_}.hprof
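Putting the pieces together, the complete copy_dump.sh distributed via the cache could be as small as the following sketch. It assumes the dump name from the JVM options above, that the hadoop CLI is on the task’s PATH, and that the target HDFS directory /tmp/heapdump_hemanty exists and is writable:

```shell
#!/bin/bash
# copy_dump.sh - run by the JVM (via -XX:OnOutOfMemoryError) from the
# task's current working directory, where heapdump.hprof was just
# written (per -XX:HeapDumpPath=./heapdump.hprof).
# ${PWD//\//_} flattens the working directory path into a unique name.
hadoop fs -put heapdump.hprof /tmp/heapdump_hemanty/${PWD//\//_}.hprof
```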

The expression ${PWD//\//_} takes the task’s current working directory (from the environment) and replaces every occurrence of ‘/’ with ‘_’. Since each task attempt runs in its own working directory, the resulting file name is unique. Nice!
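The effect of this bash parameter expansion can be seen outside Hadoop with any path. The directory below is a made-up example standing in for a real task attempt’s $PWD:

```shell
#!/bin/bash
# ${var//pattern/replacement} substitutes every match, so each '/'
# in the path becomes '_', yielding a flat, collision-free file name.
dir="/tmp/job_0001/attempt_0001_m_000000_0/work"
echo "${dir//\//_}.hprof"
# prints: _tmp_job_0001_attempt_0001_m_000000_0_work.hprof
```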

So, using these options and features of the JVM and Hadoop, users can now get memory dumps of their tasks into locations they can easily access and analyse. Thanks a lot to Koji for sharing this technique, and happy debugging!

7 comments on “Taking memory dumps of Hadoop tasks”
  1. […] post originally appeared in my older blog. I am carrying over a few of those posts which seem to be […]

  2. Surbhi says:

    This is an excellent blog! Is there a way to write heap dump files to HDFS from tasks that do not fail due to a memory exception?

  3. yhemanth says:

    Surbhi, thank you for your comment. I looked around to see whether the JVM exposes any option that does what you require (http://www.oracle.com/technetwork/articles/java/vmoptions-jsp-140102.html), but it appears there is none. Even manually throwing an OutOfMemoryError did not cause a heap dump to be generated. Sorry, I don’t know of an option right now.

    BTW, I have moved my blog to another site linked in the first comment on this post. If you’d like, please visit that site for any new information. Thanks!

  4. Krever says:

    I’ve tried the described procedure, but my dump is only partially copied. I get ~500 MB of a 1.5 GB dump on HDFS, and the file has a _COPYING_ suffix. Do you have any idea what the problem might be?

  5. Hi, I’ve tried your solution and it looks nice, but I cannot solve one problem: my heap dump file is only partially copied, as if something interrupts the copying process. It is probably a matter of YARN configuration. Do you have any idea what the problem might be?

  6. George says:

    Thank you for the post! I understand this should read:

  7. @yhemanth: My container heap size is 2 GB, but the dump is not getting copied fully to HDFS; I’m getting partially copied files ending with –

    Any idea how this can be resolved?
