At a client location where I am working as a Hadoop consultant, an upgrade is being planned from a Hadoop 0.20-based version to a Hadoop 2.0-based version. Across these versions, there have been some changes to the hadoop fs subcommand, and users have been asking how the variants differ. This post aims to answer some of those questions.
There are three variants that exist currently:
- hadoop fs
- hadoop dfs
- hdfs dfs
All of these commands provide a CLI to the configured filesystem in Hadoop, and have options that mimic various file operations like ls, put, get, mv, cat, etc. To folks who have played around with all of them, the results will have appeared more or less the same, barring a deprecation warning in some cases.
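As a rough illustration, the same listing can be requested through each of the three variants. This is only a sketch: the /user path is an assumed example, and the small run helper is a hypothetical convenience (not part of Hadoop) that prints the command instead of executing it when the executable is not installed.

```shell
#!/bin/sh
# Hypothetical helper: run a command if its executable is on the PATH,
# otherwise just print what would have been run.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# The same operation through each variant (/user is an assumed example path).
run hadoop fs -ls /user     # generic FileSystem shell
run hadoop dfs -ls /user    # original subcommand; may print a deprecation warning
run hdfs dfs -ls /user      # subcommand of the hdfs script
```

On a cluster where HDFS is the configured filesystem, all three should list the same directory.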
The first thing to note is that hadoop dfs was the original shell subcommand. The variant hadoop fs was introduced via HADOOP-824. As mentioned in the JIRA, the change was made to reflect the nature of Hadoop’s generic FileSystem API, which provides a service-provider-like interface that any filesystem can implement. Indeed, Hadoop supports multiple filesystems, including the local filesystem, HDFS (Hadoop’s native distributed filesystem), Amazon S3 and so on. For some reason (most likely backwards compatibility) the hadoop dfs subcommand was neither deprecated nor removed as part of this change. Hence, from this point onwards, both subcommands could be used to access the same underlying filesystem.
Sometime after Hadoop 0.20, the community decided to split the Hadoop project into three separate projects: common, hdfs and mapreduce. While this split itself was partially reverted for various reasons, a few changes made as part of it were retained in the codebase. Specifically, HADOOP-4868 carved two further scripts, including hdfs, out of the single hadoop shell script. As part of this change, subcommands that seemed relevant to the distributed filesystem alone were moved to the hdfs script. The dfs subcommand was also moved there, although the fs subcommand was retained in the hadoop script itself. hadoop dfs continued to work: the hadoop script printed a deprecation warning and then called out to the hdfs script.
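The dispatch described above can be pictured roughly as follows. This is a simplified sketch and not the actual bin/hadoop script: the function name hadoop_dispatch is hypothetical, the warning text is only illustrative, and commands are echoed rather than executed so that the sketch runs without a Hadoop installation.

```shell
#!/bin/sh
# Simplified, hypothetical model of how the hadoop script treats fs and dfs.
# It echoes the command it would hand over to instead of executing it.
hadoop_dispatch() {
    subcommand=$1
    shift
    case "$subcommand" in
        dfs)
            # Deprecated entry point: warn on stderr, then delegate to hdfs.
            echo "DEPRECATED: use the hdfs command instead of hadoop dfs." >&2
            echo "exec hdfs dfs $*"
            ;;
        fs)
            # Retained in the hadoop script: launch the generic filesystem shell.
            echo "exec java org.apache.hadoop.fs.FsShell $*"
            ;;
        *)
            echo "unknown subcommand: $subcommand" >&2
            return 1
            ;;
    esac
}

hadoop_dispatch dfs -ls /
hadoop_dispatch fs -ls /
```

The point of the sketch is only that dfs takes a detour through the hdfs script while fs is handled directly; both end up running a filesystem shell against the configured filesystem.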
One notable point is that in order to execute any filesystem commands, the classpath needs to refer to the classes that implement the filesystem. In Hadoop, as of this writing, all filesystem implementations, other than HDFS, sit in the hadoop-common jar. The HDFS implementation, on the other hand, lives in its own jar. An important decision made by the community in HADOOP-4868 was to include the HDFS jars in the classpath of the hadoop script, so that the fs subcommand can interact with an underlying HDFS filesystem – which is the most typical installation of Hadoop. The hdfs script also shares its classpath configuration with the hadoop script.
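One way to check this on a given installation is to look at the classpath the hadoop script reports and pick out the HDFS-related entries. The sketch below assumes the hadoop classpath subcommand is available; the helper name hdfs_entries and the sample paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split a colon-separated classpath read from stdin
# into one entry per line and keep only the entries that mention hdfs.
hdfs_entries() {
    tr ':' '\n' | grep -i 'hdfs'
}

# With Hadoop installed, feed it the real classpath:
#   hadoop classpath | hdfs_entries
# Illustrative input with made-up paths:
echo "/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs/*" | hdfs_entries
```

If the HDFS jars are on the hadoop script's classpath, as the post describes, at least one entry should be printed.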
What that really means is that, as far as HDFS is concerned, hadoop fs and hdfs dfs are both properly supported usages without any difference. Even for the other filesystems, the two variants currently behave the same way. However, it is not unimaginable that in future other filesystem implementations will move out of the hadoop-common jar, and that the hdfs script will then no longer refer to them, thereby making hdfs dfs an HDFS-only subcommand.
My personal preference would be to use hadoop fs, as it is more generic. However, there is little harm in using hdfs dfs if we are very sure that our Hadoop filesystem of choice isn’t going to change. hadoop dfs is best avoided, as it has already been deprecated.