Few weeks back, I spent some time trying out Cloudera Impala. The target cluster was a 15 node cluster of CentOS VMs running CDH 4.5. The idea of the trial was to gain experience installing and running Impala on a dataset for a client project, and see how it measures against our use cases. I will hopefully cover the results of the trial in a future post.
In this post, I am going to write about two package management tricks I learnt as part of the installation. The tricks are not specific to installing Impala, and hence will be widely applicable. Therefore, I hope this post will be of some use to folks reading it.
Repeating a multi-file install on multiple machines
The first task was, of course, to install Impala on multiple machines. Using Cloudera Manager would possibly have made this task easier. However, we were not using it. If we were setting up from scratch, we could also have set up one machine and cloned it. However, the cluster was not only setup, but was being used in parallel for other purposes and we did not want to disturb the existing set up.
To install Impala on these CentOS machines, I had to use the yum package management tool, and point to Cloudera’s yum repositories. The install was installing more than 100 new packages and upgrading more than 10 of them from the Internet – overall a lot of bits to pull down onto multiple machines. I wanted to do this more efficiently. This led me to learn about setting up local yum repositories.
A local yum repository is a local directory containing all the required packages. This link had all the necessary background material to understand how to set one up. In order, I did the following:
- Install a plugin to yum called yum-downloadonly.
- With this, execute a command to just download, but not install, the RPM files:
yum install <package> –downloadonly
- Run the ‘createrepo’ command on the directory containing the downloaded RPM files, like so:
createrepo <path to downloaded packages>
- Create a yum configuration file pointing to this folder. This file can look like below:
name=Local Repository for Impala Packages
- Copy this configuration file to the location of yum configuration files – typically /etc/yum.repos.d
- Trigger an install as follows:
yum –disablerepo=”*” –enablerepo=”local_impala_repo” install
The –disablerepo=”*” was required to prevent yum from using existing public repositories under /etc/yum.repos.d for the default packages. –enablerepo option enables the local repository set up above, making it the only source of packages available to yum.
Note that if the path to which the RPM files are copied is a path accessible from all machines – like an NFS mounted file system – you can download the files just once and use it from all machines.
Extracting files from a package
Impala requires Hadoop’s native library, libhadoop.so, for its functioning. The systems I was installing Impala on had CDH installed using tarballs which do not contain the native libraries. Hence, I needed to get these separately and install them. Since these are native libraries, one needs to be careful to get the versions compatible with the system being used. The best way to do this is to build from source. If you are lazy like me, you may want to get these from a compatible RPM like the ones available here (for CentOS 6, 64-bit, CDH 4.5). The task becomes a little more interesting if some of the files provided by the RPM are already installed on the system (through other means). This being the case, I wanted to see if there was a way to extract files from the RPM without actually installing it, so that I could just pick what I wanted. Which led me to learn about cpio and rpm2cpio.
The cpio utility copies files between archives of a set of supported archive types. Like many standard Unix tools, it is also capable of reading archive files from standard input and write archives to standard output. One of the supported archive types is the ‘cpio’ format itself.
The rpm2cpio utility extracts files provided by the RPM into the cpio format that is written out to standard output. Composition is straightforward now:
rpm2cpio <name of rpm> | cpio -idm
This pipeline extracts all files provided by an RPM relative to the current working directory using the cpio format as an intermediate transformation format. One can now easily pick the specific files required from the extracted contents.
In summary, the two tricks I learnt over the course of a day were immensely helpful in getting a working Impala installation on multiple machines. A reliable search engine, tools that follow Unix’s design principles of composition, configurability and usability, and immensely helpful documentation from unknown authors (in the form of blogs, howtos, etc.) who took the time to document these tidbits for the benefit of others made this possible and easy. Of these, I’d rate the documentation as the most important. And this post is hopefully an attempt to pay forward the gratitude.