As enterprises move to Hadoop-based data solutions, a common pattern is that 'big data' processing happens in the Hadoop ecosystem, and the resultant derived datasets (such as fine-grained aggregates) are migrated into a traditional data warehouse for further consumption. This pattern exists because, until relatively recently, Hadoop remained largely a batch processing system, which made it poorly suited to the exploratory analytics common in standard Business Intelligence applications. The high latencies of batch-oriented Hadoop jobs, and the gap vis-à-vis standard SQL compatibility, were some of the main reasons for this. However, as it became clear that Hadoop was here to stay, both the Hadoop vendors (and community) and the enterprise data warehousing vendors realised the need to close these gaps.
Within the Hadoop community, this manifested itself in several independent efforts to bring a low-latency, SQL-based query experience to Hadoop: support for standard SQL, with the low query latencies users have come to expect from a conventional database. This class of solutions is generally referred to as “SQL on Hadoop”. Cloudera’s Impala is a prominent player in this space; other systems include the Stinger initiative led by Hortonworks, Hawq from Pivotal, Polybase from Microsoft, and Presto from Facebook.
This post is an overview of Impala. A tl;dr summary: Impala employs distributed query processing, a technique wherein a distributed query plan is generated and farmed out to multiple nodes that process the data in parallel, with the final results aggregated and returned to the user. Several other optimisations contribute to the speed-up.
The following picture illustrates the main components in an Impala solution:
Broadly, the components can be organised into four layers:
This layer contains the components that store the data to be queried. These are not really Impala components, but the regular components that store data in Hadoop, such as HDFS datanodes and HBase region servers.
This is the most important layer in the solution.
The query engine comprises the core Impala component, called Impalad. This daemon is primarily responsible for executing queries. In a cluster, there is typically one such daemon per datanode. Impala has a peer-to-peer architecture, i.e. there is no designated master: clients can connect to any daemon to issue queries. The daemon to which a query is issued is called the ‘coordinator’ for that query. The coordinator distributes the work of executing the query to multiple Impala daemons, and aggregates their results to return to the client.
In turn, the participating peer daemons execute the query fragments on the data assigned to them, and send the results back to the coordinating daemon.
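This scatter-gather pattern can be illustrated with a small Python sketch. This is a toy model with hypothetical names; real Impala daemons compile and exchange plan fragments over the network rather than calling Python functions:

```python
# Toy model of Impala-style scatter-gather execution (illustrative only).
# A "fragment" here is a partial aggregation over one worker's share of rows.

def worker_execute(fragment_rows):
    """Each peer daemon runs its plan fragment on local data: here,
    a partial SUM and COUNT for an AVG query."""
    return sum(fragment_rows), len(fragment_rows)

def coordinator(query_rows, num_workers):
    """The coordinator splits the work, scatters it, and merges partials."""
    # Scatter: assign each worker a slice of the data (in real Impala,
    # assignment follows HDFS block locations for data locality).
    slices = [query_rows[i::num_workers] for i in range(num_workers)]
    # In Impala these fragments run in parallel on separate daemons.
    partials = [worker_execute(s) for s in slices]
    # Gather: merge the partial aggregates into the final answer.
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count

print(coordinator([10, 20, 30, 40, 50, 60], num_workers=3))  # 35.0
```

The key point the sketch captures is that partial aggregates (sum, count) can be merged losslessly at the coordinator, which is what lets the heavy scanning work happen where the data lives.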
The statestored daemon is another Impala component; it tracks the availability of Impala daemons and transmits their health status to the other daemons. The Impala daemons use this information to avoid distributing query work to unreachable or unhealthy nodes.
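The statestore's role can be sketched as a simple membership service. This is a toy model with hypothetical names; the real statestore uses a topic-based publish-subscribe mechanism with periodic heartbeats:

```python
# Toy model of statestore-style health tracking (illustrative only).

def live_daemons(last_heartbeat, now, timeout=5.0):
    """Daemons whose last heartbeat is recent are considered healthy.
    The statestore broadcasts this membership view to all daemons,
    so a coordinator can skip unreachable peers when scattering work."""
    return {d for d, t in last_heartbeat.items() if now - t <= timeout}

heartbeats = {"impalad-1": 100.0, "impalad-2": 99.0, "impalad-3": 90.0}
print(live_daemons(heartbeats, now=102.0))  # impalad-3 has timed out
```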
This layer contains components that hold information about the data.
The Namenode is the HDFS master that holds information about the data stored on datanodes.
The Hive metastore is another important component in an Impala cluster. This is the component that provides Impala with information on the SQL schemas – the databases, tables, column definitions, storage formats, etc – on the basis of which queries are compiled and executed. This implies that Impala can query tables that are already defined in Hive. Metadata of any new tables created via Impala is also stored in the same metastore.
The catalogd daemon is another core Impala component. It is responsible for synchronising metadata amongst the Impala daemons, and between the Impala daemons and the other metadata components (the Namenode and the Hive metastore).
The consumption layer includes components that issue queries and retrieve results. The Impala shell is akin to any SQL shell utility: using it, one can connect to an Impala daemon and fire queries at it. The JDBC / ODBC drivers enable connectivity for programmatic integrations, such as from custom web applications or existing Business Intelligence solutions. Hue is a graphical UI tool from Cloudera that provides SQL-workbench-like functionality.
Let us consider the interactions between Impala components in the context of two main flows: executing a query, and synchronising metadata.
Executing a query
- As described earlier, the Impala client connects to an Impala daemon to issue the query. This daemon logically becomes the ‘coordinator’ for the query.
- The daemon analyses the query, generates a distributed query plan, and breaks the plan up so as to distribute work amongst its peers.
- To distribute work efficiently, the daemon relies on metadata that it has cached from its interactions with the catalogd service. This metadata describes, for example, the locations of the HDFS blocks on the datanodes where the data resides.
- Work is now distributed to multiple Impala daemons. The coordinator relies on information from statestored to know which daemons are unhealthy and avoid sending work to them.
- The worker Impala daemons execute the query fragments assigned to them.
- In the process, they read data from the underlying storage systems, i.e. datanodes or HBase region servers. For reads from datanodes, one interesting point is that they use an HDFS feature called “short-circuit reads”. When enabled, this feature allows an HDFS client co-located with a datanode to read block data directly, using a Unix domain socket shared between the datanode and the client, rather than going through the usual Hadoop RPC path for reads. Details of this feature are in the HDFS documentation.
- Once each daemon has executed its part of the query, the results are sent back to the coordinator, which aggregates them and returns the final result to the client.
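Short-circuit reads, mentioned above, are enabled through HDFS configuration on the datanodes and Impala hosts. A typical hdfs-site.xml fragment looks like the following (the socket path is an example; the directory must exist and be accessible to the datanode):

```xml
<!-- hdfs-site.xml: enable short-circuit local reads -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- Example path for the shared Unix domain socket -->
  <value>/var/run/hdfs-sockets/dn</value>
</property>
```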
Synchronising metadata
- The catalogd process is responsible for looking up metadata from HDFS and the Hive metastore, and broadcasting it to the Impala daemons at startup time.
- The daemons cache this information, and use it for query planning as described above.
- When a DDL operation is issued by an Impala client, it is handled by the catalogd process, which updates the Hive metastore and also broadcasts the changes to the other daemons.
- As Impala can also query tables defined via Hive, it needs to pick up changes that happen through Hive as well. Currently, there is no automated way of doing this: the user needs to issue a special command (INVALIDATE METADATA, or REFRESH for a specific table) to invalidate or refresh the cached metadata.
- When these synchronisation commands are issued, the catalogd process again manages this globally: it refreshes the metadata from the underlying systems and broadcasts the changes to the other daemons.
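The metadata flow above can also be sketched as a toy model in Python. All class and variable names here are hypothetical; in real Impala, catalogd broadcasts updates via the statestore rather than calling into daemon caches directly:

```python
# Toy model of catalogd-style metadata propagation (illustrative only).

class Catalog:
    """Central catalog: loads metadata once, broadcasts updates to daemons."""
    def __init__(self, metastore):
        self.metastore = metastore     # stands in for the Hive metastore
        self.daemon_caches = []        # per-Impalad cached metadata

    def register(self, cache):
        cache.update(self.metastore)   # initial broadcast at startup
        self.daemon_caches.append(cache)

    def ddl(self, table, schema):
        """DDL issued through Impala: update the metastore, broadcast change."""
        self.metastore[table] = schema
        for cache in self.daemon_caches:
            cache[table] = schema

    def invalidate(self):
        """After out-of-band Hive changes, reload and rebroadcast."""
        for cache in self.daemon_caches:
            cache.clear()
            cache.update(self.metastore)

metastore = {"sales": ["id", "amount"]}
catalog = Catalog(metastore)
impalad_cache = {}
catalog.register(impalad_cache)
catalog.ddl("users", ["id", "name"])   # DDL via Impala propagates immediately
metastore["logs"] = ["ts", "msg"]      # change made directly via Hive...
catalog.invalidate()                   # ...visible only after INVALIDATE METADATA
print(sorted(impalad_cache))           # ['logs', 'sales', 'users']
```

The sketch shows why the manual step exists: changes routed through the catalog propagate automatically, while changes made behind its back (directly in Hive) are only picked up when the caches are explicitly refreshed.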
The Hadoop community and vendors are striving to make low-latency SQL on Hadoop a reality. In Cloudera’s case, Impala aims to achieve this via several features: distributed query processing, minimising avoidable I/O by bypassing the slow Map/Reduce model of traditional Hadoop, optimisations like HDFS short-circuit reads, aggressive caching of metadata, reliance on table and column statistics, and so on. The success of these platforms will go a long way in simplifying access to big data stored on Hadoop systems.
Most of this information is from Cloudera’s documentation of Impala.
For clarification of the interactions between Impala components, chiefly on synchronising metadata, I wrote to the impala-user mailing list and received helpful responses from Marcel Kornacker, the architect of Impala.