5 Main Missing Features in Impala (Opinion)

A letter to the developers and product manager of Impala

In this post I’m going to write what are the features I reckon missing in Impala. We take Impala to the edge with over 20,000 queries per day and an average HDFS scan of 9GB per query (1,200 TB scanned/week). That’s why we face some issues that other users don’t face and I’m going to write about some of them here.

This is a really basic feature I would expect Impala to have by now (Impala 3.0) but they still don’t have it. Every piece of metadata (a.k.a CatalogObject) that is being cached in the coordinators will stay there until an INVALIDATE METADATA (or a restart) will occur.

That situation causes our metadata cache and topic size to grow and grow until it’s way too big (topic size > 4GB) and then the statestore’s performance is dramatically decreasing (i.e. refreshes are getting very slow) and finally it crushes.

I already mentioned in a previous post that a best practice for Impala is not to have so much metadata but sometimes this is simply the case.

The solution for this problem is of course a TTL (time-to-leave) mechanism that will clear catalog objects from the cache after a certain amount of time. Presto has it. Every normal caching mechanism has it.

We have implemented a smart solution of our own that is clearing tables metadata from the cache if they haven’t been queried in the last 4 hours. That way only the relevant tables are in the cache. I might write a full post about this solution in the future.

I don’t know how to call that missing feature but the main problem it should solve is the amount of data scanned by certain queries. I already talked about that issue in a previous post (partition index) and I explained how we solved it but I think Impala should add a feature that solves it from the inside.

Impala doesn’t have indexes, so when I want to query a very big table on a certain column value (i.e. full name, page title, phone number, url, etc.) I will have to scan the whole table unless the table is partitioned by that column. But some columns have too many distinct values and a table can’t have millions of partitions.

The solution for that problem in Hive is called Bucketing, and in Oracle it is called Hash Partitions.

Google BigQuery solved this problem recently with a feature called clustering, you can read about it here.

The bottom line is: Impala has to come up with a better solution (than simple partitioning) to that problem (queries that process a lot of unnecessary data — full table scan) or it will stay behind the industry.

UPDATE: This sorting feature from Impala 2.9 is Impala’s answer to that problem.

The query duration measurement in Impala is a problematic issue. If a query is finished but the session is still open, the query duration will keep growing until the session will be closed or timed out.

That issue makes our analysis of query performance and user experience almost impossible. We can’t know the actual duration of the queries that were submitted through Hue or DataGrip. We only know the query duration of the queries that are made through our BI tools (which automatically close the session after a query is done).

If most of the queries in the organization are made through Hue or another ODBC/JDBC interface like DataGrip — that’s a big problem. Because then we can’t even answer a simple question like what is the average query duration.

Impala has to add an attribute to the query that shows the actual execution duration time and not the time that the session was open.

I think it’s time that Impala would have an official proper REST API for queries. Presto has it and it’s amazing and very important for some of our developers.

Presto’s REST API is good because it’s asynchronous — you submit a query and get a status of the query as a response. And in the response you get the URI to check the progress of the query until finally you get the results when they’re available.

Our developers can’t query Impala through Node.js and it’s a big problem because they need to build a Flask application that uses impyla and exposes a REST API. That’s bad because it’s not asynchronous and it’s a weird overhead for the developers.

Again, in this aspect — Presto is better than Impala until they will have this feature.

This part is about troubleshooting Impala. I don’t remember too many times that we had issues and the logs (ERROR, WARNING or INFO) were helpful. Sure, sometimes the logs are indicative and we can diagnose the problem just from reading them. But most of the times we look at the queries’ exceptions, the metrics of the hosts and the queries themselves to diagnose a problem.

Another thing is that there is not much information about the guts of Impala and the query execution process. I think there should an “Impala under the hood” guide. It’ll help administrators, like myself, understand how each and every component is working in order to be able to read logs, diagnose issues and optimize performance properly.

This may not be a feature but it’s a really important thing to the customers of Impala.

I wrote this post for two main reasons:

  • Add more important knowledge about Impala to the community
  • Let the Impala developers / product managers know what I think should be improved.

I hope this was helpful and like every post of mine — the main purpose is discussion. If you have something to add or you think something I wrote here is wrong, please comment and start a discussion, it’ll contribute a lot to the community.

Finally, it’s important to say I think Impala is the best solution for SQL-on-Hadoop, but they should make some major improvements if they want to stay relevant.

I like data-backed answers