Facebook PrestoDB — Full Review

Should you consider using Presto to query your Hadoop data?

Introduction

This article is meant to answer a question we asked ourselves:

  • Can PrestoDB be a good alternative to Apache Impala?

Through this article, I'll examine PrestoDB according to the parameters I find relevant, and in the summary I'll answer the question above.

What is Presto?

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

It was developed at Facebook, where it is used in production, and it is also used by Airbnb, Dropbox, Netflix, Uber, AWS (EMR & Amazon Athena), and more big names.

Documentation

There is nice documentation on the official Presto website, but it's not as detailed as Impala's documentation on the Cloudera website. The code is also much less documented than Impala's code.

Community

Since Presto is used by many big internet companies, there is a pretty good community around it:

  • 7.2k stars on GitHub (compared to 2k for Impala)

We can still see that while Presto is more popular on GitHub, Impala has many more Google results and a much better popularity ranking according to DB-Engines (which considers Google Trends, the number of related job offers, LinkedIn mentions, tweets, and technical discussions).

Update (thanks to James Taylor's comment): Impala is an Apache project, so activity happens on the Apache dev and user mailing lists. The project is only mirrored to GitHub, so it's not typical for users to star the mirror.

Support

Currently, the only reliable enterprise support for Presto is from a company named Starburst, which has also contributed to Presto's code and is recommended on the official Presto website.

Presto Architecture

There are 2 main components in Presto: Coordinator and Worker.

Unlike with Apache Impala, it's not common to use more than one coordinator per cluster, although from what I've checked, it's possible to configure multiple coordinators.
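For reference, here is a minimal sketch of what the coordinator/worker split looks like in configuration, based on Presto's standard deployment layout (the hostname, port, and memory values are placeholders):

```
# etc/config.properties on the coordinator
coordinator=true
# don't schedule query work on the coordinator itself
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080

# etc/config.properties on each worker
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://coordinator.example.com:8080
```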

The architecture is simple. The coordinator is made of 3 components: a Parser, a Planner, and a Scheduler.

It's pretty similar to Impala, as they are both based on Google Dremel.

Presto has connectors that provide a Metadata API for the parser, a Data Location API for the scheduler, and a Data Stream API for the workers, in order to perform queries over multiple data sources. A single query can read data from multiple sources; that's Presto's main advantage: it's super-pluggable.
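For example, a single query could join a Hive table with a MySQL table. A hypothetical sketch (the catalog, schema, and table names are made up, and it assumes a MySQL catalog has also been configured):

```
-- Join page views stored in HDFS (via the Hive connector)
-- with user records from an operational MySQL database
SELECT u.country, count(*) AS views
FROM hive.web.page_views v
JOIN mysql.crm.users u
  ON v.user_id = u.id
GROUP BY u.country
ORDER BY views DESC;
```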

Hadoop Integration — Hive Connector

The Hive connector allows querying data that is stored in HDFS and mapped in the Hive Metastore.

It may be obvious, but worth stating: the execution engine is not Apache Hive or MapReduce; it's Presto.

The supported file formats are ORC, Apache Parquet, Avro, RCFile, SequenceFile, JSON, and Text.

It supports Apache Hadoop 2.x as well as CDH 4 and 5, but Cloudera doesn't support Presto in Cloudera Manager (because they have Impala).

Because Presto is so pluggable, it is possible to configure multiple Hadoop clusters to query from.
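Concretely, each Hadoop cluster gets its own catalog file under etc/catalog/, and the file name becomes the catalog name used in queries. A sketch, with placeholder file names and metastore URIs:

```
# etc/catalog/hive.properties -- first Hadoop cluster, queried as 'hive'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore1.example.com:9083

# etc/catalog/hive2.properties -- second Hadoop cluster, queried as 'hive2'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore2.example.com:9083
# point at this cluster's HDFS configuration files
hive.config.resources=/etc/hadoop/conf2/core-site.xml,/etc/hadoop/conf2/hdfs-site.xml
```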

Metadata Caching Mechanism

Presto has a metadata caching mechanism that is very poorly documented and barely discussed on the web. I'll write here what I found and how I understood it.

Presto's metadata cache lives in the coordinator, so it doesn't suffer from cache-consistency problems as long as objects are changed only via Presto. However, if objects in the metastore are also changed via Hive, the cache is inconsistent until a refresh occurs.

Presto doesn't have a REFRESH statement like Impala has; instead, there are 2 parameters in the Hive connector properties file:

  • hive.metastore-refresh-interval

This means that each entry in the cache is refreshed every X amount of time, and after Y amount of time it's evicted from the cache. In release 0.174, the default value of these parameters was set to 0 to “disable cross-transaction metadata caching”.
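A sketch of the relevant lines in the Hive catalog properties file; the values are illustrative, and the TTL property name (hive.metastore-cache-ttl) is what I found in web discussions rather than in official documentation:

```
# refresh each cached metastore entry every 5 minutes (the X above)
hive.metastore-refresh-interval=5m
# evict entries from the cache after 1 hour (the Y above); property
# name taken from web discussions, not from the official docs
hive.metastore-cache-ttl=1h
```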

The fact that there’s a refresh interval and a cache timeout has its pros and cons:

Pros:

  • No need for metadata refresh management like we have in Impala.

Cons:

  • A refresh may be heavy, and unnecessary refreshes can become a bottleneck; performing the refreshes ourselves might be a more efficient solution.

Presto's metadata cache is a Guava-based LRU cache with a default expiration of 1 hour and a default maximum of 10,000 entries (both tunable). To learn more about Presto's metadata caching mechanism, read about the Guava cache.

As I said, the documentation on this subject is poor, and even the 2 parameters I've mentioned are not documented on the Presto website; I found them in discussions on the web.

Parquet Optimized

Presto has 2 Parquet readers: the first was developed by the Presto team, and the second, added in release 0.138, was developed at Uber.

In an Uber blog post about Presto and Apache Parquet, they say they chose Parquet as their columnar format to achieve high performance. But Presto's Parquet reader was not leveraging all of Parquet's advantages, so they wrote their own.

The Uber Parquet reader is not enabled by default; it can be enabled through the config property `hive.parquet-optimized-reader.enabled=true`.

The improvements in the new Parquet reader are:

  • Nested column pruning

Presto's new Parquet reader is anywhere from 2x to 10x faster than the original one.

To enable Parquet predicate pushdown there is another configuration property: `hive.parquet-predicate-pushdown.enabled=true`.
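Putting the two flags together, the relevant lines in the Hive catalog properties file would look like this sketch:

```
# etc/catalog/hive.properties
# use the newer, Uber-contributed Parquet reader
hive.parquet-optimized-reader.enabled=true
# push predicates down into the Parquet reader
hive.parquet-predicate-pushdown.enabled=true
```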

In an AWS conference we’ve been to, the recommended file format for Presto was Parquet, so I guess Presto is overall pretty optimized for the Parquet format.

Leverage Statistics

Currently the only connector that supports statistics is the Hive connector.

Presto leverages Hive's table statistics when they are available, and there is no way to compute statistics in Presto itself (unlike Impala); collecting table statistics is done through Hive.
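For example, in Hive that would look something like this (the table name is a placeholder):

```
-- collect table-level statistics in Hive
ANALYZE TABLE web.page_views COMPUTE STATISTICS;
-- collect column-level statistics as well
ANALYZE TABLE web.page_views COMPUTE STATISTICS FOR COLUMNS;
```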

The Parquet format also has column-level statistics in its footer, and the new Parquet reader leverages them for predicate/dictionary pushdowns and lazy reads.

Data Locality

Presto can support data locality when the workers are installed on the same machines as the HDFS data nodes. If you set `node-scheduler.network-topology=flat`, Presto will try to schedule splits on the machines that hold the data. See https://prestosql.io/docs/current/admin/properties.html#node-scheduler-network-topology

But it's important to understand that data locality as a concept is becoming much less relevant with today's bandwidth (10GbE). In my opinion, compute and storage should be completely separated, at least when it comes to ad-hoc queries. So if we were to use Presto, it would be in a separate cluster dedicated to Presto.

Query Interface

To query Presto, one can use JDBC/ODBC, which means practically every SQL client and programming language can query Presto.
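For example, the Presto JDBC driver is addressed with a URL of the following form (the host, port, catalog, and schema here are placeholders):

```
jdbc:presto://coordinator.example.com:8080/hive/default
```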

But the specific tools I think analysts should use to query Presto are:

  1. DataGrip: a great JetBrains product we already use to query Impala.

Administration

As mentioned in the ‘Hadoop Integration’ section, Cloudera doesn't support Presto in Cloudera Manager. Therefore, if we want to use Presto, we need a proper administration tool.

The 2 main tools I’ve found for Presto administration are:

  1. The Presto Web UI: Presto provides a web interface for monitoring and managing queries. It basically gives the same abilities and shows the same data as the Impala web UI (at the coordinator level).
  2. Presto-Admin: a command-line tool whose abilities include:
  • Install and uninstall Presto across your cluster

Summary

Presto can be an alternative to Impala. The reason is simple: it's an MPP engine designed for the exact same mission as Impala, and it has many major users, including Facebook, Airbnb, Uber, Netflix, Dropbox, and more. Its main advantage over Impala is the fact that it is super-pluggable.

But in terms of performance over Hadoop — Impala in its newest version may perform better than Presto.

After everything I've read on the web about Presto, I'm convinced we should properly test it in a semi-production environment and see its actual performance, because I like data-backed answers.