Interesting and specific lessons learned from experience

So recently my team and I started writing down lessons and conclusions from every issue we had with Spark. In this post I’m going to give you 5 interesting tips that are quite specific; if you ever hit one of the issues they cover, knowing the tip can solve the problem. I hope you’ll find them valuable.

1) Parquet schema vs. Hive Metastore in SparkSQL

When reading a Hive table made of Parquet files, you should be aware that Spark handles the table’s schema in its own particular way. As you may know, the Parquet format stores the table schema in its footer. …
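To see the difference yourself, here’s a minimal PySpark sketch that prints the Metastore’s version of the schema next to the one in the Parquet footers; the table name and path are hypothetical placeholders:

```python
# A minimal sketch for inspecting both schemas side by side. The table
# "db.events" and the HDFS path are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("schema-comparison")
    # true (the default): Spark uses its own Parquet reader and reconciles
    # the Hive Metastore schema with the schemas found in the file footers.
    .config("spark.sql.hive.convertMetastoreParquet", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# The schema as the Hive Metastore declares it.
spark.table("db.events").printSchema()

# The schema as the Parquet footers declare it, read straight from the files.
spark.read.parquet("/warehouse/db.db/events").printSchema()
```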


Our solution to the Hadoop small files problem

In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files, and describe my solution in detail.

A little background

As you may know by now (if you’ve read my previous posts), in my organization we keep a lot of our data in HDFS. Most of it is raw data, but a significant amount is the final product of many data-enrichment processes. In order to manage all the data pipelines conveniently, the default partitioning method for all the Hive tables is hourly DateTime partitioning (for example: dt=’2019041316’).
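Just to make the convention concrete, here’s a minimal PySpark sketch of an enrichment job writing with that hourly dt partitioning; all table, column, and path names are hypothetical:

```python
# A minimal sketch of an enrichment job writing hourly dt partitions.
# Table, column, and path names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

enriched = spark.table("db.raw_events")  # hypothetical source table

(enriched
    .withColumn("dt", F.date_format("event_time", "yyyyMMddHH"))
    .write
    .partitionBy("dt")  # creates one directory per hour: .../dt=2019041316/
    .mode("append")
    .parquet("/warehouse/db.db/enriched_events"))
```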

My personal opinion about the…


Why is it so important to track your users’ queries, and how do we do it?

In this post, I’ll explain how we used the ELK stack (actually just Elasticsearch and Kibana) to analyze how our users query the Hadoop cluster. As you might know if you’ve read my previous posts, our SQL-over-Hadoop solution is Apache Impala. But this post isn’t just about Impala; it’s relevant to a lot of technologies, even outside the Hadoop ecosystem.
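To give a concrete picture of the plumbing, here’s a minimal sketch of indexing a single query-log record into Elasticsearch so Kibana can chart it. The index name and document fields are made up for illustration, and a recent elasticsearch-py client is assumed:

```python
# A minimal sketch of shipping one query-log record into Elasticsearch.
# Index name and fields are hypothetical; assumes a recent elasticsearch-py.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(
    index="impala-queries",
    document={
        "user": "analyst_17",                      # who ran the query
        "statement": "SELECT ... FROM db.events",  # the SQL text
        "duration_ms": 8450,                       # how long it took
        "bytes_scanned": 9 * 1024**3,              # how much HDFS it read
        "timestamp": datetime.now(timezone.utc).isoformat(),
    },
)
```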

The Importance Of Monitoring Your Users

Sometimes, especially in big organizations, there can be many users who consume data from your lake. “Many” can range from dozens to over a hundred, or, as in our case, thousands. Of course it all depends on…


A short description of Firebolt, for not-so-technical people

Recently, we started using Looker for ad-hoc queries by business users in the company (PMs, managers, marketing, and basically everyone). One main problem we have, though, is that our Looker performance is pretty bad: queries just take too long.

[Image: Harry Potter receives his own Firebolt from Sirius Black]

The performance isn’t bad because of Looker, of course, but because of the underlying engine: Trino (formerly PrestoSQL) over Parquet files in S3. Simple fetch queries take 10–12 seconds, and more complicated ones take over 30s, so we realized we should find a different solution.

We considered using Snowflake, as it’s already being used in the company by another department…


A human-readable stock analysis, from a rational perspective

Recently I noticed how much I love Spotify’s product, and every month when I see the monthly charge of 19.90 NIS (the Israeli currency), I think to myself: “that’s fucking worth it”. I wonder how many of you who have Spotify (Premium) agree with me. I don’t think I’ll ever cancel my subscription — and when I realized that’s actually what I think, I figured I should check if the company itself is an investment opportunity or not.

Because Spotify is a company that’s built on subscriptions, the first thing I checked is the churn rate of their Premium users…


Q&A session on specific issues that bothered us

Yesterday I had a 90-minute e-meeting with Greg Rahn, a product manager on the team at Cloudera that contributes to Apache Impala. I want to thank him here for his time; it’s truly awesome to be the user of a product with a PM like Greg.

I came to our conversation prepared with a bunch of questions, and we discussed each and every one of them in detail.

In this post I’ll write a detailed summary of each question and answer. Let’s start.

  1. Will it be possible in the future to upgrade Impala separately from the whole CDH?
    He…


A letter to the developers and product manager of Impala

In this post I’m going to describe the features I reckon are missing in Impala. We take Impala to the edge with over 20,000 queries per day and an average HDFS scan of 9GB per query (1,200 TB scanned per week). That’s why we face some issues that other users don’t, and I’m going to write about some of them here.

1. Metadata Cache TTL

This is a really basic feature I would expect Impala to have by now (Impala 3.0), but it still doesn’t. Every piece of metadata (a.k.a …
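Absent a TTL, a common workaround is to refresh hot tables on a schedule yourself. A minimal sketch, assuming the impyla client; the host, port, and table names are hypothetical:

```python
# A minimal sketch of scheduled metadata refreshes via the impyla client.
# Host, port, and table list are hypothetical; REFRESH and
# INVALIDATE METADATA are standard Impala statements.
from impala.dbapi import connect

HOT_TABLES = ["db.events", "db.enriched_events"]  # hypothetical hot tables

conn = connect(host="impala-coordinator.example.com", port=21050)
cursor = conn.cursor()
for table in HOT_TABLES:
    # REFRESH reloads file and block metadata for a known table; use
    # "INVALIDATE METADATA <table>" when the table definition itself changed.
    cursor.execute(f"REFRESH {table}")
cursor.close()
conn.close()
```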


Why Small, Frequently-Queried Tables Shouldn’t Be Stored in HDFS

In this post I’ll describe a weird problem we had with our Impala service: how we investigated it, how we solved it, and the conclusions we drew from the whole experience. In my opinion this is relevant not only for Impala but for every processing platform that operates over HDFS.

[Image: A really hot spot, not in Hadoop]

Something’s wrong

We have a Kibana dashboard with cool charts we’ve built that show us interesting data on the Impala queries from the last 14 days. Maybe I’ll write a post in the future about how we do BI on our Impala performance with the ELK stack.
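For a sense of what sits behind a chart like that, here’s a minimal sketch of an Elasticsearch aggregation counting queries per user over the last 14 days; the index and field names are hypothetical and depend on how the query logs were indexed, and a recent elasticsearch-py client (keyword-argument style) is assumed:

```python
# A minimal sketch of a "queries per user, last 14 days" aggregation.
# Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="impala-queries",
    query={"range": {"timestamp": {"gte": "now-14d"}}},
    aggs={"per_user": {"terms": {"field": "user.keyword", "size": 20}}},
    size=0,  # only the aggregation buckets are needed, not the raw hits
)
for bucket in resp["aggregations"]["per_user"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```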

One of the charts in the dashboard shows…


How to make your selective queries run 100x faster

In this article I’m going to explain how we solved the problem of selective MPP (Impala, Presto, Drill, etc.) queries over >10TB tables without performing a full table scan.

This article will describe the idea of what we call “partition index” in a very simple way. The detailed architecture and implementation are subjects for a whole new open-source project.
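Still, to give a feel for the idea before the details: a minimal, purely illustrative sketch in Python, where a side mapping from a selective column’s values to the partitions containing them is used to rewrite queries. Every name here is hypothetical, and this is not the project’s actual implementation:

```python
# A minimal, purely illustrative sketch of the partition-index idea.

from typing import Optional

# An out-of-band index, e.g. maintained by a nightly job:
# value of a selective column -> the dt partitions that contain it.
partition_index = {
    "user_42": {"2019041315", "2019041316"},
}

def selective_query(user_id: str) -> Optional[str]:
    """Build a query that scans only the partitions holding this user."""
    partitions = partition_index.get(user_id)
    if not partitions:
        return None  # the value appears nowhere: no scan needed at all
    dt_list = ", ".join(f"'{dt}'" for dt in sorted(partitions))
    # The dt predicate lets the MPP engine prune to a handful of partitions
    # instead of scanning the whole >10TB table.
    return (
        f"SELECT * FROM db.events "
        f"WHERE user_id = '{user_id}' AND dt IN ({dt_list})"
    )

print(selective_query("user_42"))
```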

The Problem

Partitions are a great optimization if we know which columns we’re going to filter by and what kind of questions are going to be asked on that table.

But sometimes we don’t know what are going to be the most…


How did we make our Impala run faster?

So you have your Hadoop cluster, terabytes of data getting into it per day, and ETLs running 24/7 with Spark, Hive, or (god forbid) Pig. Then, after the data is in the exact shape you want it to be (or even before that) and everything is just perfect, analysts want to query it. If you chose Impala for that mission, this article is for you.

Impala In Our Data Lake

We use Impala for a few purposes:

  • Let analysts query new types of data that the data engineers haven’t built any ETLs for yet.
  • Let analysts query data whose final destination…

Adir Mashiach

I like data-backed answers
