So recently my team and I started writing down lessons and conclusions from every issue we had with Spark. In this post I’m going to give you 5 tips that are quite specific, but if you run into one of the issues they cover, knowing the relevant tip can solve the problem. I hope you’ll find them valuable.
When reading a Hive table made of Parquet files, you should be aware that Spark handles the table’s schema in a unique way. As you may know, the Parquet format stores the table schema in its footer. …
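As a quick illustration of that footer layout (a minimal stdlib-only sketch of the file format itself, not of how Spark reads it): a Parquet file ends with the serialized file metadata, then a 4-byte little-endian length of that metadata, then the magic bytes `PAR1`. The function name and the fabricated example bytes below are mine.

```python
import struct

def parquet_footer_length(tail: bytes) -> int:
    """Parse the last 8 bytes of a Parquet file: a 4-byte little-endian
    metadata (footer) length followed by the magic bytes b'PAR1'."""
    if tail[-4:] != b"PAR1":
        raise ValueError("missing PAR1 magic: not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    return footer_len

# Fabricated example tail: a 123-byte footer followed by the magic.
tail = struct.pack("<I", 123) + b"PAR1"
print(parquet_footer_length(tail))  # → 123
```

This is why engines can discover a Parquet file’s schema by reading only the end of the file, without scanning the data pages.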
In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files, and describe my solution in detail.
As you may know by now (if you’ve read my previous posts), in my organization — we keep a lot of our data in HDFS. Most of it is raw data, but a significant amount is the final product of many data-enrichment processes. In order to manage all the data pipelines conveniently, the default partitioning method of all the Hive tables is hourly DateTime partitioning (for example: dt=’2019041316’).
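To make that partitioning scheme concrete, here’s a small sketch (plain Python; the function name is mine) that generates the hourly partition keys in the dt='YYYYMMDDHH' format above for a range of hours:

```python
from datetime import datetime, timedelta

def hourly_partitions(start: datetime, end: datetime):
    """Yield Hive partition keys like dt='2019041316', one per hour."""
    cur = start
    while cur <= end:
        yield cur.strftime("dt='%Y%m%d%H'")
        cur += timedelta(hours=1)

parts = list(hourly_partitions(datetime(2019, 4, 13, 15),
                               datetime(2019, 4, 13, 17)))
print(parts)  # → ["dt='2019041315'", "dt='2019041316'", "dt='2019041317'"]
```

One consequence of this layout is easy to see: a single table accumulates 24 partitions per day, so a year of data is roughly 8,760 partitions — which is exactly how the “lots of small partitions and files” problem creeps in.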
My personal opinion about the…
In this post, I’ll explain how we used the ELK stack (actually just Elasticsearch and Kibana) to analyze our users’ activity on the Hadoop cluster. As you might know if you’ve read my previous posts, our SQL-over-Hadoop solution is Apache Impala. But this post isn’t about Impala; it’s relevant to a lot of technologies, even outside the Hadoop ecosystem.
Sometimes, especially in big organizations, there can be many users who consume data from your lake. “Many” can range from dozens to over a hundred, or, like in our case, thousands. Of course it all depends on…
Recently, we started using Looker for ad-hoc queries by business users across the company (PMs, managers, marketing, and basically everyone). One major problem we have, though, is that our Looker performance is pretty bad: queries just take too long.
The performance isn’t bad because of Looker, of course, but because of the underlying engine: Trino (formerly PrestoSQL) over Parquet files in S3. Simple fetch queries take 10–12 seconds, and more complicated ones take over 30 seconds, so we realized we should find a different solution.
Recently I noticed how much I love Spotify’s product, and every month when I see the 19.90 NIS charge (NIS is the Israeli currency), I think to myself: “that’s fucking worth it”. I wonder how many of you who have Spotify (Premium) agree with me. I don’t think I’ll ever cancel my subscription — and when I realized that’s actually what I think, I figured I should check whether the company itself is an investment opportunity.
Because Spotify is a company that’s built on subscriptions, the first thing I checked is the churn rate of their Premium users…
Yesterday I had a 90-minute e-meeting with Greg Rahn, a product manager on the team at Cloudera that contributes to Apache Impala. I want to thank him here for his time; it’s truly awesome to be the user of a product with a PM like Greg.
I came to our conversation prepared with a bunch of questions, and we discussed each and every one of them in detail.
In this post I’ll write a detailed summary of each question and answer — let’s start.
In this post I’m going to describe the features I think are missing in Impala. We take Impala to the edge with over 20,000 queries per day and an average HDFS scan of 9GB per query (about 1,200 TB scanned per week). That’s why we face some issues that other users don’t, and I’m going to write about some of them here.
This is a really basic feature I would expect Impala to have by now (Impala 3.0), but they still don’t have it. Every piece of metadata (a.k.a …
In this post I’ll describe a weird problem we had with our Impala service: how we investigated and solved it, and the conclusions we drew from the whole experience. In my opinion this is relevant not only to Impala but to every processing platform that operates over HDFS.
We have a Kibana dashboard with cool charts we’ve built that show us interesting data on the Impala queries from the last 14 days. Maybe I’ll write a post in the future about how we do BI on our Impala performance with the ELK stack.
One of the charts in the dashboard shows…
In this article I’m going to explain how we solved the problem of selective MPP (Impala, Presto, Drill, etc.) queries over >10TB tables without performing a full table scan.
This article will describe the idea of what we call “partition index” in a very simple way. The detailed architecture and implementation are subjects for a whole new open-source project.
Partitions are a great optimization if we know which columns we’re going to filter by and what kind of questions are going to be asked on that table.
But sometimes we don’t know what are going to be the most…
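A minimal sketch of the “partition index” idea (my own illustrative names and in-memory structures, not the actual architecture, which the article describes separately): maintain a side mapping from values of a selective, high-cardinality column to the partitions that contain them, then consult it before the query so the engine scans only those partitions instead of the whole table.

```python
from collections import defaultdict

# Hypothetical partition index: value of a selective column (e.g. user_id)
# -> set of hourly partitions that contain rows with that value.
_index = defaultdict(set)

def record(user_id: str, partition: str) -> None:
    """Called while ingesting/enriching data: remember where each value landed."""
    _index[user_id].add(partition)

def partitions_for(user_id: str) -> list:
    """Called before a selective query: the only partitions worth scanning."""
    return sorted(_index[user_id])

# Ingest-time bookkeeping (fabricated data):
record("u1", "dt='2019041315'")
record("u1", "dt='2019041317'")
record("u2", "dt='2019041316'")

# Query planning for "WHERE user_id = 'u1'": prune to 2 partitions, not all.
print(partitions_for("u1"))  # → ["dt='2019041315'", "dt='2019041317'"]
```

The design trade-off is that the index must be maintained on every write, but for selective queries it turns a full table scan into a scan of a handful of partitions.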
So you have your Hadoop, terabytes of data are getting into it per day, ETLs are done 24/7 with Spark, Hive or god forbid — Pig. And then after the data is in the exact shape you want it to be (or even before that) and everything is just perfect — analysts want to query it. If you chose Impala for that mission, this article is for you.
We use Impala for a few purposes:
I like data-backed answers