Adir Mashiach

280 Followers

Pinned

Apache Spark: 5 Performance Optimization Tips

Interesting and specific lessons learned from experience — Recently my team and I started writing down lessons and conclusions from every issue we had with Spark. In this post I’m going to give you 5 interesting tips that are quite specific; you may run into issues where knowing them solves the problem. …

Yarn

5 min read


Pinned

Partition Management in Hadoop

Our solution to the Hadoop small files problem — In this post I’ll talk about the problem of Hive tables with a lot of small partitions and files and describe my solution in detail. A little background: As you may know by now (if you’ve read my previous posts), in my organization we keep a lot of our data in HDFS…

Big Data

9 min read


Pinned

Defend Your Infrastructure — Handling 3,000 Hungry Users

Why is it so important to track your users’ queries, and how do we do it? — In this post, I’ll explain how we used the ELK stack (actually just Elastic and Kibana) to analyze the usage of our users on top of the Hadoop cluster. As you might know if you’ve read my previous posts, our SQL-over-Hadoop solution is Apache Impala. …

Big Data

5 min read


May 23, 2021

Firebolt — the new kid on the (data warehousing) block

A short description of Firebolt, for not-so-technical people — Recently, we started using Looker for ad-hoc queries of business users in the company (PMs, managers, marketing, and basically everyone). One main problem we have though, is that our Looker performance is pretty bad — queries just take too long. The performance isn’t bad because of Looker of course, but…

Firebolt

3 min read


Dec 26, 2020

SPOT: Is Spotify a good stock to buy?

A human-readable stock analysis, from a rational perspective — Recently I noticed how much I love Spotify’s product, and every month when I see the monthly charge of 19.90 NIS (the Israeli currency), I think to myself: “that’s fucking worth it”. I wonder how many of you who have Spotify (Premium) agree with me. …

Spotify

6 min read


Aug 31, 2018

Impala Discussion With The Product Manager (Greg Rahn)

Q&A session on specific issues that bothered us — Yesterday I had a 90-minute e-meeting with Greg Rahn, a product manager on the team at Cloudera that contributes to Apache Impala. I want to thank him here for his time; it’s truly awesome to be the user of a product with a PM like Greg. To our conversation…

Presto

5 min read


Aug 15, 2018

5 Main Missing Features in Impala (Opinion)

A letter to the developers and product manager of Impala — In this post I’m going to describe the features I reckon are missing in Impala. We take Impala to the edge with over 20,000 queries per day and an average HDFS scan of 9GB per query (1,200 TB scanned/week). …

Big Data

5 min read


Apr 13, 2018

Hotspotting In Hadoop — Impala Case Study

Why Small Frequently-Queried Tables Shouldn’t Be Stored In HDFS? — In this post I’ll describe a weird problem we had with our Impala service, how we investigated and solved it, and the conclusions from the whole experience. In my opinion this is relevant not only for Impala but for every processing platform that operates over HDFS. Something’s wrong: We have a…

Big Data

4 min read


Apr 1, 2018

Partition Index - Selective Queries On Really Big Tables

How to make your selective queries run 100x faster? — In this article I’m going to explain how we solved the problem of selective MPP (Impala, Presto, Drill, etc.) queries over >10TB tables without performing a full table scan. This article will describe the idea of what we call “partition index” in a very simple way. The detailed architecture…

Impala

4 min read


Mar 20, 2018

Apache Impala: My Insights and Best Practices

How did we make our Impala run faster? — So you have your Hadoop, terabytes of data are getting into it per day, ETLs are done 24/7 with Spark, Hive or god forbid — Pig. And then after the data is in the exact shape you want it to be (or even before that) and everything is just perfect…

Big Data

7 min read

I like data-backed answers

Following
  • Felipe Hoffa
  • Eran Elbaz
  • Tomer Garber
  • Tech at King
  • Yaniv Harpaz
