Defend Your Infrastructure — Handling 3,000 Hungry Users
Why is it so important to track your users’ queries, and how do we do it?
In this post, I’ll explain how we used the ELK stack (actually just Elastic and Kibana) to analyze how our users use our Hadoop cluster. As you might know if you’ve read my previous posts, our SQL-over-Hadoop solution is Apache Impala. But this post isn’t about Impala; it’s relevant to many technologies, even outside the Hadoop ecosystem.
The Importance Of Monitoring Your Users
Sometimes, especially in big organizations, there can be many users who consume data from your lake. “Many” can range from dozens to over a hundred, or, like in our case, thousands. Of course, it all depends on your definition of a “user”. I define a user as anyone who asks a business question that eventually translates to an SQL query (or queries) over my cluster. Of course there are different kinds of users, and we divide them into four levels:
- Data Scientists (X)
- Well-Trained Analysts (10X)
- Kinda-know-what-they’re-doing users (100X)
- Anyone who wants to ask a business question through our varied BI systems (1,000X)
The X expressions indicate the relative scale of each level. So you can guess that thousands of users (in our case, about 3,000) can cause a good deal of trouble in a cluster.
A little bit about our users and BI systems
We have a few BI systems that users can use in order to ask all sorts of questions. We also have many different stores for different kinds of questions, and one of them is our Hadoop cluster. As I said, this post will focus on the Impala use-case, but it’s just as relevant when it comes to other technologies like PostgreSQL, Elastic, Solr, Amazon Athena, Google BigQuery and many more.
Our main BI system is amazing, user-friendly, and actually deserves a post of its own. It lets the simple users (levels 3-4) ask any question they want over our different stores, and reach incredible business insights and conclusions. It also has the common and necessary functionality of scheduling reports.
Scheduling — The Fourth Unforgivable Curse
As J. K. Rowling taught us, there are three Unforgivable Curses, but anyone who has been in charge of a cluster or a database knows there is one more: scheduling. When users start to schedule their reports, or learn Python and start writing scripts with SQL inside them instead of running direct SQL queries, you should expect some trouble.
Of course I’m joking; scheduling is not a bad thing. On the contrary, it’s vital to the users’ efficient workflow. But you should be ready to face the problems that come with this functionality.
The $625,000 Damage of Abusive Users
So the whole reason I’m writing this post is an outrageous discovery I made last week. We’d been experiencing bad performance in Impala for the past 3 months. I thought it was because of the many concurrent, relatively heavy queries running in the cluster. The performance was bad but tolerable, which is why we didn’t invest much time in a profound investigation of the issue. Last week, I decided to start investigating the situation.
Long story short, I found the stupid reason behind the bad performance of the last 3 months: one user who created one scheduled report that generated thousands of Impala queries per hour. The overall data scanned by this scheduled report was over 125 petabytes in the last 3 months. The cost of such usage in Amazon Athena or Google BigQuery is about $625,000. For a single report of a single user. I’m glad we have an on-premise cluster.
I couldn’t have found it if we hadn’t been indexing all our queries to Elastic, with a decent Kibana dashboard on top. But even with that index and dashboard, it took a great effort to get to the bottom of the problem: that specific scheduled report.
How to Analyze Your Users’ Queries
Now that we understand the motivation behind tracking and measuring your users’ queries, let’s dig into how we’ve done it and what I learned from last week’s events.
Create an Elastic Index of the Queries
Well, that’s probably the smartest thing we’ve done while administering our Impala cluster. We built a simple script that queries the CM (Cloudera Manager) API, takes the new Impala queries from the last 15 minutes, and puts them in an Elastic index. CM exposes a lot of interesting attributes of each query, such as hdfs_bytes_read, thread_cpu_time, memory_aggregate_peak, user, request_pool, etc.
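A minimal sketch of such a script, using only the Python standard library. The hosts, API version, and cluster/service names are placeholders you’d adapt to your environment, and the exact shape of the CM response can differ between versions:

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

CM_URL = "http://cm-host:7180/api/v19"   # hypothetical CM host and API version
CLUSTER, SERVICE = "cluster", "impala"   # adjust to your cluster/service names
ES_URL = "http://elastic-host:9200"      # hypothetical Elasticsearch host

def fetch_recent_queries(minutes=15):
    """Pull Impala queries from the CM API for the last N minutes."""
    now = datetime.now(timezone.utc)
    params = urllib.parse.urlencode({
        "from": (now - timedelta(minutes=minutes)).isoformat(),
        "to": now.isoformat(),
    })
    url = f"{CM_URL}/clusters/{CLUSTER}/services/{SERVICE}/impalaQueries?{params}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["queries"]

def to_es_doc(q):
    """Flatten a CM query record into a document for the Elastic index."""
    attrs = q.get("attributes", {})
    return {
        "query_id": q.get("queryId"),
        "user": q.get("user"),
        "statement": q.get("statement"),
        "hdfs_bytes_read": int(attrs.get("hdfs_bytes_read", 0)),
        "thread_cpu_time": int(attrs.get("thread_cpu_time", 0)),
        "memory_aggregate_peak": int(attrs.get("memory_aggregate_peak", 0)),
        "request_pool": attrs.get("pool"),
    }

def index_docs(docs, index="impala-queries"):
    """Send the documents to Elasticsearch via the _bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    body = ("\n".join(lines) + "\n").encode()
    req = urllib.request.Request(
        f"{ES_URL}/_bulk", data=body,
        headers={"Content-Type": "application/x-ndjson"})
    urllib.request.urlopen(req)

# Intended to run as a cron job every 15 minutes:
#   index_docs([to_es_doc(q) for q in fetch_recent_queries()])
```

We run it on a schedule so the index lags live queries by at most the polling interval.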
Another thing I recommend adding to your queries-to-Elastic script is a little logic that analyzes the query and extracts interesting stuff from the SQL statement: table/view names, the number of “GROUP BY”s/“ORDER BY”s, etc.
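That extraction logic can start out as a few regular expressions. A naive sketch; a real SQL parser would handle comments, subqueries, and CTEs far better:

```python
import re

def extract_sql_features(statement):
    """Pull basic features out of a raw SQL statement (naive, regex-based)."""
    sql = statement.lower()
    # FROM/JOIN clauses give a rough list of referenced tables and views
    tables = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql)
    return {
        "tables": sorted(set(tables)),
        "group_by_count": len(re.findall(r"\bgroup\s+by\b", sql)),
        "order_by_count": len(re.findall(r"\border\s+by\b", sql)),
        "has_limit": bool(re.search(r"\blimit\s+\d+", sql)),
    }
```

These features go into the same Elastic document as the CM attributes, so you can aggregate on them later.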
Now that we have such an index, we can ask many interesting questions about the usage of our infrastructure. For example, we can recognize which users are responsible for major parts of the workload, or tell which tables are the most popular. And much more.
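For instance, the “which users drive the workload” question maps to a single terms aggregation over the index. A sketch assuming the field names above; the host and index name are placeholders:

```python
import json
import urllib.request

ES_URL = "http://elastic-host:9200"   # hypothetical Elasticsearch host

def top_users_by_bytes_read(size=10):
    """Build an aggregation ranking users by total HDFS bytes read."""
    return {
        "size": 0,  # we only care about the aggregation buckets, not hits
        "aggs": {
            "by_user": {
                "terms": {"field": "user", "size": size,
                          "order": {"total_bytes": "desc"}},
                "aggs": {"total_bytes": {"sum": {"field": "hdfs_bytes_read"}}},
            }
        },
    }

def run_search(body, index="impala-queries"):
    """POST a search body to Elasticsearch and return the parsed response."""
    req = urllib.request.Request(
        f"{ES_URL}/{index}/_search", data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Swap the sum field for thread_cpu_time or memory_aggregate_peak and you get a different view of the same workload question.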
Create a Kibana Dashboard
On top of that index, we’ve created a Kibana dashboard with many interesting visualizations: query counts, failed vs. successful queries, a table of exceptions ordered by frequency, the 75/90/95 percentiles of query duration, etc.
That dashboard gives us a reliable snapshot of our live Impala performance. That’s a course of action I recommend applying to every SQL engine you’re using: Impala, Presto, Drill, and cloud-based solutions like BigQuery or Athena.
You can build all sorts of visualizations on top of that index, and the more attributes you can add to a query — the better.
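As an example, the duration-percentiles panel boils down to a single percentiles aggregation over the index. The query_duration_ms field name here is an assumption; use whatever duration attribute you index:

```python
def duration_percentiles_query():
    """Build the aggregation behind a 75/90/95 query-duration panel."""
    return {
        "size": 0,  # buckets only, no individual hits
        "aggs": {
            "duration_pcts": {
                "percentiles": {"field": "query_duration_ms",
                                "percents": [75, 90, 95]}
            }
        },
    }
```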
So, How Did We Miss the Super-Heavy Scheduled Report?
The answer is simple: we didn’t make the connection between the BI system’s logs (user side) and the Impala logs (infrastructure side). That is, we couldn’t see on our side which query belonged to which scheduled report on that system. And since most of those reports require multiple queries (often dozens), we didn’t notice that a single report was causing the heavy workload.
We learned our lesson, and soon enough the developers of the BI system injected the report_id into each query, as a comment at the beginning of the SQL statement. In our queries-to-Elastic script, we added simple logic that parses that comment, and now we have a report_id attribute on every Impala query generated by that system.
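The parsing side is tiny. A sketch assuming the comment looks like /* report_id=... */; the actual comment format in your system may differ:

```python
import re

# Match a leading SQL comment of the (assumed) form: /* report_id=12345 */
REPORT_ID_RE = re.compile(r"^\s*/\*\s*report_id=(\w+)\s*\*/")

def extract_report_id(statement):
    """Return the report_id from the leading SQL comment, or None."""
    match = REPORT_ID_RE.match(statement)
    return match.group(1) if match else None
```

Queries that didn’t come from the BI system simply get no report_id attribute.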
Of course, the next step was adding suitable “Queries per ReportID” and “HDFS Bytes Read per ReportID” visualizations to our dashboard. That way, from now on, we’ll be able to tell when a user creates a heavy scheduled report and stop it in time. I’ll also mention that the developers of that BI system are working on smart ways to prevent such abuse in the first place.
Measure your users, defend your infrastructure. Remember that the more users you have, the higher the chances things will go wrong, and when they do, the last thing you want is to be in the dark. BI is important in every business, even in the business of administrating the infrastructure for your BI analysts.