Supercharging Apache Superset

How Airbnb customized Superset for business intelligence at scale

Erik Ritter
10 min read · intermediate

Overview

The article discusses how Airbnb has enhanced Apache Superset, an open-source data exploration and visualization platform, to meet its enterprise-level business intelligence needs. It highlights the custom features developed, scaling metrics, and the collaborative efforts within the open-source community to optimize Superset for large-scale data analysis.

What You'll Learn

1. How to implement a cache warmup strategy using Apache Airflow
2. Why domain sharding is essential for improving dashboard performance
3. How to manage database engine load effectively in a BI tool

Prerequisites & Requirements

  • Understanding of business intelligence concepts and data visualization
  • Familiarity with Apache Superset and Apache Airflow (optional)

Key Questions Answered

How does Airbnb scale Apache Superset for enterprise use?
Airbnb scales Apache Superset by implementing custom features such as a cache warmup job, domain sharding, and database engine load management. These enhancements allow for improved performance and support for a large number of users and queries, ensuring that the platform can handle the demands of enterprise-level data analysis.
What performance metrics does Superset handle at Airbnb?
Superset at Airbnb supports approximately 2,000 users, processes around 50,000 SQL Lab queries weekly, and serves roughly 6,000 dashboard views and 125,000 chart views. At this scale it is used for data analysis across the organization, not just by a single team.
What is the role of the Metric Explorer in Superset?
The Metric Explorer is a component of Airbnb's Dataportal that allows teams to explore curated business metrics easily. It leverages the Minerva framework for data consistency and provides a user-friendly interface for accessing key metrics without reimplementing Superset's dashboarding features.
How does Airbnb integrate data access policies in Superset?
Airbnb customizes Superset's Security Manager to enforce data access policies at the table or metric level. This integration allows users to request access to data directly within Superset, streamlining the process and enhancing user experience while maintaining security compliance.
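
The table-level check and the in-product access request described above can be sketched in plain Python. This is a minimal illustration, not Airbnb's implementation: the `TableAccessPolicy` class, its methods, and the user and table names are all hypothetical. In a real deployment, logic like this would typically live in a subclass of Superset's `SupersetSecurityManager` registered through the `CUSTOM_SECURITY_MANAGER` setting; the sketch deliberately avoids importing Superset so it runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class TableAccessPolicy:
    """Hypothetical in-memory policy store: table name -> users allowed to query it."""

    grants: dict[str, set[str]] = field(default_factory=dict)
    pending_requests: list[tuple[str, str]] = field(default_factory=list)

    def can_access(self, user: str, table: str) -> bool:
        # Table-level check: the user must be explicitly granted the table.
        return user in self.grants.get(table, set())

    def request_access(self, user: str, table: str) -> None:
        # Record each (user, table) request once; a real system would
        # also notify the data owner or open an approval ticket.
        if (user, table) not in self.pending_requests:
            self.pending_requests.append((user, table))

    def check_or_request(self, user: str, table: str) -> str:
        # Allow the query, or file an access request instead of just failing.
        if self.can_access(user, table):
            return "allowed"
        self.request_access(user, table)
        return "access requested"


policy = TableAccessPolicy(grants={"bookings": {"ana"}})
print(policy.check_or_request("ana", "bookings"))
print(policy.check_or_request("ben", "bookings"))
```

Returning a distinct "access requested" outcome, rather than only denying the query, mirrors the idea of letting users ask for access without leaving the tool.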

Key Statistics & Figures

Weekly SQL Lab queries
50,000
This figure represents the volume of SQL queries processed by Superset at Airbnb each week.
Dashboard and chart views
6,000 dashboard views and 125,000 chart views
These two figures are the dashboard views and chart views that Superset serves, respectively.
Number of users
2,000
This number reflects the active user base utilizing Superset for data analysis at Airbnb.
Tables and virtual datasets
100,000
Superset at Airbnb supports over 100,000 tables and virtual datasets, enabling extensive data exploration.
Charts and dashboards
200,000 charts and 14,000 dashboards
This statistic highlights the scale of visualizations created and maintained within Superset.

Technologies & Tools

Frontend
Apache Superset
Used as the primary BI tool for data exploration and visualization.
Backend
Apache Airflow
Implemented for managing the cache warmup job to improve performance.
Backend
Redis
Used to cache chart requests, which reduces load times.
Backend
Apache Druid
Used as a data source for Superset, supporting complex queries and analytics.
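
To make the interplay between the warmup job and the cache concrete, here is a minimal sketch of the core of a cache warmup. It rests on several assumptions: a plain dictionary stands in for Redis, the most-viewed dashboards are chosen by a simple view count, and the function names, cache key format, and `run_chart_query` callback are hypothetical. In practice a function like `warm_cache` would be invoked from a scheduled Apache Airflow task rather than called directly.

```python
from typing import Callable


def pick_dashboards_to_warm(view_counts: dict[int, int], top_n: int) -> list[int]:
    """Return the IDs of the top_n most-viewed dashboards, most viewed first."""
    # Sort by descending view count, breaking ties by dashboard ID.
    ranked = sorted(view_counts.items(), key=lambda item: (-item[1], item[0]))
    return [dashboard_id for dashboard_id, _ in ranked[:top_n]]


def warm_cache(
    dashboard_ids: list[int],
    charts_by_dashboard: dict[int, list[int]],
    run_chart_query: Callable[[int], object],
    cache: dict[str, object],
) -> int:
    """Run each chart's query once and store the result; return charts warmed."""
    seen: set[int] = set()
    for dashboard_id in dashboard_ids:
        for chart_id in charts_by_dashboard.get(dashboard_id, []):
            if chart_id in seen:
                continue  # a chart shared by several dashboards is warmed once
            seen.add(chart_id)
            # Overwrite any existing entry so the cache reflects fresh data.
            cache[f"chart:{chart_id}"] = run_chart_query(chart_id)
    return len(seen)


view_counts = {1: 50, 2: 900, 3: 300}
charts = {2: [10, 11], 3: [11, 12]}
cache: dict[str, object] = {}
to_warm = pick_dashboards_to_warm(view_counts, top_n=2)
warmed = warm_cache(to_warm, charts, lambda chart_id: f"result-{chart_id}", cache)
print(to_warm, warmed, sorted(cache))
```

Overwriting existing entries is a design choice here: a warmup that skipped cached charts would leave stale results in place after the underlying data has been refreshed.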

Key Actionable Insights

1. Implement a cache warmup job to enhance performance in data visualization tools.
By preloading frequently accessed dashboards during off-peak hours, you can significantly reduce load times for users, leading to a smoother experience and increased productivity.
2. Utilize domain sharding to manage concurrent requests effectively.
By routing requests through multiple subdomains, you can work around the browser's per-domain limit on concurrent connections, allowing complex dashboards with many charts to load faster.
3. Establish a robust database engine load management strategy.
By routing queries based on importance and limiting user concurrency, you can prevent database overload and ensure that critical queries are prioritized, maintaining system stability.
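
The domain sharding idea in the second insight can be sketched as a deterministic mapping from chart to subdomain. Browsers commonly cap concurrent HTTP/1.1 connections per domain at around six, so spreading chart requests over several subdomains lets more of them run in parallel. The function names, the example domains, and the URL path below are hypothetical; Superset has shipped a `SUPERSET_WEBSERVER_DOMAINS` setting for this purpose, but the sketch does not depend on it.

```python
def shard_domain_for_chart(chart_id: int, domains: list[str]) -> str:
    """Pick a domain for a chart; the same chart always maps to the same domain."""
    if not domains:
        raise ValueError("at least one domain is required")
    # Modulo keeps the assignment stable, so browser caches stay effective.
    return domains[chart_id % len(domains)]


def chart_data_url(chart_id: int, domains: list[str]) -> str:
    """Build a chart data URL on the chart's shard domain (path is illustrative)."""
    return f"https://{shard_domain_for_chart(chart_id, domains)}/chart/{chart_id}/data"


domains = ["superset-1.example.com", "superset-2.example.com", "superset-3.example.com"]
for chart_id in (3, 4, 5):
    print(chart_data_url(chart_id, domains))
```

A stable mapping is preferable to picking a random domain per request, because a chart that moves between domains would miss the browser's cache on each move.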

Common Pitfalls

1. Failing to implement effective caching strategies can lead to slow dashboard load times.
Without a cache warmup job, users may experience delays when accessing frequently viewed dashboards, which can hinder productivity and user satisfaction.
2. Overloading the database engine with too many concurrent queries can degrade performance.
If user concurrency is not managed properly, it can lead to resource contention and slow query execution times, impacting the overall user experience.
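
One way to guard against the second pitfall is to cap how many queries each user may run at once and to route queries by importance. The sketch below is an assumption-laden illustration, not Airbnb's mechanism: the `UserConcurrencyLimiter` class, the `pick_cluster` function, the cluster names, and the rule that dashboard queries outrank ad hoc ones are all hypothetical.

```python
import threading
from collections import defaultdict


class UserConcurrencyLimiter:
    """Cap the number of queries a single user may have running at once."""

    def __init__(self, max_per_user: int) -> None:
        self.max_per_user = max_per_user
        self._running: dict[str, int] = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, user: str) -> bool:
        # Reserve a slot if the user is under the cap; otherwise refuse.
        with self._lock:
            if self._running[user] >= self.max_per_user:
                return False
            self._running[user] += 1
            return True

    def release(self, user: str) -> None:
        # Free a slot when a query finishes; never drop below zero.
        with self._lock:
            if self._running[user] > 0:
                self._running[user] -= 1


def pick_cluster(source: str) -> str:
    """Route by importance (assumed rule): dashboards get the priority cluster."""
    return "priority" if source == "dashboard" else "adhoc"


limiter = UserConcurrencyLimiter(max_per_user=2)
print([limiter.try_acquire("ana") for _ in range(3)])
print(pick_cluster("dashboard"), pick_cluster("sql_lab"))
```

Callers should pair every successful `try_acquire` with a `release` in a `try`/`finally` block so that a failed query does not permanently consume a slot.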

Related Concepts

Data Visualization Best Practices
Business Intelligence Tools Comparison
Scaling Open-source Software For Enterprise Use