Realtime and batch analytics at Airbnb and the role Druid plays in our analytics system architecture
Overview
This article describes how Airbnb uses Druid, a big data analytics engine, for both real-time and batch analytics. It covers Druid's architecture, its advantages, and how it is deployed within Airbnb's data infrastructure.
What You'll Learn
1. How to leverage Druid for fast analytics queries
2. Why Druid's architecture is beneficial for scalability and reliability
3. How to implement real-time data ingestion using Spark Streaming and Tranquility
4. When to use Druid for ad-hoc analytics queries
Prerequisites & Requirements
- Understanding of data analytics and big data concepts
- Familiarity with Hadoop, Kafka, and Spark Streaming (optional)
Key Questions Answered
How does Druid improve query performance compared to Hive and Presto?
Druid offers sub-second query latency by serving queries from predefined data sources with pre-computed aggregations. Hive and Presto compute aggregations on demand at query time, which can make them an order of magnitude slower.
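To make the idea of pre-computed aggregations concrete, here is a minimal sketch in plain Python. It is a toy model, not Druid's implementation: the `rollup` function and the `market` and `bookings` field names are hypothetical, chosen only for illustration. It shows raw events being collapsed into one row per hour and dimension value at ingestion time, so that a later query reads a few pre-aggregated rows instead of scanning every event.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions, metric):
    """Pre-aggregate raw events by (hour, dimensions), summing one metric.

    Toy illustration only; the function name and arguments are assumptions.
    """
    totals = defaultdict(int)
    for event in events:
        # Truncate the timestamp to the hour, the assumed query granularity.
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[d] for d in dimensions)
        totals[key] += event[metric]
    return dict(totals)


# Hypothetical sample events; field names are not from the article.
events = [
    {"timestamp": datetime(2018, 11, 1, 10, 5), "market": "paris", "bookings": 2},
    {"timestamp": datetime(2018, 11, 1, 10, 40), "market": "paris", "bookings": 3},
    {"timestamp": datetime(2018, 11, 1, 11, 15), "market": "tokyo", "bookings": 1},
]

rolled = rollup(events, ["market"], "bookings")
for key, total in sorted(rolled.items()):
    print(key, total)
```

The trade-off in this kind of design is that the aggregation granularity and dimensions are fixed up front, which is why it suits predefined data sources better than arbitrary exploration of raw data.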
What architecture does Druid use to ensure reliability and scalability?
Druid's architecture separates components for ingestion, serving, and coordination, which enhances reliability and allows for easy scaling. This includes deep storage for long-term data and caching in historical nodes, facilitating disaster recovery and hardware upgrades.
What is the dual cluster configuration used by Airbnb for Druid?
Airbnb operates two Druid clusters: one for centralized critical metrics services and another for real-time data ingestion. This setup allows for dedicated support for different use cases while optimizing resource management and performance.
How does Airbnb handle backfill performance in Druid?
Airbnb has implemented a solution that keeps newly ingested segments inactive until explicitly activated, allowing for parallel ingestion of smaller intervals. This approach improves backfill performance significantly, reducing completion times from over a day to just one hour.
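The backfill approach can be sketched roughly as follows, under stated assumptions. This example does not call any Druid API: `split_interval` and `BackfillTracker` are hypothetical helpers that only model the bookkeeping, namely splitting a long backfill range into small intervals that can be ingested in parallel, and holding activation until every interval has finished.

```python
from datetime import date, timedelta


def split_interval(start, end, days_per_chunk=1):
    """Split [start, end) into consecutive chunks of at most days_per_chunk days."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    chunks = []
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=days_per_chunk), end)
        chunks.append((cursor, upper))
        cursor = upper
    return chunks


class BackfillTracker:
    """Hypothetical bookkeeping: new segments stay inactive until all chunks finish."""

    def __init__(self, chunks):
        self.pending = set(chunks)
        self.active = False

    def mark_done(self, chunk):
        # Record that one chunk's ingestion task completed.
        self.pending.discard(chunk)

    def try_activate(self):
        # Activate only when nothing is pending, so readers never see a
        # mix of old and new data for the backfilled range.
        if not self.pending:
            self.active = True
        return self.active


chunks = split_interval(date(2018, 1, 1), date(2018, 1, 4))
tracker = BackfillTracker(chunks)
for chunk in chunks:  # in practice these would run in parallel
    tracker.mark_done(chunk)
print(tracker.try_activate())
```

Keeping the activation step separate from ingestion is what allows the small intervals to be processed independently without exposing a partially updated range to queries.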
Key Statistics & Figures
Number of Druid nodes at Airbnb
4 Brokers, 2 Overlords, 2 Coordinators, 8 Middle Managers, and 40 Historical nodes
This configuration supports the dual cluster setup for handling different analytics workloads.
Backfill performance improvement
Reduced from more than a day to one hour
This significant enhancement was achieved by activating newly ingested segments only when ready, preventing inconsistencies during data updates.
Technologies & Tools
- Druid (Backend): Used for real-time and batch analytics at Airbnb.
- Spark Streaming (Backend): Used for real-time data ingestion into Druid.
- Apache Superset (Frontend): Serves as the interface for users to compose and execute analytics queries on Druid.
- Airflow (Tools): Used for scheduling batch jobs that ingest data from Hadoop.
Key Actionable Insights
1. Implement Druid for real-time analytics to enhance decision-making capabilities. Druid's ability to process real-time data quickly allows teams to make informed decisions based on the latest information, which is crucial for dynamic environments like Airbnb.
2. Utilize Druid's architecture for scalable data solutions. By leveraging Druid's componentized architecture, organizations can ensure their analytics systems remain reliable and can scale efficiently as data volume and user demand grow.
3. Adopt a dual cluster configuration for specialized analytics needs. Having separate clusters for different analytics tasks can optimize performance and resource allocation, ensuring that critical metrics services do not interfere with real-time data processing.
Common Pitfalls
1. Failing to manage the number of segment files can lead to ingestion delays. As the number of segments increases, the coordinator may struggle to keep up, causing significant delays in data availability for querying. To avoid this, consider optimizing segment sizes and managing ingestion workflows effectively.
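One way to reason about segment counts is simple arithmetic, sketched below. The helper names and the target segment size are assumptions made for illustration and do not come from the article: the idea is to pick the number of partitions per time interval from the data volume and a chosen target size, then sum across intervals to see how many segment files the coordinator would have to manage.

```python
import math


def partitions_for_interval(interval_bytes, target_segment_bytes):
    """Number of segments needed so each is at most the target size (minimum 1)."""
    if target_segment_bytes <= 0:
        raise ValueError("target_segment_bytes must be positive")
    return max(1, math.ceil(interval_bytes / target_segment_bytes))


def total_segments(interval_sizes, target_segment_bytes):
    """Total segment count across all intervals for a given target size."""
    return sum(
        partitions_for_interval(size, target_segment_bytes)
        for size in interval_sizes
    )


# Hypothetical daily data volumes in megabytes, with an assumed 500 MB target.
daily_sizes_mb = [1200, 100, 2600]
print(total_segments(daily_sizes_mb, 500))
```

A larger target size produces fewer segments for the coordinator to track, at the cost of coarser units of parallelism at query time, so the right value depends on the workload.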
Related Concepts
Big Data Analytics
Data Ingestion Frameworks
Real-time Data Processing
Data Visualization Tools