Overview
The article discusses Uber's experience operating Apache Pinot at scale, detailing its role in enabling real-time analytics across various use cases. It highlights the architecture, ingestion processes, and the lessons learned from scaling Pinot to handle terabyte-scale data with millisecond latencies.
What You'll Learn
1
How to implement real-time analytics using Apache Pinot
2
Why multi-tenancy is crucial for scaling data platforms
3
How to optimize data ingestion processes for low latency
Prerequisites & Requirements
- Understanding of OLAP systems and real-time data processing
- Familiarity with Apache Kafka and HDFS(optional)
Key Questions Answered
How does Uber scale Apache Pinot for real-time analytics?
Uber scales Apache Pinot by deploying a multi-cluster architecture that supports hundreds of use cases for querying terabyte-scale data with millisecond latencies. The platform evolved from a small ten-node cluster to hundreds of nodes, managing tens of terabytes of data and achieving thousands of queries per second in production.
What are the main use cases for Apache Pinot at Uber?
The main use cases for Apache Pinot at Uber include building customized dashboards for products like Uber Eats, executing analytical queries in backend services, and enabling near-real-time exploration of data for operational teams. These use cases help in monitoring demand-supply metrics and identifying anomalies.
What contributions has Uber made to Apache Pinot?
Uber has made several contributions to Apache Pinot, including enhancing self-service onboarding processes, integrating full SQL support through Presto, and improving the deep store functionality for segment management. These contributions aim to increase reliability and query flexibility.
What are common pitfalls when using Apache Pinot at scale?
Common pitfalls include memory overhead due to large scans and excessive segments per server, which can lead to garbage collection pauses and performance degradation. Uber addresses these issues by isolating ad hoc querying use cases to separate tenants and implementing message throttling.
Key Statistics & Figures
Total data footprint managed by Pinot
tens of terabytes
This reflects the scale at which Uber operates Pinot, having increased from tens of gigabytes in its early days.
Queries per second (QPS) in production
thousands of QPS
This demonstrates the performance capabilities of the Pinot platform as it scales to meet Uber's operational demands.
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Database
Apache Pinot
Used for low-latency analytical queries on large datasets.
Streaming
Apache Kafka
Serves as the data source for real-time ingestion into Pinot.
Storage
Hdfs
Used as a deep store for archiving Pinot segments.
Query Engine
Presto
Integrated with Pinot to enable full SQL support for querying data.
Key Actionable Insights
1Implement multi-tenancy in your data architecture to isolate workloads and improve performance.By logically grouping tables under tenant names, you can mitigate the impact of resource-heavy queries on other workloads, ensuring better service level agreements (SLAs) across different applications.
2Utilize Apache Kafka for real-time data ingestion to enhance the freshness of analytics.Real-time ingestion from Kafka allows for immediate data availability, which is crucial for applications that require up-to-date insights, such as monitoring user demand and operational metrics.
3Regularly monitor and optimize segment management to prevent performance bottlenecks.As data scales, managing the number of segments per server becomes critical. Implementing strategies like message throttling can help maintain performance and avoid crashes during state transitions.
Common Pitfalls
1
Overloading Pinot servers with large scans can lead to performance issues.
When users execute broad queries without time predicates, it can cause excessive memory usage and garbage collection pauses, impacting overall system performance.
2
Too many segments per server can overwhelm the Pinot cluster.
Having an excessive number of segments can lead to spikes in state transition messages, potentially crashing servers and controllers. Implementing message throttling can help mitigate this risk.
Related Concepts
Real-time Analytics
Olap Systems
Data Ingestion Strategies
Multi-tenancy In Databases