Overview
The article discusses how Uber integrates Presto® and Apache Kafka® to enhance its big data analytics capabilities. It highlights the architecture, challenges, and improvements made to enable lightweight, interactive SQL queries directly over Kafka at Uber scale.
What You'll Learn
1
How to connect Presto with Kafka for real-time analytics
2
Why dynamic schema discovery is crucial for data analytics
3
When to use Presto for querying Kafka streams
Prerequisites & Requirements
- Understanding of big data concepts and SQL querying
- Familiarity with Presto and Apache Kafka(optional)
Key Questions Answered
How does Uber utilize Presto and Kafka for data analytics?
Uber uses Presto to query various data sources, including Kafka, enabling interactive and near-real-time data analysis. With around 7,000 active users running 500,000 queries daily, Presto plays a critical role in their big data stack, allowing for efficient data-driven decision-making.
What challenges does Uber face with the Presto-Kafka integration?
Uber encounters several challenges, including static Kafka topic discovery, data schema retrieval, and the need for query restrictions to prevent excessive data consumption. These issues necessitate improvements to the Presto-Kafka connector to meet Uber's scalability requirements.
What improvements were made to the Presto Kafka connector?
Improvements include enabling on-demand cluster and topic discovery, implementing query filters to enforce data retrieval constraints, and establishing quota control to manage Kafka cluster consumption rates. These enhancements aim to optimize performance and reliability.
Key Statistics & Figures
Active Presto users
7,000
These users run approximately 500,000 queries daily, highlighting the scale at which Presto operates within Uber.
Data read from HDFS
50 PB
This volume of data is accessed daily through Presto queries, showcasing its capacity to handle large-scale data analytics.
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Backend
Presto
Used for querying various data sources including Kafka.
Backend
Apache Kafka
Serves as the backbone for data streaming at Uber.
Database
Apache Hive
One of the data sources queried by Presto.
Database
Apache Pinot
Used for real-time analytics applications at Uber.
Key Actionable Insights
1Implement dynamic schema discovery for your Kafka integration to enhance flexibility.Dynamic schema discovery allows for real-time updates and reduces the need for manual intervention when new topics are created, making your data analytics more agile.
2Utilize query filters to limit data consumption in your analytics queries.By enforcing query filters, you can prevent large queries from overwhelming your Kafka cluster, ensuring better performance and user experience.
3Consider using Presto for ad-hoc querying over Kafka streams.Presto's ability to perform interactive SQL queries over Kafka allows for quick insights, which is essential for operational teams needing to troubleshoot issues rapidly.
Common Pitfalls
1
Failing to implement query filters can lead to performance degradation.
Without proper filters, large queries may consume excessive resources, negatively impacting the overall health of the Kafka cluster and user experience.
2
Static topic discovery can hinder the agility of data analytics.
If new Kafka topics require a connector restart for discovery, it can slow down the analytics process and reduce responsiveness to business needs.
Related Concepts
Big Data Analytics
Real-time Data Processing
SQL Querying Techniques
Data Integration Strategies