Presto® on Apache Kafka® At Uber Scale

Yang Yang, Yupeng Fu, Hitarth Trivedi

Uber

•

Yang Yang, Yupeng Fu, Hitarth Trivedi

•10 min read•advanced•

--

•View Original

ApacheApache KafkaElasticsearchMySQLOracleRedisSQL

Overview

The article discusses how Uber integrates Presto® and Apache Kafka® to enhance its big data analytics capabilities. It highlights the architecture, challenges, and improvements made to enable lightweight, interactive SQL queries directly over Kafka at Uber scale.

What You'll Learn

1

How to connect Presto with Kafka for real-time analytics

2

Why dynamic schema discovery is crucial for data analytics

3

When to use Presto for querying Kafka streams

Prerequisites & Requirements

Understanding of big data concepts and SQL querying
Familiarity with Presto and Apache Kafka(optional)

Key Questions Answered

How does Uber utilize Presto and Kafka for data analytics?

Uber uses Presto to query various data sources, including Kafka, enabling interactive and near-real-time data analysis. With around 7,000 active users running 500,000 queries daily, Presto plays a critical role in their big data stack, allowing for efficient data-driven decision-making.

What challenges does Uber face with the Presto-Kafka integration?

Uber encounters several challenges, including static Kafka topic discovery, data schema retrieval, and the need for query restrictions to prevent excessive data consumption. These issues necessitate improvements to the Presto-Kafka connector to meet Uber's scalability requirements.

What improvements were made to the Presto Kafka connector?

Improvements include enabling on-demand cluster and topic discovery, implementing query filters to enforce data retrieval constraints, and establishing quota control to manage Kafka cluster consumption rates. These enhancements aim to optimize performance and reliability.

Key Statistics & Figures

Active Presto users

7,000

These users run approximately 500,000 queries daily, highlighting the scale at which Presto operates within Uber.

Data read from HDFS

50 PB

This volume of data is accessed daily through Presto queries, showcasing its capacity to handle large-scale data analytics.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

Presto

Used for querying various data sources including Kafka.

Backend

Apache Kafka

Serves as the backbone for data streaming at Uber.

Database

Apache Hive

One of the data sources queried by Presto.

Database

Apache Pinot

Used for real-time analytics applications at Uber.

Key Actionable Insights

1
Implement dynamic schema discovery for your Kafka integration to enhance flexibility.
Dynamic schema discovery allows for real-time updates and reduces the need for manual intervention when new topics are created, making your data analytics more agile.

2
Utilize query filters to limit data consumption in your analytics queries.
By enforcing query filters, you can prevent large queries from overwhelming your Kafka cluster, ensuring better performance and user experience.

3
Consider using Presto for ad-hoc querying over Kafka streams.
Presto's ability to perform interactive SQL queries over Kafka allows for quick insights, which is essential for operational teams needing to troubleshoot issues rapidly.

Common Pitfalls

1

Failing to implement query filters can lead to performance degradation.

Without proper filters, large queries may consume excessive resources, negatively impacting the overall health of the Kafka cluster and user experience.

2

Static topic discovery can hinder the agility of data analytics.

If new Kafka topics require a connector restart for discovery, it can slow down the analytics process and reduce responsiveness to business needs.

Related Concepts

Big Data Analytics

Real-time Data Processing

SQL Querying Techniques

Data Integration Strategies