Building a Better Big Data Architecture: Meet Uber’s Presto Team

Wayne Cunningham

Uber

Apache, Elasticsearch, Java, SQL

Overview

The article discusses Uber's adoption of Presto, an open-source SQL query engine, to enhance its big data architecture. It highlights the team's contributions to the Presto community, the engine's capabilities, and its role in improving data accessibility for Uber's operations.

What You'll Learn

1
How to optimize SQL queries for better performance in Presto

2
Why Presto is a suitable choice for querying heterogeneous data sources

3
When to use Presto for real-time analytics in large-scale applications

Prerequisites & Requirements

Understanding of SQL and data analytics concepts
Familiarity with big data technologies like Apache Pinot and Elasticsearch (optional)

Key Questions Answered

What is Presto and how does it benefit Uber's data architecture?

Presto is an open-source SQL query engine that lets Uber run queries efficiently across diverse data sources. Uber's deployment spans over a thousand nodes and handles about 400,000 queries per day, supporting data-driven decision-making and operational improvements.

How does Uber contribute to the Presto community?

Uber engineers actively contribute to Presto by developing database connectors and other enhancements. They have also joined The Presto Foundation as founding members to advance SQL query technology under the Linux Foundation.

What are the advantages of using Presto over other SQL engines?

Presto is designed for high query throughput and processes queries entirely in memory, which makes it faster than many other engines. Its extensibility lets it integrate with various data formats and storage systems, which is crucial for Uber's diverse data needs.

What challenges does Uber face with Presto and how are they addressed?

One challenge Uber faces with Presto is SQL queries that run out of memory. Engineers refactor these queries so that they can process large datasets without exceeding memory limits.
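
One common refactoring pattern for memory-heavy queries is to aggregate before joining, so the join sees fewer rows. The sketch below illustrates that general pattern only; it is not Uber's actual query. It uses Python's built-in sqlite3 module as a stand-in for Presto, and the tables `documents` and `audit_events` are hypothetical. Whether the rewrite helps on a real Presto cluster depends on the data and the query planner.

```python
import sqlite3

# Hypothetical schema: each document belongs to a driver and has many audit events.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, driver_id INTEGER);
    CREATE TABLE audit_events (doc_id INTEGER, result TEXT);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
conn.executemany("INSERT INTO audit_events VALUES (?, ?)", [
    (1, "fail"), (1, "fail"), (1, "pass"),
    (2, "fail"),
    (3, "pass"), (3, "fail"), (3, "fail"), (3, "fail"),
])

# Original shape: join every failing audit row to its document, then aggregate.
JOIN_THEN_AGGREGATE = """
    SELECT d.driver_id, COUNT(*) AS failed
    FROM documents AS d
    JOIN audit_events AS a ON a.doc_id = d.doc_id
    WHERE a.result = 'fail'
    GROUP BY d.driver_id
    ORDER BY d.driver_id
"""

# Refactored shape: collapse audit rows to one row per document first,
# so the join handles the smaller, pre-aggregated input.
AGGREGATE_THEN_JOIN = """
    WITH failed_per_doc AS (
        SELECT doc_id, COUNT(*) AS failed
        FROM audit_events
        WHERE result = 'fail'
        GROUP BY doc_id
    )
    SELECT d.driver_id, SUM(f.failed) AS failed
    FROM documents AS d
    JOIN failed_per_doc AS f ON f.doc_id = d.doc_id
    GROUP BY d.driver_id
    ORDER BY d.driver_id
"""

naive = conn.execute(JOIN_THEN_AGGREGATE).fetchall()
refactored = conn.execute(AGGREGATE_THEN_JOIN).fetchall()

# Rows entering the join in each shape (raw failing events vs. one row per document).
raw_rows = conn.execute(
    "SELECT COUNT(*) FROM audit_events WHERE result = 'fail'"
).fetchone()[0]
grouped_rows = conn.execute(
    "SELECT COUNT(DISTINCT doc_id) FROM audit_events WHERE result = 'fail'"
).fetchone()[0]

print(naive, refactored, raw_rows, grouped_rows)
```

Both queries return the same per-driver counts; the difference is how many rows the join has to hold. On this toy data the gap is small, but it grows with the number of events per document.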

Key Statistics & Figures

Daily queries handled by Presto at Uber

400,000

This statistic highlights the scale at which Presto operates within Uber's data architecture.

Number of nodes in Uber's Presto cluster

over a thousand

This demonstrates the robustness and scalability of Uber's Presto implementation.
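
For a sense of scale, the daily query figure above can be converted to an average rate. The snippet is plain arithmetic on the 400,000 figure and assumes an even spread over 24 hours; the source does not report peak load, which is likely higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

queries_per_day = 400_000  # figure quoted above
avg_queries_per_second = queries_per_day / SECONDS_PER_DAY

# Average rate only; real traffic is unlikely to be uniform across the day.
print(f"{avg_queries_per_second:.2f} queries per second on average")
```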

Technologies & Tools


Backend

Presto

Used as a SQL query engine to access and analyze large datasets across various sources.

Backend

Apache Pinot

Utilized for real-time analytics in conjunction with Presto.

Backend

Elasticsearch

Used for storing critical business data at Uber.

Storage

HDFS

Integrated with Presto for data storage and retrieval.

Key Actionable Insights

1
Optimize SQL queries by refactoring them to minimize memory usage, especially for large datasets.
This approach is essential when dealing with complex queries that risk running out of memory, as demonstrated by Uber's experience with document audits.

2
Leverage Presto's extensibility to integrate with various data sources, enhancing data accessibility.
By using Presto, teams can query data from multiple systems without needing to switch platforms, streamlining data analysis processes.

3
Engage with the Presto community to stay updated on best practices and contribute to ongoing improvements.
Active participation in the community can lead to better support and shared knowledge, which is beneficial for optimizing Presto's use in large organizations.
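
The second insight above, querying several systems through one SQL engine, can be illustrated with a small analogy. In Presto, connectors expose each source as a catalog and tables are addressed as catalog.schema.table. The sketch below does not use Presto; it mimics the idea with Python's built-in sqlite3 module, attaching two separate in-memory databases and joining across them in one statement. The names `realtime`, `warehouse`, `trips`, and `cities` are hypothetical.

```python
import sqlite3

# One connection plays the role of the query engine session.
conn = sqlite3.connect(":memory:")

# Two independent databases stand in for two separate backing systems.
conn.execute("ATTACH DATABASE ':memory:' AS realtime")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE realtime.trips (trip_id INTEGER, city_id INTEGER, fare REAL)")
conn.execute("CREATE TABLE warehouse.cities (city_id INTEGER, name TEXT)")

conn.executemany(
    "INSERT INTO realtime.trips VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 15.5), (3, 2, 7.25)],
)
conn.executemany(
    "INSERT INTO warehouse.cities VALUES (?, ?)",
    [(1, "Amsterdam"), (2, "Boston")],
)

# A single SQL statement joins data that lives in two different databases.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS trips, SUM(t.fare) AS total_fare
    FROM realtime.trips AS t
    JOIN warehouse.cities AS c ON c.city_id = t.city_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)
```

The analyst writes one query and never switches tools, which is the accessibility benefit the insight describes. A real deployment would configure a connector per source rather than attach local databases.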

Common Pitfalls

1
Failing to optimize SQL queries can lead to performance issues, such as running out of memory.
This often occurs when queries are not structured efficiently, causing excessive data movement in memory. Regularly reviewing and refactoring queries is essential to maintain performance.

Related Concepts

Big Data Architecture

SQL Query Optimization

Open Source Contributions

Real-time Data Analytics