How Uber Accomplishes Job Counting At Scale

Ryan Woo, Sameer Kapoor
11 min read, advanced

Overview

This article explains how Uber counts job participation at scale using Apache Pinot™. It covers the challenges of processing and analyzing large datasets, including capacity planning, query performance, slow data arrival, and bursty upstream loads, and the architectural decisions made to optimize performance.

What You'll Learn

1. How to count job participation effectively using Apache Pinot™
2. Why data retention policies impact data accessibility in large systems
3. How to optimize query performance in distributed databases

Prerequisites & Requirements

  • Understanding of distributed databases and data retention policies
  • Familiarity with Apache Pinot™ and Apache Hive™ (optional)

Key Questions Answered

How does Uber count job participation at scale?
Uber counts job participation by integrating Apache Pinot™ to analyze large datasets efficiently. This lets Uber derive insights from over 2.2 billion trips every quarter while addressing challenges like data retention and query performance.
What challenges did Uber face when implementing job counting?
Uber faced several challenges, including capacity planning, query performance, slow data arrival, and handling bursty upstream loads. Each challenge required specific strategies, such as optimizing segment sizes and implementing caching mechanisms.
What architectural decisions were made to support job counting?
The architecture involved using a hybrid table in Apache Pinot™ that combines real-time and offline data, allowing for efficient querying and analysis. This design choice was crucial for handling the scale of Uber's operations.
How did Uber optimize query performance in Apache Pinot™?
Uber optimized query performance by implementing sorted columns, inverted indices, and Bloom filters. These techniques significantly reduced the number of segments queried and improved overall read times.
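
The segment-skipping effect of Bloom filters can be illustrated with a minimal Python sketch. This is not Pinot's implementation: the `BloomFilter` and `Segment` classes, the `participant_id` key, and the filter sizes are all hypothetical, chosen only to show how a per-segment filter lets a query skip segments that cannot contain the requested key.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits      # illustrative size, not a tuned value
        self.num_hashes = num_hashes
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


class Segment:
    """Hypothetical stand-in for a segment of (participant_id, job_id) rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.bloom = BloomFilter()
        for participant_id, _ in rows:
            self.bloom.add(participant_id)


def count_jobs(segments, participant_id):
    """Return (job count, names of the segments that were actually scanned)."""
    scanned = []
    total = 0
    for seg in segments:
        if not seg.bloom.might_contain(participant_id):
            continue  # key is definitely absent: skip this segment entirely
        scanned.append(seg.name)
        total += sum(1 for pid, _ in seg.rows if pid == participant_id)
    return total, scanned
```

Because a Bloom filter can return false positives but never false negatives, the count stays exact. Only the number of segments scanned changes.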

Key Statistics & Figures

Number of trips facilitated per quarter
2.2 billion
This statistic illustrates the scale at which Uber operates and the need for efficient data processing solutions.
p99 read latency
~1s
This performance metric indicates the efficiency of the Apache Pinot™ solution after optimizations were implemented.

Technologies & Tools

Database
Apache Pinot™
Used for real-time analytics and job counting at scale.
Data Warehousing
Apache Hive™
Used for managing large datasets and facilitating data processing.
Stream Processing
Apache Kafka™
Used for handling real-time data streams.
Data Processing
Apache Spark™
Used for creating and uploading new segments in Apache Pinot™.

Key Actionable Insights

1. Implementing a hybrid table in Apache Pinot™ can significantly improve data accessibility and query performance.
This approach allows for seamless integration of real-time and offline data, which is essential for applications requiring immediate insights.
2. Using Bloom filters can drastically reduce the number of segments processed during queries.
By enabling Bloom filters, Uber was able to skip unnecessary segments, leading to faster query responses and reduced resource consumption.
3. Regularly review and adjust segment sizes in your database to optimize read performance.
As data grows, adjusting segment sizes can help maintain efficient read times and prevent performance bottlenecks.
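
The hybrid table from the first insight can be pictured as two stores split by a time boundary. The Python sketch below is a simplified illustration, not Uber's or Pinot's code: the `Row` type, the `count_hybrid` function, and the day-granularity timestamps are assumptions. Offline rows are counted up to the boundary and real-time rows strictly after it, so data present in both stores is not counted twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    participant_id: str
    event_day: int  # hypothetical time column, in whole days


def count_hybrid(offline_rows, realtime_rows, time_boundary, participant_id):
    """Count a participant's jobs across offline and real-time data.

    Offline data answers for event_day <= time_boundary; real-time data
    answers for event_day > time_boundary. The two ranges never overlap.
    """
    offline_count = sum(
        1 for r in offline_rows
        if r.participant_id == participant_id and r.event_day <= time_boundary
    )
    realtime_count = sum(
        1 for r in realtime_rows
        if r.participant_id == participant_id and r.event_day > time_boundary
    )
    return offline_count + realtime_count


offline = [Row("p1", 1), Row("p1", 2), Row("p2", 2)]
realtime = [Row("p1", 2), Row("p1", 3)]  # day 2 also exists in offline data
print(count_hybrid(offline, realtime, time_boundary=2, participant_id="p1"))  # prints 3
```

The real-time copy of the day 2 row is ignored because the offline store already covers that day. This is why the boundary has to be chosen from what the offline data actually contains.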

Common Pitfalls

1. Underestimating the impact of data retention policies on data accessibility.
Data retention policies can lead to significant challenges in accessing historical data, which may be critical for analysis. It's important to plan for how data will be stored and accessed over time.
2. Failing to optimize query performance before scaling.
Without proper optimizations, scaling up can lead to performance degradation, as seen when Uber's initial queries maxed out read throughput. Regular performance assessments are crucial.

Related Concepts

Distributed Databases
Data Retention Policies
Query Optimization Techniques