Blazing Fast OLAP on Uber’s Inventory and Catalog Data with Apache Pinot™

Suraj Modi, Ankit Sultana, Tarun Mavani

Uber

Overview

This article discusses Uber's implementation of Apache Pinot to manage and analyze its extensive inventory and catalog data efficiently. It highlights the challenges faced in real-time data management and the solutions provided by Pinot, including ingestion architecture, performance optimizations, and key features that enhance query responsiveness.

What You'll Learn

1. How to implement real-time data ingestion with Apache Pinot

2. Why using Bloom filters can enhance query performance in large datasets

3. When to apply upsert compaction techniques for efficient data management

Prerequisites & Requirements

Understanding of OLAP systems and real-time data processing
Familiarity with Apache Pinot and Kafka (optional)

Key Questions Answered

How does Uber manage its massive catalog data with Apache Pinot?

Uber uses Apache Pinot to handle its extensive catalog data by implementing a real-time ingestion architecture that allows for low-latency queries and efficient data management. This system supports billions of rows and hundreds of thousands of updates per second, ensuring data freshness and responsiveness.
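
One way to picture how a table absorbs a steady stream of updates is upsert semantics followed by compaction: each update is appended as a new row, only the newest row per key is treated as valid, and a later pass rewrites segments without the superseded rows. The sketch below is a toy model of that idea, not Pinot's or Uber's implementation; `Record`, `item_id`, `updated_at`, and `price` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    item_id: str      # hypothetical primary key
    updated_at: int   # hypothetical comparison column (event time)
    price: float


def latest_by_key(segments):
    """Map each item_id to the (segment_index, row_index) of its newest record."""
    latest = {}
    for s, segment in enumerate(segments):
        for r, rec in enumerate(segment):
            cur = latest.get(rec.item_id)
            if cur is None or rec.updated_at >= segments[cur[0]][cur[1]].updated_at:
                latest[rec.item_id] = (s, r)
    return latest


def compact(segments):
    """Rewrite every segment, keeping only rows that are still the newest for their key."""
    valid = set(latest_by_key(segments).values())
    return [
        [rec for r, rec in enumerate(segment) if (s, r) in valid]
        for s, segment in enumerate(segments)
    ]


# Three small "segments" in ingestion order; item-1 is updated twice.
segments = [
    [Record("item-1", 1, 9.99), Record("item-2", 1, 4.50)],
    [Record("item-1", 2, 10.49), Record("item-3", 2, 2.00)],
    [Record("item-1", 3, 10.99)],
]
compacted = compact(segments)
rows_before = sum(len(s) for s in segments)
rows_after = sum(len(s) for s in compacted)
print(rows_before, rows_after)
```

In this toy input, five stored rows shrink to three because two older versions of `item-1` are dropped, while queries over the valid rows return the same answer before and after compaction.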

What are the benefits of using Apache Pinot for Uber's inventory system?

The benefits include fresher data and low query latency: updates appear within 5-10 minutes, and queries return in 1-3 seconds even across billions of rows. The setup also improves resilience and makes it easy to add new attributes.

What optimizations were made to improve performance in Apache Pinot?

Performance improvements included the adoption of Java 17, which reduced garbage collection latencies significantly, and the implementation of a Small Segment Merger task that decreased segment counts and improved query latencies by up to 75%.
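
The summary does not describe how the Small Segment Merger chooses what to merge, so the following is only a minimal sketch of the general idea, assuming a simple greedy policy: walk the segments in order and group neighbours until a target row count would be exceeded. The function name `plan_merges` and the `target_rows` parameter are hypothetical.

```python
def plan_merges(segment_rows, target_rows):
    """Greedily group consecutive segments so each group holds at most target_rows rows."""
    groups, current, current_rows = [], [], 0
    for rows in segment_rows:
        # Close the current group if adding this segment would exceed the target.
        if current and current_rows + rows > target_rows:
            groups.append(current)
            current, current_rows = [], 0
        current.append(rows)
        current_rows += rows
    if current:
        groups.append(current)
    return groups


# Row counts of seven small segments (made-up numbers for illustration).
small_segments = [120, 80, 300, 50, 50, 400, 90]
merged = plan_merges(small_segments, target_rows=500)
print(len(small_segments), "segments ->", len(merged), "merged segments")
```

Fewer, larger segments mean less per-segment overhead for each query, which is the intuition behind the segment-count and latency improvements reported here.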

Key Statistics & Figures

Query latency improvement: 75%
p99 query latency improved from 1150 ms to 269 ms after implementing the Small Segment Merger.

Table size reduction: 40%
Peak table size reduced from 42 TB to 24 TB following the implementation of the Small Segment Merger.

Segment count reduction: 70%
Segment count decreased from 74,000 to 22,000 due to the new compaction and merging strategies.
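
The headline percentages are rounded. As a quick check, the reductions can be recomputed from the before and after values quoted in this section; the helper name `percent_reduction` is just an illustrative choice.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


# (before, after) pairs taken from the figures quoted in this section.
figures = {
    "p99 query latency (ms)": (1150, 269),
    "peak table size (TB)": (42, 24),
    "segment count": (74_000, 22_000),
}
reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in figures.items()
}
for name, pct in reductions.items():
    print(f"{name}: {pct}% reduction")
```

The computed values are roughly 76.6%, 42.9%, and 70.3%, so the quoted 75%, 40%, and 70% are slightly conservative roundings.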

Technologies & Tools

Database: Apache Pinot
Used for real-time analytics and managing Uber's catalog data.

Message Broker: Apache Kafka
Facilitates the ingestion of events into Pinot.

Search Library: Apache Lucene
Powers text search capabilities within Pinot.

Programming Language: Java
Used for implementing performance improvements and optimizations.

Key Actionable Insights

1. Implement real-time ingestion pipelines using Apache Pinot to improve data freshness and query responsiveness.
   This approach is particularly beneficial for applications that need immediate access to updated data, such as inventory management systems.

2. Use Bloom filters for specific item lookups to improve query performance and reduce CPU usage.
   This technique matters most in large datasets where specific records must be found quickly.

3. Adopt upsert compaction to manage frequently updated data efficiently while keeping long retention.
   This matters for systems that receive continuous updates to the same records, since superseded versions would otherwise accumulate in storage.
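
To make the Bloom filter insight above concrete, here is a minimal sketch of how a per-segment Bloom filter lets a lookup skip segments that cannot contain the requested item. It is a teaching example, not Pinot's index implementation; the class, the sizing parameters, and the segment and item names are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, value):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


def segments_to_scan(segment_filters, item_id):
    """Return only the segments whose filter says the item might be present."""
    return [name for name, bf in segment_filters.items() if bf.might_contain(item_id)]


# Hypothetical segments and the item ids they hold.
segment_items = {
    "seg_a": ["item-1", "item-2"],
    "seg_b": ["item-3"],
    "seg_c": ["item-4", "item-5"],
}
segment_filters = {}
for name, items in segment_items.items():
    bf = BloomFilter()
    for item in items:
        bf.add(item)
    segment_filters[name] = bf

candidates = segments_to_scan(segment_filters, "item-3")
print(candidates)
```

The segment that really holds the item is always among the candidates; other segments appear only on a false positive, which becomes rarer as `num_bits` grows relative to the number of stored values. Skipping the remaining segments is where the CPU savings come from.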

Common Pitfalls

1. Failing to optimize query performance can lead to significant latency issues.
   Without proper indexing and query optimization strategies, systems can become slow and unresponsive, especially under heavy load.

2. Neglecting data freshness in real-time systems can result in outdated information being presented to users.
   It's crucial to ensure that data ingestion processes are efficient and timely to maintain user trust and system reliability.

Related Concepts

Real-time Data Processing

OLAP Systems

Data Ingestion Architectures

Performance Optimization Techniques