How Uber Indexes Streaming Data with Pull-Based Ingestion in OpenSearch™

Yupeng Fu, Varun Bharadwaj, Shuyi Zhang, Xu Xiong, Michael Froh
14 min readadvanced
--
View Original

Overview

This article discusses how Uber utilizes a pull-based ingestion model in OpenSearch™ to effectively index streaming data. It highlights the architecture's benefits, including improved reliability and scalability, and details the implementation of this model within Uber's search platform.

What You'll Learn

1

How to implement a pull-based ingestion model in OpenSearch

2

Why pull-based ingestion improves data reliability and scalability

3

How to manage out-of-order data in streaming ingestion

Prerequisites & Requirements

  • Understanding of streaming data concepts and OpenSearch architecture
  • Familiarity with Apache Kafka and OpenSearch(optional)

Key Questions Answered

How does Uber's pull-based ingestion model enhance data indexing?
Uber's pull-based ingestion model enhances data indexing by decoupling data producers from the search cluster, allowing for greater reliability and control over data flow. This model mitigates issues like ingestion spikes and prioritizes critical updates, ensuring that data is indexed efficiently and consistently across regions.
What are the main components of the pull-based ingestion architecture in OpenSearch?
The pull-based ingestion architecture in OpenSearch includes components like the IngestionPlugin interface, StreamPoller for consuming data from Kafka or Kinesis, and the IngestionEngine that processes indexing requests. This setup allows for efficient data handling and supports high throughput in streaming environments.
What challenges does the pull-based model address compared to push-based systems?
The pull-based model addresses challenges such as managing ingestion spikes, lack of priority control for indexing requests, and complex data replay scenarios. By using a durable buffer, it ensures that data is not lost during traffic spikes and simplifies the replay process for data recovery.
How does Uber handle shard recovery in the pull-based ingestion model?
Uber's shard recovery process involves using a BatchStartPointer to track the minimum offset being processed across all writer threads. This ensures that when a primary shard fails, the system can accurately resume ingestion from the last known safe point, preventing data loss and duplicates.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implementing a pull-based ingestion model can significantly enhance the reliability of your data indexing process.
This model allows for better handling of data spikes and ensures that critical updates are prioritized, which is essential for applications requiring real-time data processing.
2
Utilizing Apache Kafka as a streaming source can simplify the management of data ingestion and recovery.
Kafka's durable message storage allows for efficient replay of messages, which is crucial for maintaining data consistency during shard recovery.
3
Incorporating versioning in your data ingestion pipeline can help manage out-of-order events effectively.
This ensures that the most recent updates are preserved and prevents older updates from overwriting newer ones, maintaining data integrity.

Common Pitfalls

1
Assuming that a push-based ingestion model is sufficient for high-volume data environments can lead to data loss and increased complexity.
Push-based systems often struggle with backpressure during traffic spikes, leading to rejected requests and operational challenges. Transitioning to a pull-based model can alleviate these issues.
2
Neglecting to implement proper versioning can result in data inconsistency when handling out-of-order events.
Without versioning, older updates may overwrite newer ones, which can compromise data integrity. Implementing external versioning is crucial for maintaining a consistent view of documents.

Related Concepts

Streaming Data Architecture
Data Ingestion Strategies
Opensearch Capabilities
Apache Kafka Integration