Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber

Nishith Agarwal, Kaushik Devarajaiah
12 min readintermediate
--
View Original

Overview

The article discusses Uber's approach to consistent data partitioning using a Global Index for managing large Apache Hadoop tables. It highlights the challenges faced in data ingestion and querying at scale, and how Uber's Big Data platform leverages open source technologies to enhance data accessibility and reliability.

What You'll Learn

1

How to implement a Global Index for efficient data lookup in Hadoop tables

2

Why strong consistency is crucial for data ingestion processes

3

How to optimize HBase for high throughput during data ingestion

Prerequisites & Requirements

  • Understanding of Hadoop and HBase architecture
  • Familiarity with Apache Spark for data processing(optional)

Key Questions Answered

How does Uber manage data ingestion for large datasets?
Uber manages data ingestion by classifying data into append-only and append-plus-update types, utilizing a Global Index for efficient data location and updates. This approach allows for high throughput during ingestion, especially during the bootstrap phase, where millions of requests per second are needed.
What are the challenges of using open source key-value stores for data indexing?
Open source key-value stores often struggle with scalability, either compromising on throughput or correctness. Uber found that these systems could not meet their demands for millions of operations per second, prompting the development of their Global Index to ensure strong consistency and high throughput.
What is the role of HFiles in Uber's data ingestion process?
HFiles are used to store index information in HBase's internal format. They facilitate the efficient upload of indexes during the bootstrap phase, allowing for quick ingestion of large datasets by ensuring data is sorted and organized according to HBase's requirements.
How does Uber handle throttling during HBase access?
Uber implements throttling to manage load peaks during the incremental ingestion phase by controlling cumulative writes per second to HBase regional servers. This ensures that the system remains stable and responsive, even under high demand.

Key Statistics & Figures

Throughput during bootstrap phase
millions of requests per second
This level of throughput is necessary to ingest large amounts of data quickly when onboarding new datasets.
HFile upload time for large datasets
less than an hour
This efficiency is achieved by pre-splitting HBase tables to match the number of HFiles being uploaded.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database
Hbase
Used for storing index information and facilitating fast data lookups.
Data Processing
Apache Spark
Utilized for generating HFiles and processing large datasets in a distributed manner.

Key Actionable Insights

1
Implement a Global Index to enhance data lookup efficiency in large datasets.
This approach can significantly reduce the time taken to locate and update records, especially in environments with high data ingestion rates.
2
Utilize HFiles for bulk uploading indexes to HBase to improve ingestion speed.
By generating HFiles in the correct format and pre-splitting HBase tables, you can minimize upload latency and ensure faster data processing.
3
Monitor and adjust throttling parameters based on HBase server load to maintain performance.
This proactive management helps avoid bottlenecks during peak ingestion times, ensuring a smooth data flow and system reliability.

Common Pitfalls

1
Assuming uniform distribution of indexes across HBase regional servers can lead to performance issues.
If certain datasets have fewer indexes, they may receive less QPS, leading to slower performance. Adjusting QPSFraction can help mitigate this.

Related Concepts

Data Partitioning Strategies
Indexing Techniques In Distributed Systems
Hadoop Ecosystem Components