Enabling Infinite Retention for Upsert Tables in Apache Pinot

Pratik Tibrewal
10 min read · advanced

Overview

The article discusses recent developments in Apache Pinot that enable infinite retention for upsert tables, focusing on the implementation of deletions at both memory and disk levels. It highlights how these advancements allow Uber to efficiently manage large datasets while maintaining performance.

What You'll Learn

1. How to implement point deletes in Apache Pinot for efficient data management
2. Why metadata retention is crucial for managing deleted keys in upsert tables
3. How to perform upsert compaction to optimize storage in Apache Pinot

Prerequisites & Requirements

  • Understanding of Apache Pinot and OLAP databases
  • Familiarity with data retention policies and upsert operations (optional)

Key Questions Answered

How does Apache Pinot enable infinite retention for upsert tables?
Apache Pinot enables infinite retention for upsert tables by allowing deletions at both memory and disk levels. This feature ensures that records can be efficiently updated or deleted based on specific business needs, thus managing large datasets sustainably.
What are point deletes in Apache Pinot and how are they configured?
Point deletes in Apache Pinot allow users to mark records as deleted, ensuring they are not returned in subsequent queries. This feature can be enabled through table-level configuration, allowing for efficient data management without increasing memory usage.
What is the impact of enabling metadata retention on deleted keys?
With metadata retention enabled, Apache Pinot keeps the metadata of a deleted key for a specified TTL window and removes it only after that window has passed. Keeping the metadata during the window prevents deleted records from reappearing when out-of-order events arrive, and removing it afterwards keeps memory usage bounded.
How does upsert compaction work in Apache Pinot?
Upsert compaction in Apache Pinot involves merging segments to remove stale or deleted rows from disk. This process optimizes storage and improves query performance by reducing the number of segments that need to be processed during server restarts.
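
The three mechanisms described in the answers above can be illustrated with a small toy model. This is a sketch for intuition only, not Pinot's implementation: the class `ToyUpsertTable`, its methods, and the `trip-1`/`trip-2` keys are hypothetical names invented for this example, and timestamps are plain integers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    key: str
    ts: int                  # comparison value, e.g. event time
    value: Optional[str]
    deleted: bool = False    # True marks a point delete


class ToyUpsertTable:
    """Toy model of upsert metadata; hypothetical, not Pinot code."""

    def __init__(self, deleted_keys_ttl: int) -> None:
        self.deleted_keys_ttl = deleted_keys_ttl
        self.rows: List[Record] = []       # stands in for rows on disk
        self.latest: Dict[str, int] = {}   # in-memory metadata: key -> winning row index

    def ingest(self, rec: Record) -> None:
        self.rows.append(rec)
        idx = self.latest.get(rec.key)
        if idx is not None and self.rows[idx].ts > rec.ts:
            return  # out-of-order event: stored, but it never becomes the valid row
        self.latest[rec.key] = len(self.rows) - 1

    def query(self) -> Dict[str, Optional[str]]:
        # Keys whose winning row is a delete marker are hidden from results.
        out = {}
        for key, idx in self.latest.items():
            row = self.rows[idx]
            if not row.deleted:
                out[key] = row.value
        return out

    def expire_deleted_keys(self, now: int) -> None:
        # Drop metadata of deleted keys once the TTL window has passed.
        expired = [k for k, i in self.latest.items()
                   if self.rows[i].deleted and now - self.rows[i].ts > self.deleted_keys_ttl]
        for k in expired:
            del self.latest[k]

    def compact(self) -> None:
        # Rewrite storage, keeping only rows still referenced by metadata.
        kept = [self.rows[i] for i in sorted(self.latest.values())]
        self.rows = kept
        self.latest = {r.key: n for n, r in enumerate(kept)}


table = ToyUpsertTable(deleted_keys_ttl=10)
table.ingest(Record("trip-1", ts=1, value="requested"))
table.ingest(Record("trip-1", ts=5, value="completed"))
table.ingest(Record("trip-2", ts=2, value="requested"))
table.ingest(Record("trip-1", ts=8, value=None, deleted=True))   # point delete
after_delete = table.query()

table.ingest(Record("trip-1", ts=6, value="late-update"))        # out-of-order event
after_late_event = table.query()

table.expire_deleted_keys(now=30)   # TTL window for trip-1 has passed
table.compact()                     # stale and deleted rows leave storage
```

In this sketch the late `trip-1` event does not bring the key back because the delete marker's metadata is still held when it arrives. An event arriving after the TTL has expired would not be caught, which is why the TTL in such a model has to exceed the longest expected delay of out-of-order events.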

Key Statistics & Figures

Daily deletion rate: 600 million keys
This statistic illustrates the scale at which Uber operates with deleted keys in their upsert tables.

Memory utilization for upsert use cases: 80%
This shows the high memory demands associated with upsert operations at Uber.

Disk utilization for non-upsert use cases: 80%
This comparison highlights the efficiency differences between upsert and append-only operations.

Technologies & Tools

Database
Apache Pinot
Used for managing upsert operations and enabling infinite retention.

Key Actionable Insights

1. Implement point deletes to enhance data management in upsert tables.
   This allows for efficient handling of records that need to be removed without impacting the overall performance of the database.
2. Utilize metadata retention features to maintain data consistency.
   By setting a TTL for deleted keys, you can ensure that out-of-order events arriving within that window do not reverse deletions, which is crucial for maintaining accurate datasets.
3. Regularly perform upsert compaction to optimize storage usage.
   This keeps disk usage in check, especially in environments with high deletion rates, because stale and deleted rows no longer accumulate on disk.
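
As a rough illustration of how these three steps might map onto a table configuration, the fragment below builds the relevant settings as a Python dictionary and renders them as JSON. The key names (`deleteRecordColumn`, `deletedKeysTTL`, `enableSnapshot`, `UpsertCompactionTask` and its options) are assumed to match the Pinot table config and should be checked against the documentation for the Pinot version in use. The column name `is_deleted` and every value shown are hypothetical placeholders, not settings taken from the article.

```python
import json

# Sketch of a table config fragment; all values are illustrative assumptions.
table_config_fragment = {
    "upsertConfig": {
        "mode": "FULL",
        # Point deletes: a boolean column marking a record as deleted.
        # "is_deleted" is a hypothetical column name.
        "deleteRecordColumn": "is_deleted",
        # Metadata retention for deleted keys. The unit is assumed to follow
        # the comparison column; 86400 is a placeholder value.
        "deletedKeysTTL": 86400,
        # Assumed here to be needed so compaction can find invalid rows.
        "enableSnapshot": True,
    },
    "task": {
        "taskTypeConfigsMap": {
            # Upsert compaction, run periodically as a background task.
            "UpsertCompactionTask": {
                "schedule": "0 0 */6 ? * *",              # placeholder Quartz cron
                "bufferTimePeriod": "7d",                  # placeholder
                "invalidRecordsThresholdPercent": "30",    # placeholder
            }
        }
    },
}

rendered = json.dumps(table_config_fragment, indent=2)
print(rendered)
```

The TTL and compaction thresholds are workload-dependent, so the placeholder values above are only a starting point for experimentation.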

Common Pitfalls

1. Failing to manage the metadata of deleted records can lead to increased memory usage and potential out-of-memory errors.
This happens when deleted records are not properly handled, leading to stale data remaining in memory and disk, which can impact performance.

Related Concepts

Upsert Operations in Databases
Data Retention Policies
Memory Management in OLAP Systems