Building scalable near-real time indexing on HBase

Pinterest Engineering
11 min readadvanced
--
View Original

Overview

The article discusses the development of Ixia, a scalable near-real-time secondary indexing solution built on HBase at Pinterest. It addresses the challenges of advanced indexing and querying in HBase and outlines the architecture, data consistency, disaster recovery, and performance optimization strategies employed in Ixia.

What You'll Learn

1

How to implement near-real-time secondary indexing on HBase using Ixia

2

Why using a custom replication sink server enhances data consistency in HBase

3

How to optimize caching strategies to improve read performance in high read/write scenarios

Prerequisites & Requirements

  • Understanding of HBase and NoSQL databases
  • Familiarity with Kafka and distributed systems(optional)

Key Questions Answered

How does Ixia ensure data consistency in HBase?
Ixia maintains data consistency by using a Change Data Capture (CDC) framework that guarantees all Write Ahead Log (WAL) events are published to a replication sink service. This ensures that the index is updated only after a successful write request to the database, thus providing strong consistency for Get requests.
What caching strategy does Ixia use to optimize performance?
Ixia employs a look-aside caching strategy, where read requests are first checked in the cache. If a cache miss occurs, the data is fetched from HBase. This approach has resulted in a cache hit rate of around 90%, significantly reducing the load on HBase.
What is the performance of Ixia in terms of query handling?
Ixia can handle approximately 40,000 queries per second (QPS) with a P99 latency of about 5 milliseconds. The system is designed to scale horizontally, allowing it to manage high traffic efficiently while maintaining a 99.99% availability SLA.
How does Ixia handle disaster recovery?
Ixia is backed by two HBase clusters for fault tolerance, with an active cluster serving online traffic and a standby cluster for replication. In case of failure, zero-downtime failover mechanisms ensure continuous availability, along with periodic HBase snapshots for point-in-time recovery.

Key Statistics & Figures

Cache hit rate
90%
Achieved after implementing a look-aside caching strategy in Ixia.
Queries per second (QPS)
40,000
One of the production clusters serves this volume with a P99 latency of ~5ms.
Maximum throughput
250,000 QPS
The peak performance of the entire Ixia system.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database
Hbase
Serves as the source-of-truth database for Ixia.
Messaging
Kafka
Used for publishing WAL events and triggering business logic.
Search Engine
Muse
Provides rich search functionalities for Ixia.
Caching
Memcached
Used in Ixia's distributed caching infrastructure.
Caching
Mcrouter
Supports the caching infrastructure at Pinterest.

Key Actionable Insights

1
Implement a CDC framework to enhance data consistency in distributed systems.
Using a CDC framework like the one in Ixia can help ensure that all changes to the database are captured and reflected in secondary indexes, thus maintaining strong consistency.
2
Adopt a look-aside caching strategy to improve read performance in applications with high read/write ratios.
This strategy allows for quick access to frequently requested data, reducing the load on the primary database and improving overall application responsiveness.
3
Utilize horizontal scaling to manage increasing query loads effectively.
By designing systems to scale horizontally, as demonstrated in Ixia, organizations can handle higher traffic without compromising performance or availability.

Common Pitfalls

1
Overlooking the complexity of maintaining secondary indexes in a distributed system can lead to performance bottlenecks.
As the number of indexes grows, the system can become more complex, increasing the operational load and making it harder to maintain performance.
2
Failing to implement proper caching strategies can result in high latency and increased load on the primary database.
Without an effective caching strategy, applications may experience slow response times and higher operational costs due to excessive database queries.

Related Concepts

Distributed Systems
Change Data Capture (cdc)
Caching Strategies
Secondary Indexing In Nosql Databases