Upgrading Pinterest to HBase 1.2 from 0.94

Pinterest Engineering
7 min read, intermediate

Overview

The article discusses Pinterest's upgrade of HBase from version 0.94.26 to 1.2, emphasizing the importance of maintaining high performance and availability during the transition. It details the challenges faced, the design of a zero-downtime upgrade system, and the performance improvements achieved post-upgrade.

What You'll Learn

1
How to upgrade HBase clusters without downtime
2
Why bi-directional heterogeneous replication is crucial during upgrades
3
How to optimize HBase performance for low latency

Prerequisites & Requirements

  • Understanding of HBase architecture and replication concepts
  • Familiarity with Thrift for replication (optional)
  • Experience with performance tuning in distributed systems

Key Questions Answered

What challenges did Pinterest face during the HBase upgrade?
Pinterest faced significant challenges during the upgrade from HBase 0.94.26 to 1.2 due to incompatible changes requiring complete shutdowns of clusters and clients. To address this, they developed a system that allows for upgrades and rollbacks without downtime, ensuring continuous service availability.
How did Pinterest ensure data consistency during the upgrade?
Pinterest implemented bi-directional heterogeneous replication between the 0.94 and 1.2 clusters, allowing real-time data synchronization. They also developed a tool called checkr to verify data correctness by comparing rows across clusters, ensuring that no data inconsistency occurred during the transition.
What performance improvements were observed after upgrading to HBase 1.2?
After the upgrade to HBase 1.2, Pinterest measured latency improvements of 124% to 800% across service APIs. The new cluster proved to be more robust and had fewer operational issues compared to the previous version, enhancing overall service performance.
What specific performance tuning strategies were applied during the upgrade?
Pinterest focused on tuning garbage collection (GC) settings for low latency and adjusted the WAL sync settings to improve write latency. They found that reducing the number of WAL sync operations significantly improved performance, particularly under increased write loads.
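
The summary does not say which WAL settings Pinterest changed, so the following is only a toy illustration of why fewer sync operations can help under write load. It models a general technique, group commit, where one sync makes several pending edits durable at once. `ToyWal`, `batch_size`, and `count_syncs` are hypothetical names for this sketch and are not HBase APIs or configuration keys.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyWal:
    """In-memory stand-in for a write-ahead log (assumption, not an HBase class)."""
    batch_size: int = 1  # hypothetical knob: edits accumulated before a sync
    synced: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)
    sync_calls: int = 0

    def append(self, edit: str) -> None:
        # Buffer the edit; sync only once the batch is full.
        self.pending.append(edit)
        if len(self.pending) >= self.batch_size:
            self.sync()

    def sync(self) -> None:
        # A single sync makes every pending edit durable together.
        if not self.pending:
            return
        self.synced.extend(self.pending)
        self.pending.clear()
        self.sync_calls += 1


def count_syncs(num_edits: int, batch_size: int) -> int:
    """Return how many syncs are needed to persist num_edits edits."""
    wal = ToyWal(batch_size=batch_size)
    for i in range(num_edits):
        wal.append(f"edit-{i}")
    wal.sync()  # flush any partially filled final batch
    assert len(wal.synced) == num_edits
    return wal.sync_calls
```

In this model, edits sitting in `pending` are not yet durable, so a larger batch trades a wider window of unsynced writes for fewer sync calls.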

Key Statistics & Figures

Data served
10 petabytes
HBase serves this amount of data at over 10 million queries per second (QPS).
Performance improvement
124–800 percent
Measured improvements in service API latencies after upgrading to HBase 1.2.
JIRA cases addressed
over 5,000
HBase 1.2 includes more than 5,000 completed JIRA cases compared to version 0.94.26.

Technologies & Tools

Database
HBase
Used as the primary data storage system for Pinterest's critical services.
Protocol
Thrift
Utilized for replication between HBase clusters.
Client
AsyncHBase
An asynchronous HBase client used for compatibility with multiple HBase versions.

Key Actionable Insights

1
Implement a zero-downtime upgrade strategy for critical systems to maintain service availability during transitions.
This approach is essential for businesses that rely on continuous service, as it minimizes disruptions and enhances user experience during upgrades.
2
Utilize bi-directional heterogeneous replication to ensure data consistency across different versions of databases.
This technique allows for real-time synchronization and is crucial when upgrading systems with incompatible changes, reducing the risk of data loss.
3
Invest time in performance tuning, especially for Garbage Collection settings, to achieve low latency in high-demand environments.
Proper tuning can lead to significant performance gains, as seen in Pinterest's experience with HBase, where latency improvements were critical for real-time requests.
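
Bi-directional replication needs some way to stop an edit from bouncing back to the cluster that wrote it. The summary does not describe how Pinterest's Thrift-based replication handles this, so the sketch below shows one common approach under stated assumptions: each edit carries the id of the cluster that first accepted it, and a cluster never forwards an edit that originated at its peer. `ToyCluster`, `Edit`, and the cluster ids are hypothetical, and concurrent writes to the same row on both sides are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Edit:
    row: str
    value: str
    origin: str  # id of the cluster that first accepted the write


@dataclass
class ToyCluster:
    """In-memory stand-in for one side of a replicated pair (assumption)."""
    cluster_id: str
    rows: Dict[str, str] = field(default_factory=dict)
    outbox: List[Edit] = field(default_factory=list)
    peer: Optional["ToyCluster"] = field(default=None, repr=False, compare=False)

    def client_put(self, row: str, value: str) -> None:
        # A client write is tagged with this cluster as its origin.
        self._apply(Edit(row, value, origin=self.cluster_id))

    def _apply(self, edit: Edit) -> None:
        self.rows[edit.row] = edit.value
        # Queue for replication unless the edit came from the peer,
        # which would otherwise create an endless replication loop.
        if self.peer is not None and edit.origin != self.peer.cluster_id:
            self.outbox.append(edit)

    def replicate(self) -> int:
        """Ship all queued edits to the peer; return how many were sent."""
        shipped = 0
        while self.outbox:
            self.peer._apply(self.outbox.pop(0))
            shipped += 1
        return shipped


old = ToyCluster("hbase-0.94")
new = ToyCluster("hbase-1.2")
old.peer, new.peer = new, old

old.client_put("pin:1", "a")   # write lands on the old cluster
new.client_put("pin:2", "b")   # write lands on the new cluster
old.replicate()
new.replicate()
```

After one round in each direction both clusters hold the same rows and both outboxes are empty, which is what allows traffic to be moved to the new cluster and back again during a rollback.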

Common Pitfalls

1
Failing to account for incompatible changes during upgrades can lead to significant downtime.
Many systems require complete shutdowns for upgrades due to API and wire format changes, which can disrupt services if not planned properly.
2
Neglecting to verify data consistency during replication can result in data loss or corruption.
Asynchronous replication can introduce lag, so it's crucial to implement verification tools like checkr to ensure data integrity.
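
Because replication is asynchronous, a row that differs between clusters may simply not have been replicated yet. The summary says checkr compares rows across clusters but does not describe its internals, so the following is a hypothetical sketch of one way to separate lag from real inconsistency: compare once, wait, then re-compare only the rows that differed. The function names, the `wait` callback, and `max_rechecks` are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional

# A reader returns the stored value for a row key, or None if the row is absent.
Reader = Callable[[str], Optional[str]]


def find_mismatches(keys: Iterable[str], read_a: Reader, read_b: Reader) -> List[str]:
    """Return the keys whose values differ between the two clusters."""
    return [k for k in keys if read_a(k) != read_b(k)]


def verify_with_recheck(
    keys: Iterable[str],
    read_a: Reader,
    read_b: Reader,
    wait: Callable[[], None],
    max_rechecks: int = 3,
) -> List[str]:
    """Compare rows, re-checking mismatches after a wait to allow for lag.

    Keys still mismatched after max_rechecks rounds are returned as
    likely real inconsistencies rather than replication delay.
    """
    suspects = find_mismatches(keys, read_a, read_b)
    for _ in range(max_rechecks):
        if not suspects:
            break
        wait()  # e.g. sleep longer than the expected replication delay
        suspects = find_mismatches(suspects, read_a, read_b)
    return suspects
```

This sketch only checks the keys it is given, so rows that exist on one cluster but are missing from the key list would need a separate scan of both sides.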

Related Concepts

HBase Architecture
Data Replication Strategies
Performance Tuning In Distributed Systems