Achieving High Availability with distributed database on Kubernetes at Airbnb

How to achieve high availability with distributed databases on Kubernetes

Artem Danilov

Overview

This article describes how Airbnb achieves high availability for distributed databases running on Kubernetes. It covers the challenges of operating stateful services on Kubernetes, the strategies Airbnb implemented, and the practices it developed for managing databases in a cloud environment.

What You'll Learn

1. How to deploy a distributed database cluster across multiple Kubernetes clusters
2. Why AWS EBS helps with durability and fast node replacement, and how to handle its latency spikes
3. How to implement a custom Kubernetes operator for database management
4. When to use stale reads to mitigate latency spikes in distributed databases

Prerequisites & Requirements

  • Understanding of Kubernetes and distributed databases
  • Familiarity with AWS services, particularly EBS (optional)

Key Questions Answered

How does Airbnb achieve high availability with distributed databases on Kubernetes?
Airbnb achieves high availability by deploying distributed database clusters across multiple Kubernetes clusters in different AWS availability zones. This setup limits the blast radius of issues and ensures that even if one cluster faces problems, others remain operational, thus maintaining overall system reliability.
What challenges does Kubernetes present for managing stateful services like databases?
Kubernetes poses challenges for stateful services due to its lack of awareness of data distribution across nodes. This requires careful data handling during node replacements to prevent data quorum loss and service disruption, necessitating strategies like using AWS EBS for storage volume management.
What strategies does Airbnb use to handle node replacements in their database clusters?
Airbnb categorizes node replacement events into database-initiated, proactive infrastructure, and unplanned failures. They implement custom checks and admission hooks in their Kubernetes operator to manage these events safely, ensuring data consistency and availability during replacements.
How does Airbnb mitigate latency spikes when using AWS EBS?
To mitigate latency spikes with AWS EBS, Airbnb implemented a storage read timeout session variable that allows queries to retry against other storage nodes. They also leverage stale reads to serve requests independently from replicas, reducing the impact of leader latency during spikes.
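
The timeout-and-retry behavior described in the last answer can be sketched as follows. This is a minimal simulation, not Airbnb's implementation: the `StorageNode` class, the `read_with_retry` function, and the latency numbers are all hypothetical, and no real storage engine or session variable is involved.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


class StorageTimeout(Exception):
    """Raised when a storage node does not answer within the read timeout."""


@dataclass
class StorageNode:
    name: str
    latency_ms: float  # simulated response latency (hypothetical value)
    value: str

    def read(self, key: str, timeout_ms: float) -> str:
        # Simulates a slow volume; no real I/O or sleeping happens here.
        if self.latency_ms > timeout_ms:
            raise StorageTimeout(f"{self.name} exceeded {timeout_ms} ms")
        return self.value


def read_with_retry(
    nodes: Sequence[StorageNode], key: str, timeout_ms: float
) -> Tuple[str, str]:
    """Try each node in order and move on when one exceeds the timeout.

    Returns (name of the node that answered, value read).
    """
    last_error = None
    for node in nodes:
        try:
            return node.name, node.read(key, timeout_ms)
        except StorageTimeout as err:
            last_error = err
    raise StorageTimeout(f"all {len(nodes)} nodes timed out") from last_error


# The first node is in the middle of a latency spike; the second is healthy.
nodes = [
    StorageNode("storage-0", latency_ms=900.0, value="v42"),
    StorageNode("storage-1", latency_ms=4.0, value="v42"),
]
served_by, value = read_with_retry(nodes, "listing:1", timeout_ms=50.0)
print(served_by, value)
```

The design point is that a short per-read timeout turns a slow node into a fast failure, so the query pays roughly one timeout of extra latency instead of waiting out the whole spike.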

Key Statistics & Figures

Largest cluster throughput
3M QPS
The largest database cluster at Airbnb processes 3 million queries per second across 150 storage nodes.
Data storage
300+ TB
Airbnb's database setup stores over 300 terabytes of data across 4 million internal shards.
Availability
99.95%
The techniques described allow Airbnb's database infrastructure to achieve 99.95% availability.
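
As a rough sanity check, the figures above can be turned into per-node and per-shard averages and a downtime budget. The sketch below assumes load and data are spread evenly, uses decimal units (1 TB = 1,000,000 MB), treats 300 TB as a lower bound, and measures downtime over a 30-day window. None of these assumptions come from the source.

```python
# Figures quoted in the statistics above.
TOTAL_QPS = 3_000_000
STORAGE_NODES = 150
DATA_TB = 300            # "over 300 TB", so this is a lower bound
SHARDS = 4_000_000
AVAILABILITY = 0.9995

# Average load per storage node, assuming an even spread (assumption).
qps_per_node = TOTAL_QPS / STORAGE_NODES

# Average shard size in MB, using 1 TB = 1,000,000 MB (assumption).
mb_per_shard = DATA_TB * 1_000_000 / SHARDS

# Unavailability allowed in a 30-day window (window length is an assumption).
downtime_minutes = round((1 - AVAILABILITY) * 30 * 24 * 60, 1)

print(qps_per_node)      # 20000.0
print(mb_per_shard)      # 75.0
print(downtime_minutes)  # 21.6
```

On these assumptions, each storage node serves about 20,000 queries per second, an average shard holds at least 75 MB, and 99.95% availability leaves roughly 21.6 minutes of unavailability per 30 days.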

Technologies & Tools

Orchestration
Kubernetes
Used for managing distributed database clusters in a cloud environment.
Storage
AWS EBS
Provides durable storage volumes for database nodes, facilitating quick reattachment during node replacements.
Automation
K8s Operator
Custom operator used to manage database operations and node replacements effectively.

Key Actionable Insights

1. Implement a distributed database across multiple Kubernetes clusters to enhance availability.
This approach limits the impact of a failure to a single cluster, so the overall system remains operational even when one cluster or availability zone has problems.
2. Utilize AWS EBS for its durability and quick reattachment capabilities during node replacements.
This strategy helps maintain high availability while simplifying the management of storage volumes in a cloud environment.
3. Adopt a custom Kubernetes operator to manage complex database operations effectively.
A custom operator can automate and tailor Kubernetes operations to the specific needs of your database application, enhancing reliability and performance.
4. Incorporate stale reads in your database queries to reduce the impact of storage latency spikes.
Replicas can serve these reads without going through the leader, so the application keeps responding quickly even when the leader is slow. This suits only queries that can tolerate slightly outdated data.
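
The stale-read insight can be illustrated with a small replica-selection sketch. It is a hedged illustration, not the database's actual API: the `Replica` type, the `pick_replica` function, and the `lag_ms` and `latency_ms` metrics are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    is_leader: bool
    lag_ms: float      # how far the replica's data trails the leader (hypothetical metric)
    latency_ms: float  # currently observed read latency (hypothetical metric)


def pick_replica(
    replicas: Sequence[Replica], max_staleness_ms: Optional[float]
) -> Replica:
    """Pick the lowest-latency replica that is fresh enough for this read.

    max_staleness_ms=None means the caller needs a fresh read, so only the
    leader is eligible. Otherwise any replica within the bound may serve it.
    """
    if max_staleness_ms is None:
        eligible = [r for r in replicas if r.is_leader]
    else:
        eligible = [
            r for r in replicas if r.is_leader or r.lag_ms <= max_staleness_ms
        ]
    if not eligible:
        raise ValueError("no eligible replica")
    return min(eligible, key=lambda r: r.latency_ms)


replicas = [
    Replica("leader", True, lag_ms=0.0, latency_ms=800.0),       # latency spike
    Replica("follower-a", False, lag_ms=200.0, latency_ms=5.0),
    Replica("follower-b", False, lag_ms=5000.0, latency_ms=3.0),  # too far behind
]
print(pick_replica(replicas, max_staleness_ms=1000.0).name)
print(pick_replica(replicas, max_staleness_ms=None).name)
```

Making the staleness bound an explicit per-query parameter keeps the trade-off visible: callers that tolerate one second of staleness avoid the slow leader, while callers that need fresh data still pay for the spike.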

Common Pitfalls

1. Overlooking the complexity of managing stateful services in Kubernetes can lead to data loss.
Many organizations underestimate the challenges of data handling during node replacements. Proper strategies must be implemented to ensure data consistency and prevent service disruption.
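
A pre-replacement safety check of the kind an operator might run can be sketched as below. The sketch assumes majority quorum per shard and one replica of each shard per Kubernetes cluster. Both are assumptions for illustration, and `shards_losing_quorum`, the node names, and the placement map are hypothetical.

```python
from typing import Iterable, List, Mapping, Sequence, Set


def shards_losing_quorum(
    placement: Mapping[str, Sequence[str]],
    healthy: Set[str],
    to_remove: Iterable[str],
) -> List[str]:
    """Return shards that would drop below a majority of live replicas.

    placement maps a shard id to the nodes holding its replicas.
    healthy is the set of nodes currently serving.
    to_remove is the set of nodes about to be replaced or lost.
    """
    remaining = healthy - set(to_remove)
    at_risk = []
    for shard, nodes in placement.items():
        alive = sum(1 for node in nodes if node in remaining)
        quorum = len(nodes) // 2 + 1  # majority quorum (assumption)
        if alive < quorum:
            at_risk.append(shard)
    return sorted(at_risk)


# Three clusters (a, b, c) with two nodes each; every shard has one replica
# in each cluster.
placement = {
    "s1": ["a1", "b1", "c1"],
    "s2": ["a2", "b2", "c2"],
    "s3": ["a1", "b2", "c2"],
}
all_nodes = {"a1", "a2", "b1", "b2", "c1", "c2"}

# Losing all of cluster "a" leaves two of three replicas for every shard.
print(shards_losing_quorum(placement, all_nodes, {"a1", "a2"}))

# With a1 already down, replacing b1 as well would break quorum for s1.
print(shards_losing_quorum(placement, all_nodes - {"a1"}, {"b1"}))
```

A replacement would proceed only when the returned list is empty. The same check also shows why spreading replicas across clusters limits the blast radius: a whole-cluster outage removes at most one replica per shard.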

Related Concepts

Distributed Databases
Kubernetes Management
AWS Cloud Services
High Availability Strategies