Achieving High Availability with distributed database on Kubernetes at Airbnb

Artem Danilov

How to achieve high availability with distributed databases on Kubernetes

Airbnb

AWS, Kubernetes, SQL

Overview

This article summarizes Airbnb's approach to achieving high availability with distributed databases on Kubernetes. It outlines the challenges of running stateful services on Kubernetes, the strategies Airbnb implemented, and the practices it developed for managing databases in a cloud environment.

What You'll Learn

1. How to deploy a distributed database cluster across multiple Kubernetes clusters

2. Why AWS EBS improves database reliability during node replacements, and how to handle its latency spikes

3. How to implement a custom Kubernetes operator for database management

4. When to use stale reads to mitigate latency spikes in distributed databases

Prerequisites & Requirements

Understanding of Kubernetes and distributed databases
Familiarity with AWS services, particularly EBS (optional)

Key Questions Answered

How does Airbnb achieve high availability with distributed databases on Kubernetes?

Airbnb achieves high availability by deploying distributed database clusters across multiple Kubernetes clusters in different AWS availability zones. This setup limits the blast radius of issues and ensures that even if one cluster faces problems, others remain operational, thus maintaining overall system reliability.
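
To illustrate why spreading replicas across zones limits the blast radius, here is a minimal sketch. It assumes majority-quorum replication with one replica per zone; the replica count of three, the zone names, and the `has_quorum` helper are illustrative assumptions, not details taken from the article.

```python
def has_quorum(replica_zones, failed_zones):
    """Return True if a strict majority of replicas survive the zone failures."""
    alive = sum(1 for zone in replica_zones if zone not in failed_zones)
    return alive > len(replica_zones) // 2


# Hypothetical placement: one replica of a shard in each of three zones.
spread = ["az-a", "az-b", "az-c"]
# Counter-example: all three replicas placed in the same zone.
colocated = ["az-a", "az-a", "az-a"]

print(has_quorum(spread, {"az-a"}))     # True: two of three replicas remain
print(has_quorum(colocated, {"az-a"}))  # False: no replicas remain
```

With replicas spread across zones, losing any single zone still leaves a majority, which is the property the multi-cluster layout relies on.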

What challenges does Kubernetes present for managing stateful services like databases?

Kubernetes poses challenges for stateful services because it is not aware of how data is distributed across nodes. Node replacements therefore need careful handling to avoid losing data quorum and disrupting service. Airbnb uses AWS EBS volumes for storage, which can be reattached quickly when a node is replaced.

What strategies does Airbnb use to handle node replacements in their database clusters?

Airbnb groups node replacement events into three categories: database-initiated events, proactive infrastructure events, and unplanned failures. Custom checks and admission hooks in its Kubernetes operator handle these events safely, preserving data consistency and availability during replacements.
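
The kind of check such an operator might run before allowing a replacement can be sketched as follows. This is an assumption-laden illustration, not Airbnb's implementation: the `safe_to_replace` function, the shard-to-node map, and the majority rule are all hypothetical, and a real operator would fetch replica placement from the database itself.

```python
def safe_to_replace(node, shard_replicas, unavailable):
    """Allow replacing `node` only if every shard it hosts keeps a majority.

    shard_replicas: dict mapping shard id -> list of nodes holding a replica.
    unavailable: set of nodes that are already down or being replaced.
    """
    down = set(unavailable) | {node}
    for nodes in shard_replicas.values():
        if node not in nodes:
            continue  # this shard is not affected by the replacement
        healthy = [n for n in nodes if n not in down]
        if len(healthy) <= len(nodes) // 2:
            return False  # the replacement would cost this shard its quorum
    return True


# Hypothetical placement of two shards over four nodes.
shards = {"s1": ["n1", "n2", "n3"], "s2": ["n2", "n3", "n4"]}

print(safe_to_replace("n1", shards, set()))   # True: s1 keeps n2 and n3
print(safe_to_replace("n2", shards, {"n3"}))  # False: s1 would keep only n1
```

An admission hook could call a check like this and reject a pod eviction when it returns False, deferring the replacement until the other replicas are healthy again.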

How does Airbnb mitigate latency spikes when using AWS EBS?

To mitigate latency spikes with AWS EBS, Airbnb implemented a storage read timeout session variable that lets a query give up on a slow storage node and retry against other storage nodes. They also use stale reads, which replicas can serve without contacting the leader, so a latency spike on the leader has less impact.
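
The timeout-and-retry idea can be sketched with a toy model. Everything here is hypothetical: the `StorageNode` class, its simulated latencies, and `read_with_retry` stand in for the database's own read path, and the article's session variable is not reproduced.

```python
class StorageNode:
    """Toy storage node with a fixed simulated read latency (hypothetical)."""

    def __init__(self, name, latency_ms, data):
        self.name = name
        self.latency_ms = latency_ms
        self.data = data

    def read(self, key, timeout_ms):
        # Simulate a stalled volume: fail once the timeout would be exceeded.
        if self.latency_ms > timeout_ms:
            raise TimeoutError(f"{self.name} exceeded {timeout_ms} ms")
        return self.data[key]


def read_with_retry(key, nodes, timeout_ms):
    """Try each node in order, moving on whenever a read times out."""
    last_error = None
    for node in nodes:
        try:
            return node.read(key, timeout_ms), node.name
        except TimeoutError as err:
            last_error = err
    raise TimeoutError("all storage nodes timed out") from last_error


slow = StorageNode("node-1", latency_ms=500, data={"k": "v"})
fast = StorageNode("node-2", latency_ms=5, data={"k": "v"})

print(read_with_retry("k", [slow, fast], timeout_ms=100))  # ('v', 'node-2')
```

The design choice is to bound how long one slow volume can hold up a query: a short timeout turns a latency spike on one node into a small retry cost instead of a stalled request.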

Key Statistics & Figures

Peak cluster throughput

3M QPS

The largest database cluster at Airbnb processes 3 million queries per second across 150 storage nodes.

Data storage

300+ TB

Airbnb's database setup stores over 300 terabytes of data across 4 million internal shards.

Availability

99.95%

With the techniques described here, Airbnb's database infrastructure reaches 99.95% availability.
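
To put these figures in perspective, the short calculation below converts them into a downtime budget and rough per-node and per-shard averages. The averages assume an even distribution and decimal units, which the article does not state, so treat them as order-of-magnitude estimates only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_30_DAYS = 30 * 24 * 60


def downtime_budget_minutes(availability_pct, period_minutes):
    """Minutes of allowed unavailability in a period at the given availability."""
    return period_minutes * (100 - availability_pct) / 100


def average_qps_per_node(total_qps, node_count):
    """Average load per node, assuming queries are spread evenly."""
    return total_qps / node_count


def average_shard_size_mb(total_tb, shard_count):
    """Average shard size in MB, using decimal units (1 TB = 1,000,000 MB)."""
    return total_tb * 1_000_000 / shard_count


print(round(downtime_budget_minutes(99.95, MINUTES_PER_YEAR), 1))
print(round(downtime_budget_minutes(99.95, MINUTES_PER_30_DAYS), 1))
print(average_qps_per_node(3_000_000, 150))
print(average_shard_size_mb(300, 4_000_000))
```

At 99.95% availability the budget works out to roughly 263 minutes of downtime per year, or about 22 minutes per 30 days. Under the even-distribution assumption, 3M QPS over 150 storage nodes averages 20,000 QPS per node, and 300 TB over 4 million shards averages about 75 MB per shard.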

Technologies & Tools

Orchestration

Kubernetes

Used for managing distributed database clusters in a cloud environment.

Storage

AWS EBS

Provides durable storage volumes for database nodes, facilitating quick reattachment during node replacements.

Automation

Kubernetes Operator

Custom operator used to manage database operations and node replacements effectively.

Key Actionable Insights

1. Deploy a distributed database across multiple Kubernetes clusters to improve availability.
This approach limits the impact of a failure to a single cluster, so the overall system stays operational even when one cluster has problems.

2. Use AWS EBS for its durability and quick reattachment during node replacements.
This helps maintain high availability while simplifying the management of storage volumes in a cloud environment.

3. Adopt a custom Kubernetes operator to manage complex database operations.
A custom operator can automate Kubernetes operations and tailor them to the specific needs of your database, improving reliability during events such as node replacements.

4. Use stale reads in your database queries to reduce the impact of latency spikes.
Replicas can keep serving requests quickly even when the leader is experiencing delays.
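
One generic way to reason about stale reads is bounded staleness: a read may go to any replica whose data is recent enough for the caller. The sketch below illustrates that idea only. The `Replica` fields, the `pick_read_target` helper, and the staleness budget are hypothetical and do not describe the mechanism used by Airbnb's database.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    is_leader: bool
    lag_ms: int      # how far this replica's data trails the leader
    latency_ms: int  # currently observed read latency


def pick_read_target(replicas, max_staleness_ms):
    """Pick the fastest replica whose lag fits the staleness budget.

    Assumes the list includes the leader, which always qualifies.
    """
    eligible = [
        r for r in replicas if r.is_leader or r.lag_ms <= max_staleness_ms
    ]
    return min(eligible, key=lambda r: r.latency_ms)


replicas = [
    Replica("leader", is_leader=True, lag_ms=0, latency_ms=800),  # in a spike
    Replica("follower-1", is_leader=False, lag_ms=200, latency_ms=4),
    Replica("follower-2", is_leader=False, lag_ms=5000, latency_ms=3),
]

print(pick_read_target(replicas, max_staleness_ms=1000).name)  # follower-1
print(pick_read_target(replicas, max_staleness_ms=0).name)     # leader
```

A query that tolerates one second of staleness avoids the slow leader here, while a query that needs fresh data still goes to the leader and pays the latency.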

Common Pitfalls

1. Overlooking the complexity of managing stateful services in Kubernetes can lead to data loss.

Many organizations underestimate how difficult data handling is during node replacements. Replacements need explicit safeguards, such as checks that a node can be removed without losing data quorum, to keep data consistent and prevent service disruption.

Related Concepts

Distributed Databases

Kubernetes Management

AWS Cloud Services

High Availability Strategies