Mussel — Airbnb’s Key-Value Store for Derived Data

Shouyan Guo

How Airbnb built a persistent, highly available, low-latency key-value storage engine for accessing derived data from offline and…

Airbnb

Apache, DynamoDB, Java, MySQL

Overview

The article discusses Mussel, Airbnb's scalable key-value store designed for derived data. It outlines the evolution of data storage solutions at Airbnb, detailing the technologies used and the architectural improvements that led to Mussel's development.

What You'll Learn

1. How to implement a scalable key-value store using HRegion and Kafka
2. Why leaderless replication improves read scalability in distributed systems
3. How to manage partition mapping with Apache Helix
4. When to use bulk load pipelines for data ingestion

Prerequisites & Requirements

Understanding of distributed systems and key-value stores
Familiarity with Apache Kafka and Apache Helix (optional)

Key Questions Answered

What are the key features of Airbnb's Mussel key-value store?

Mussel is designed for high reliability, availability, scalability, and low latency. It supports both real-time and batch-update data with timestamp-based conflict resolution, and it utilizes HRegion as its storage engine, allowing efficient read and write operations.
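
To make timestamp-based conflict resolution concrete, here is a minimal sketch of a last-write-wins table in plain Python. It is an illustration under assumptions, not Mussel's or HRegion's API: the names `LastWriteWinsTable` and `Versioned`, the millisecond unit, and the tie-breaking rule (an equal timestamp keeps the existing value) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Versioned:
    value: bytes
    timestamp_ms: int  # write time supplied by the producer (unit is an assumption)


class LastWriteWinsTable:
    """Toy in-memory table; a hypothetical stand-in, not Mussel's real interface."""

    def __init__(self) -> None:
        self._cells: Dict[str, Versioned] = {}

    def put(self, key: str, value: bytes, timestamp_ms: int) -> bool:
        """Apply a write; return True if it became the visible value."""
        current = self._cells.get(key)
        # Assumption: on equal timestamps the existing value is kept.
        if current is not None and timestamp_ms <= current.timestamp_ms:
            return False
        self._cells[key] = Versioned(value, timestamp_ms)
        return True

    def get(self, key: str) -> Optional[bytes]:
        cell = self._cells.get(key)
        return cell.value if cell is not None else None


table = LastWriteWinsTable()
table.put("listing:42", b"batch-v1", timestamp_ms=1_000)
table.put("listing:42", b"stream-v2", timestamp_ms=2_000)
# A late batch write carrying an older timestamp does not overwrite newer data.
table.put("listing:42", b"batch-late", timestamp_ms=1_500)
```

With this rule, the order in which batch and real-time writes arrive does not matter: whichever write carries the newest timestamp stays visible.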

How does Mussel handle data partitioning and replication?

Mussel uses Apache Helix to manage partition mapping, allowing for 1024 logical shards across multiple storage nodes. Kafka is employed for leaderless replication, ensuring consistent write ordering and improved read scalability.
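
The mapping from keys to logical shards to storage nodes can be sketched as follows. In Mussel this mapping is managed by Apache Helix; the code below does not use the Helix API. The MD5-based hash, the round-robin placement, the node names, and the replica count of 3 are assumptions chosen only for illustration. The 1024 shard count comes from the text.

```python
import hashlib
from typing import Dict, List

NUM_LOGICAL_SHARDS = 1024  # figure taken from the text


def shard_for_key(key: str) -> int:
    """Map a key to a logical shard with a stable hash (hash choice is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def assign_shards(nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Place every logical shard on `replicas` distinct nodes, round-robin."""
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(replicas)]
        for shard in range(NUM_LOGICAL_SHARDS)
    }


def nodes_for_key(key: str, mapping: Dict[int, List[str]]) -> List[str]:
    """Return the nodes that hold the shard this key belongs to."""
    return mapping[shard_for_key(key)]


# Hypothetical four-node cluster with three copies of each shard.
mapping = assign_shards(["node-a", "node-b", "node-c", "node-d"], replicas=3)
replicas_for_listing = nodes_for_key("listing:42", mapping)
```

Because keys hash to a fixed set of logical shards rather than directly to machines, adding or removing a node only changes the shard-to-node mapping; the key-to-shard function stays the same.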

What improvements were made from Nebula to Mussel?

Mussel improved upon Nebula by integrating both read and write capabilities for real-time and batch-update data, enhancing scalability and reducing the maintenance overhead associated with using multiple storage systems.

What is the performance of the Mussel key-value store?

Mussel has achieved over 99.9% availability, with an average read QPS exceeding 800k and write QPS over 35k, while maintaining an average P95 read latency of less than 8ms, demonstrating its efficiency in handling high traffic.

Key Statistics & Figures

Availability

>99.9%

This reflects Mussel's reliability in production environments.

Average read QPS

>800k

Indicates Mussel's capacity to handle high read request volumes.

Average write QPS

>35k

Demonstrates Mussel's efficiency in processing write operations.

Average P95 read latency

<8ms

Shows that, on average, 95% of reads complete in under 8 ms.

Data stored in production clusters

~130TB

This is the total amount of data managed by Mussel across ~4000 tables.

Technologies & Tools

Backend

HRegion

Used as the primary storage engine in Mussel.

Backend

Kafka

Serves as a write-ahead log for leaderless replication.

Backend

Apache Helix

Manages partition mapping and shard allocation.

Backend

Spark

Used for data transformation and loading into Mussel.

Storage

S3

Stores HFile data for batch updates.

Key Actionable Insights

1. Implementing a leaderless architecture can significantly enhance read scalability in distributed systems.
By allowing any node to handle read requests, systems can better manage high traffic loads without being bottlenecked by a single leader node.

2. Utilizing Apache Helix for partition management simplifies scaling and resource allocation in large data systems.
This approach automates the mapping of logical shards to physical nodes, reducing manual overhead and increasing system resilience.

3. Adopting a bulk load strategy can drastically reduce data ingestion times and costs.
Mussel's ability to only load delta data instead of full snapshots has improved efficiency, allowing for significant reductions in daily data loads.
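
The delta-load idea in the last insight can be illustrated with a small sketch. The source does not describe how Mussel computes the delta, so the comparison below is an assumption: plain dictionaries stand in for two daily snapshots, and `compute_delta` is a hypothetical helper, not part of any Mussel, Spark, or HBase API.

```python
from typing import Dict, Set, Tuple

Snapshot = Dict[str, bytes]


def compute_delta(previous: Snapshot, current: Snapshot) -> Tuple[Snapshot, Set[str]]:
    """Return (rows to upsert, keys to delete) needed to turn `previous` into `current`."""
    # A row is upserted if it is new or its value changed since the last snapshot.
    upserts = {k: v for k, v in current.items() if previous.get(k) != v}
    # A key is deleted if it existed before and is gone now.
    deletes = set(previous) - set(current)
    return upserts, deletes


yesterday = {"a": b"1", "b": b"2", "c": b"3"}
today = {"a": b"1", "b": b"20", "d": b"4"}
upserts, deletes = compute_delta(yesterday, today)
# Only "b" (changed) and "d" (new) are loaded; "a" is unchanged and skipped.
```

When most rows are unchanged from day to day, the upsert set is much smaller than the full snapshot, which is where the reduction in daily load volume comes from.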

Common Pitfalls

1. Failing to manage partition mappings effectively can lead to scalability issues.
As data grows, manual adjustments become cumbersome and can hinder performance. Using tools like Apache Helix can automate this process and improve system resilience.

2. Neglecting the impact of write operations on read performance can degrade user experience.
In a read-heavy environment, ensuring that write traffic does not overwhelm read paths is crucial. Implementing a leaderless architecture with Kafka can help mitigate this issue.
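
As a rough sketch of leaderless replication through a shared log, the code below models one ordered log that every replica consumes independently. `SharedLog` is a stand-in for a single Kafka partition and does not use any Kafka client API; all class and method names are hypothetical. It also illustrates the trade-off: a replica that has not caught up yet serves stale data until it does.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Tuple[str, bytes]  # (key, value)


class SharedLog:
    """Stand-in for one Kafka partition: an append-only, totally ordered list."""

    def __init__(self) -> None:
        self._entries: List[LogEntry] = []

    def append(self, key: str, value: bytes) -> int:
        self._entries.append((key, value))
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset: int) -> List[LogEntry]:
        return self._entries[offset:]


class Replica:
    """Applies log entries in offset order; any replica can serve reads."""

    def __init__(self, log: SharedLog) -> None:
        self._log = log
        self._next_offset = 0
        self._state: Dict[str, bytes] = {}

    def catch_up(self) -> None:
        for key, value in self._log.read_from(self._next_offset):
            self._state[key] = value
            self._next_offset += 1

    def get(self, key: str) -> Optional[bytes]:
        return self._state.get(key)


log = SharedLog()
r1, r2 = Replica(log), Replica(log)
log.append("listing:42", b"v1")
log.append("listing:42", b"v2")

r1.catch_up()
stale = r2.get("listing:42")  # r2 has not consumed the log yet
r2.catch_up()                 # after catching up, both replicas agree
```

Writes go to the log rather than to a leader node, so every replica applies them in the same order and read traffic can be spread across all replicas of a shard.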

Related Concepts

Distributed Systems

Key-value Stores

Data Partitioning

Eventual Consistency