Mussel — Airbnb’s Key-Value Store for Derived Data

How Airbnb built a persistent, high availability and low latency key-value storage engine for accessing derived data from offline and…

Shouyan Guo
11 min read · intermediate

Overview

The article discusses Mussel, Airbnb's scalable key-value store designed for derived data. It outlines the evolution of data storage solutions at Airbnb, detailing the technologies used and the architectural improvements that led to Mussel's development.

What You'll Learn

  1. How to implement a scalable key-value store using HRegion and Kafka
  2. Why leaderless replication improves read scalability in distributed systems
  3. How to manage partition mapping with Apache Helix
  4. When to use bulk load pipelines for data ingestion

Prerequisites & Requirements

  • Understanding of distributed systems and key-value stores
  • Familiarity with Apache Kafka and Apache Helix (optional)

Key Questions Answered

What are the key features of Airbnb's Mussel key-value store?
Mussel is designed for high reliability, availability, scalability, and low latency. It supports both real-time and batch-update data with timestamp-based conflict resolution, and it utilizes HRegion as its storage engine, allowing efficient read and write operations.
How does Mussel handle data partitioning and replication?
Mussel uses Apache Helix to manage partition mapping, allowing for 1024 logical shards across multiple storage nodes. Kafka is employed for leaderless replication, ensuring consistent write ordering and improved read scalability.
What improvements were made from Nebula to Mussel?
Mussel improved upon Nebula by integrating both read and write capabilities for real-time and batch-update data, enhancing scalability and reducing the maintenance overhead associated with using multiple storage systems.
What is the performance of the Mussel key-value store?
Mussel has achieved over 99.9% availability, with an average read QPS exceeding 800k and write QPS over 35k, while maintaining an average P95 read latency of less than 8ms, demonstrating its efficiency in handling high traffic.
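
The timestamp-based conflict resolution mentioned above can be illustrated with a small last-write-wins sketch. This is not Mussel's code: the class name `LastWriteWinsStore`, the millisecond timestamps, and the rule that a write with an equal timestamp loses to the stored value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Versioned:
    """A stored value together with the timestamp of the write that produced it."""
    value: bytes
    timestamp_ms: int


class LastWriteWinsStore:
    """Hypothetical in-memory store that resolves conflicts by timestamp."""

    def __init__(self) -> None:
        self._data: Dict[str, Versioned] = {}

    def put(self, key: str, value: bytes, timestamp_ms: int) -> bool:
        """Apply the write only if it is newer than the stored version.

        Returns True when the write was applied, False when it was dropped.
        Assumption: on equal timestamps the existing value is kept.
        """
        current = self._data.get(key)
        if current is not None and current.timestamp_ms >= timestamp_ms:
            return False
        self._data[key] = Versioned(value, timestamp_ms)
        return True

    def get(self, key: str) -> Optional[bytes]:
        """Return the current value for the key, or None if it is absent."""
        entry = self._data.get(key)
        return entry.value if entry is not None else None
```

With this rule, a batch update and a real-time update to the same key can arrive in any order and every replica still converges on the value with the highest timestamp.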

Key Statistics & Figures

Availability
>99.9%
This reflects Mussel's reliability in production environments.
Average read QPS
>800k
Indicates Mussel's capacity to handle high read request volumes.
Average write QPS
>35k
Demonstrates Mussel's efficiency in processing write operations.
Average P95 read latency
<8ms
On average, 95% of reads complete in under 8ms.
Data stored in production clusters
~130TB
This is the total amount of data managed by Mussel across ~4000 tables.

Technologies & Tools

Backend
HRegion
Used as the primary storage engine in Mussel.
Backend
Kafka
Serves as a write-ahead log for leaderless replication.
Backend
Apache Helix
Manages partition mapping and shard allocation.
Backend
Spark
Used for data transformation and loading into Mussel.
Storage
S3
Stores HFile data for batch updates.
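
To make the "Kafka as a write-ahead log for leaderless replication" entry more concrete, here is a minimal in-memory sketch. It does not use the Kafka client API: `PartitionLog` is a hypothetical stand-in for a single ordered Kafka partition, and `Replica` is a hypothetical storage node. The point it illustrates is that every replica consumes the same log in the same order, so no replica has to act as leader.

```python
from typing import Dict, List, Optional, Tuple


class PartitionLog:
    """In-memory stand-in for one Kafka partition: an append-only, ordered log."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def append(self, key: str, value: str) -> int:
        """Append a write and return its offset in the log."""
        self._entries.append((key, value))
        return len(self._entries) - 1

    def read_from(self, offset: int) -> List[Tuple[str, str]]:
        """Return all entries at or after the given offset."""
        return self._entries[offset:]


class Replica:
    """Hypothetical storage node that applies log entries in log order."""

    def __init__(self, log: PartitionLog) -> None:
        self._log = log
        self._offset = 0  # next log offset this replica has not yet applied
        self._data: Dict[str, str] = {}

    def catch_up(self) -> int:
        """Apply all unread entries; return how many were applied."""
        entries = self._log.read_from(self._offset)
        for key, value in entries:
            self._data[key] = value
        self._offset += len(entries)
        return len(entries)

    def get(self, key: str) -> Optional[str]:
        """Serve a read from local state; may be stale until catch_up runs."""
        return self._data.get(key)
```

A replica that has not yet called `catch_up` returns older data, which is the eventual-consistency trade-off that comes with letting any replica serve reads.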

Key Actionable Insights

  1. Implementing a leaderless architecture can significantly enhance read scalability in distributed systems.
     Because any node can serve read requests, the system can absorb high read traffic without being bottlenecked by a single leader node.
  2. Using Apache Helix for partition management simplifies scaling and resource allocation in large data systems.
     It automates the mapping of logical shards to physical nodes, reducing manual overhead and increasing system resilience.
  3. Adopting a bulk load strategy can reduce data ingestion time and cost.
     Mussel loads only the delta rather than a full snapshot, which reduces the amount of data loaded each day.
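
The delta-loading idea can be sketched as a comparison between two snapshots. The summary does not say how Mussel computes its delta, so the function `compute_delta` and the dictionary-based snapshots below are assumptions chosen only to show the principle: load what changed, not everything.

```python
from typing import Dict, List, Tuple


def compute_delta(
    previous: Dict[str, int], current: Dict[str, int]
) -> Tuple[Dict[str, int], List[str]]:
    """Return (upserts, deletes) needed to turn `previous` into `current`.

    upserts: keys that are new or whose value changed, with the new value.
    deletes: keys present in `previous` but absent from `current`, sorted.
    """
    upserts = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    deletes = sorted(key for key in previous if key not in current)
    return upserts, deletes


def apply_delta(
    table: Dict[str, int], upserts: Dict[str, int], deletes: List[str]
) -> Dict[str, int]:
    """Apply a delta to a copy of the table and return the result."""
    result = dict(table)
    result.update(upserts)
    for key in deletes:
        result.pop(key, None)
    return result
```

When only a small fraction of rows change between daily snapshots, the delta is correspondingly small, which is where the ingestion savings come from.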

Common Pitfalls

  1. Failing to manage partition mappings effectively can lead to scalability issues.
     As data grows, manual adjustments become cumbersome and can hinder performance. A tool such as Apache Helix can automate this process and improve system resilience.
  2. Neglecting the impact of write operations on read performance can degrade user experience.
     In a read-heavy environment, write traffic must not overwhelm the read path. A leaderless architecture with Kafka can help mitigate this issue.
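
As a rough illustration of what partition mapping involves, the sketch below hashes keys into 1024 logical shards and assigns each shard to a set of nodes. In Mussel this mapping is managed by Apache Helix; the code does not use Helix, and the MD5-based hash, the round-robin placement, the replica count, and the node names are all assumptions for the example.

```python
import hashlib
from typing import Dict, List

NUM_SHARDS = 1024  # logical shard count cited in the summary


def shard_for_key(key: str) -> int:
    """Map a key to a logical shard with a stable hash.

    Assumption: MD5 is used only because it is stable across processes,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def assign_shards(nodes: List[str], replicas: int = 3) -> Dict[int, List[str]]:
    """Assign each logical shard to `replicas` distinct nodes, round-robin."""
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(replicas)]
        for shard in range(NUM_SHARDS)
    }


def nodes_for_key(key: str, mapping: Dict[int, List[str]]) -> List[str]:
    """Look up the nodes that hold the shard a key belongs to."""
    return mapping[shard_for_key(key)]
```

Because keys map to logical shards rather than directly to machines, adding or removing a node only requires recomputing the shard-to-node mapping; the key-to-shard function stays fixed.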

Related Concepts

Distributed Systems
Key-value Stores
Data Partitioning
Eventual Consistency