Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go

Xu Ning, Maxim Fateev

Uber


Overview

Cherami is a distributed, scalable, durable, and highly available message queue system developed by Uber Engineering to transport asynchronous tasks. It is designed to be resilient and fault-tolerant, allowing Uber's mission-critical components to depend on it for message delivery.

What You'll Learn

1. How to implement a durable and scalable message queue system using Cherami
2. Why eventual consistency is critical in distributed systems
3. How to handle message redelivery and failure recovery in Cherami

Prerequisites & Requirements

Understanding of distributed systems concepts
Familiarity with Go programming language

Key Questions Answered

What is Cherami and how does it function as a message queue system?

Cherami is a distributed, scalable, and durable message queue system developed by Uber Engineering. It allows asynchronous task transport and is designed to be resilient and fault-tolerant, ensuring message delivery even during hardware failures or network partitions.

How does Cherami ensure durability and fault tolerance?

Cherami achieves durability and fault tolerance by replicating each message across different storage hosts. Because every message exists on more than one host, it remains readable even if some hardware fails, and Cherami can continue accepting new messages during the failure.
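
The replication idea can be illustrated with a minimal in-memory sketch. This is not Cherami's code or API: the replica type, replicatedWrite, and readAny are hypothetical names, and the policy of acknowledging a write only when every replica has stored it is an assumption made for the example, since this summary does not state the acknowledgement rule.

```go
package main

import (
	"errors"
	"sync"
)

// errReplicaDown is returned by a replica that is marked unavailable.
var errReplicaDown = errors.New("replica unavailable")

// replica is a hypothetical in-memory stand-in for one storage host.
type replica struct {
	mu   sync.Mutex
	down bool
	log  []string
}

// append stores msg on this replica unless the replica is down.
func (r *replica) append(msg string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.down {
		return errReplicaDown
	}
	r.log = append(r.log, msg)
	return nil
}

// read returns the message at offset i unless the replica is down.
func (r *replica) read(i int) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.down {
		return "", errReplicaDown
	}
	if i < 0 || i >= len(r.log) {
		return "", errors.New("offset out of range")
	}
	return r.log[i], nil
}

// replicatedWrite appends msg to all replicas concurrently and reports
// success only if every append succeeded (assumed acknowledgement policy).
func replicatedWrite(replicas []*replica, msg string) error {
	errs := make([]error, len(replicas))
	var wg sync.WaitGroup
	for i, r := range replicas {
		wg.Add(1)
		go func(i int, r *replica) {
			defer wg.Done()
			errs[i] = r.append(msg)
		}(i, r)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// readAny tries the replicas in order and returns the first successful read,
// so one healthy copy is enough to serve a message.
func readAny(replicas []*replica, offset int) (string, error) {
	lastErr := errors.New("no replicas configured")
	for _, r := range replicas {
		msg, err := r.read(offset)
		if err == nil {
			return msg, nil
		}
		lastErr = err
	}
	return "", lastErr
}
```

In this sketch a message written to three replicas stays readable through readAny after one replica goes down, while a later write to the same set fails. How Cherami itself keeps accepting new messages after such a failure is not modeled here.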

What are the key design elements of Cherami?

Key design elements of Cherami include failure recovery and replication, scaling of writes, and consumption handling. These elements ensure that Cherami can handle high throughput and provide reliable message delivery while maintaining system performance.

What are AP and CP queues in the context of Cherami?

AP queues in Cherami are eventually consistent: they do not require quorum-level consistency, so during a network partition both sides can keep accepting writes. CP queues require linearizable extent creation, so that during a network partition only one side can create a new extent, which preserves message order.

Key Statistics & Figures

Tasks transported daily: hundreds of millions

Cherami currently transports hundreds of millions of tasks durably per day among Uber Engineering’s microservices.

Technologies & Tools


Programming Language

Go

Cherami is written entirely in Go, which facilitates building highly performant and concurrent system software.

Database

Cassandra

Cassandra is used for metadata storage, allowing for high availability and tunable consistency.

Database

RocksDB

RocksDB is used as the storage engine for messages, providing performance and indexing features.

RPC Framework

TChannel

TChannel is used for RPC in Cherami.

Library

Ringpop

Ringpop is used for health checking and group membership.

Key Actionable Insights

1. Implementing Cherami can improve the resilience of your distributed systems by ensuring message durability and fault tolerance. This is particularly important for applications that require high availability and cannot afford message loss during hardware failures.

2. Utilizing the competing consumers pattern in Cherami allows for efficient task processing across multiple workers, enhancing scalability. This approach is beneficial in high-load scenarios where tasks need to be processed quickly and efficiently.

3. Understanding the trade-offs between AP and CP queues can help you design systems that meet your specific availability and consistency requirements. Choosing the right queue type is crucial for applications that operate in environments with varying network reliability.
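
The competing consumers pattern mentioned above can be sketched in plain Go with a shared channel standing in for the queue. This is a generic illustration, not the Cherami client API: processAll and its parameters are hypothetical names chosen for the example.

```go
package main

import "sync"

// processAll starts several worker goroutines that compete for tasks on one
// shared channel. Each task is received by exactly one worker, and the
// results are collected in a map keyed by task value.
func processAll(tasks []int, workers int, handle func(int) int) map[int]int {
	if workers < 1 {
		workers = 1 // at least one consumer is needed to drain the queue
	}
	queue := make(chan int)
	results := make(map[int]int, len(tasks))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Workers pull from the same channel, so a slow worker
			// simply takes fewer tasks than a fast one.
			for t := range queue {
				r := handle(t)
				mu.Lock()
				results[t] = r
				mu.Unlock()
			}
		}()
	}

	for _, t := range tasks {
		queue <- t
	}
	close(queue) // signals the workers that no more tasks will arrive
	wg.Wait()
	return results
}
```

For instance, processAll([]int{1, 2, 3, 4, 5}, 3, square) would spread five tasks over three workers, where square is any caller-supplied function. Adding workers raises throughput without changing the producer, which is the scalability property the pattern is used for.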

Common Pitfalls

1. A common pitfall in implementing message queues is underestimating the complexity of failure recovery and message redelivery. Without a robust strategy for handling failures, systems can experience message loss or delays, undermining the reliability of the entire application.
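
One common redelivery strategy is to requeue a message whenever its handler fails and to set it aside after a fixed number of attempts. The sketch below shows that idea with an in-memory queue. The message and queue types, the drain method, and the maxAttempts limit are all hypothetical and are not taken from Cherami, whose redelivery mechanism is not described in this summary.

```go
package main

import "errors"

// errTransient is a sample failure a handler can return to request redelivery.
var errTransient = errors.New("transient failure")

// message carries an ID and the number of delivery attempts made so far.
type message struct {
	ID       string
	Attempts int
}

// queue is a hypothetical in-memory queue with a dead-letter list.
type queue struct {
	pending     []message
	deadLetters []message
	maxAttempts int
}

// drain delivers pending messages until none remain. A nil return from
// handle acknowledges the message. An error requeues it at the back of the
// queue, unless it has reached maxAttempts, in which case it is moved to
// deadLetters instead of being retried forever.
func (q *queue) drain(handle func(message) error) {
	for len(q.pending) > 0 {
		m := q.pending[0]
		q.pending = q.pending[1:]
		m.Attempts++

		if err := handle(m); err == nil {
			continue // acknowledged: the message is done
		}
		if m.Attempts >= q.maxAttempts {
			q.deadLetters = append(q.deadLetters, m)
			continue
		}
		q.pending = append(q.pending, m) // schedule redelivery
	}
}
```

Because a failed message can be delivered more than once, handlers in a design like this need to be idempotent. The attempt limit keeps a permanently failing message from blocking or delaying the rest of the queue.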

Related Concepts

Distributed Systems

Message Queuing

Fault Tolerance

Eventual Consistency