Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go

Xu Ning, Maxim Fateev
16 min read · intermediate

Overview

Cherami is a distributed, scalable, durable, and highly available message queue system developed by Uber Engineering to transport asynchronous tasks. It is designed to be resilient and fault-tolerant, allowing Uber's mission-critical components to depend on it for message delivery.

What You'll Learn

1. How to implement a durable and scalable message queue system using Cherami
2. Why eventual consistency is critical in distributed systems
3. How to handle message redelivery and failure recovery in Cherami

Prerequisites & Requirements

  • Understanding of distributed systems concepts
  • Familiarity with Go programming language

Key Questions Answered

What is Cherami and how does it function as a message queue system?
Cherami is a distributed, scalable, and durable message queue system developed by Uber Engineering. It allows asynchronous task transport and is designed to be resilient and fault-tolerant, ensuring message delivery even during hardware failures or network partitions.
How does Cherami ensure durability and fault tolerance?
Cherami achieves durability and fault tolerance by replicating messages across different storage hosts. Because each message is stored on more than one host, it remains readable even if some hardware fails, and Cherami can continue accepting new messages during such failures.
What are the key design elements of Cherami?
Key design elements of Cherami include failure recovery and replication, scaling of writes, and consumption handling. These elements ensure that Cherami can handle high throughput and provide reliable message delivery while maintaining system performance.
What are AP and CP queues in the context of Cherami?
AP queues in Cherami favor availability: they are eventually consistent and do not require quorum-level consistency during a network partition, so writes can continue on both sides of the partition. CP queues favor consistency: extent creation must be linearizable, so that during a network partition only one side can create a new extent, which preserves message order.
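
The replication answer above can be illustrated with a small in-memory sketch. This is not Cherami's storage code or API: `StoreHost`, `ReplicatedWrite`, and `ReadAny` are hypothetical names, and the policy shown (acknowledge a write only when every replica is up, read from any surviving replica) is an assumption chosen to keep the example short. How Cherami itself reacts to a failed replica on the write path is not modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// StoreHost is a hypothetical in-memory stand-in for one storage host.
type StoreHost struct {
	Name string
	Up   bool
	msgs []string
}

// ErrHostDown is returned when an operation needs a host that is not up.
var ErrHostDown = errors.New("storage host down")

// ReplicatedWrite appends m to every replica, but only if all replicas are
// up. Checking first keeps the replicas identical, so an offset means the
// same thing on every host. (Assumed policy for this sketch.)
func ReplicatedWrite(replicas []*StoreHost, m string) error {
	for _, h := range replicas {
		if !h.Up {
			return fmt.Errorf("replica %s: %w", h.Name, ErrHostDown)
		}
	}
	for _, h := range replicas {
		h.msgs = append(h.msgs, m)
	}
	return nil
}

// ReadAny returns the message at offset from the first replica that is up
// and holds that offset. It fails only if no replica can serve the read.
func ReadAny(replicas []*StoreHost, offset int) (string, error) {
	for _, h := range replicas {
		if h.Up && offset >= 0 && offset < len(h.msgs) {
			return h.msgs[offset], nil
		}
	}
	return "", errors.New("no healthy replica holds this offset")
}

func main() {
	replicas := []*StoreHost{
		{Name: "store-1", Up: true},
		{Name: "store-2", Up: true},
		{Name: "store-3", Up: true},
	}
	if err := ReplicatedWrite(replicas, "task-1"); err != nil {
		fmt.Println("write failed:", err)
		return
	}

	// Simulate a hardware failure on one host; the message is still readable.
	replicas[0].Up = false
	msg, err := ReadAny(replicas, 0)
	fmt.Println(msg, err)
}
```

In this sketch a write is rejected while any replica is down, which trades write availability for simplicity; a real system needs a strategy for continuing to accept writes when a host fails.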

Key Statistics & Figures

Tasks transported daily: hundreds of millions
Cherami currently transports hundreds of millions of tasks durably per day among Uber Engineering’s microservices.

Key Actionable Insights

1. Implementing Cherami can significantly improve the resilience of your distributed systems by ensuring message durability and fault tolerance.
   This is particularly important for applications that require high availability and cannot afford message loss during hardware failures.
2. Utilizing the competing consumers pattern in Cherami allows for efficient task processing across multiple workers, enhancing scalability.
   This approach is beneficial in high-load scenarios where tasks need to be processed quickly and efficiently.
3. Understanding the trade-offs between AP and CP queues can help you design systems that meet your specific availability and consistency requirements.
   Choosing the right queue type is crucial for applications that operate in environments with varying network reliability.
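
The competing consumers pattern mentioned above can be sketched with plain Go channels and goroutines. This is a generic illustration of the pattern, not the Cherami client API: `Task` and `processAll` are hypothetical names, and an in-process channel stands in for the queue.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of work; it is not a Cherami type.
type Task struct {
	ID int
}

// processAll starts `workers` competing consumers that all receive from the
// same channel. Each task is received by exactly one worker. It returns how
// many tasks each worker handled.
func processAll(tasks []Task, workers int, handle func(Task)) []int {
	queue := make(chan Task)
	counts := make([]int, workers)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range queue {
				handle(t)
				counts[id]++ // each worker writes only its own slot
			}
		}(w)
	}

	// Publish every task, then close the channel so the workers exit.
	for _, t := range tasks {
		queue <- t
	}
	close(queue)
	wg.Wait()
	return counts
}

func main() {
	tasks := make([]Task, 10)
	for i := range tasks {
		tasks[i] = Task{ID: i}
	}

	counts := processAll(tasks, 3, func(t Task) {
		// Real work would go here; handlers run concurrently.
	})

	total := 0
	for _, c := range counts {
		total += c
	}
	fmt.Println("per-worker counts:", counts, "total:", total)
}
```

How the tasks split across workers varies from run to run, but the total always equals the number of tasks published. Adding workers raises throughput without changing the producer.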

Common Pitfalls

1. A common pitfall in implementing message queues is underestimating the complexity of failure recovery and message redelivery.
   Without a robust strategy for handling failures, systems can experience message loss or delays, undermining the reliability of the entire application.
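
One common redelivery strategy can be sketched as follows: track each delivered message until it is acknowledged, redeliver it when the acknowledgement deadline passes, and set it aside after too many attempts. This is a minimal sketch under stated assumptions, not Cherami's implementation: `Tracker`, the logical clock passed as `now`, the `maxAttempts` limit, and the `DeadLetters` list are all hypothetical choices made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// delivery records the state of one in-flight message.
type delivery struct {
	attempts int
	deadline int64 // logical time by which an ack must arrive
}

// Tracker is a hypothetical in-memory redelivery tracker.
type Tracker struct {
	ackTimeout  int64
	maxAttempts int
	inflight    map[string]*delivery
	DeadLetters []string // messages that exhausted their attempts
}

func NewTracker(ackTimeout int64, maxAttempts int) *Tracker {
	return &Tracker{
		ackTimeout:  ackTimeout,
		maxAttempts: maxAttempts,
		inflight:    make(map[string]*delivery),
	}
}

// Deliver records that msgID was handed to a consumer at time now.
func (t *Tracker) Deliver(msgID string, now int64) {
	d, ok := t.inflight[msgID]
	if !ok {
		d = &delivery{}
		t.inflight[msgID] = d
	}
	d.attempts++
	d.deadline = now + t.ackTimeout
}

// Ack marks msgID as processed and reports whether it was in flight.
func (t *Tracker) Ack(msgID string) bool {
	if _, ok := t.inflight[msgID]; !ok {
		return false
	}
	delete(t.inflight, msgID)
	return true
}

// Expired returns, in sorted order, the IDs whose ack deadline has passed
// and that should be redelivered. Messages that already used maxAttempts
// deliveries are moved to DeadLetters instead.
func (t *Tracker) Expired(now int64) []string {
	var redeliver []string
	for id, d := range t.inflight {
		if now < d.deadline {
			continue
		}
		if d.attempts >= t.maxAttempts {
			t.DeadLetters = append(t.DeadLetters, id)
			delete(t.inflight, id)
			continue
		}
		redeliver = append(redeliver, id)
	}
	sort.Strings(redeliver)
	return redeliver
}

func main() {
	tr := NewTracker(10, 2) // ack within 10 ticks, at most 2 deliveries
	tr.Deliver("a", 0)
	tr.Deliver("b", 0)
	tr.Ack("a") // "a" is processed in time; "b" is never acknowledged

	for _, id := range tr.Expired(10) {
		tr.Deliver(id, 10) // second and final attempt for "b"
	}
	tr.Expired(20)
	fmt.Println("dead letters:", tr.DeadLetters)
}
```

Because a message can be delivered more than once under this scheme, consumers should be written so that processing the same message twice is harmless.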

Related Concepts

Distributed Systems
Message Queuing
Fault Tolerance
Eventual Consistency