How Uber Optimized Cassandra Operations At Scale

Jaydeepkumar Chovatia, Gopal Mor, Runtian Liu

Uber


Apache Cassandra, Java

Overview

This article describes how Uber improved its operation of the open-source Apache Cassandra database at scale. It covers the architecture of the Cassandra deployment, the operational problems the team ran into, and the fixes that improved reliability and performance.

What You'll Learn

1. How to implement a managed service for Apache Cassandra
2. Why incremental changes can improve system reliability at scale
3. How to effectively manage node replacements in a Cassandra cluster
4. When to use Cassandra’s Lightweight Transactions for business-critical operations

Prerequisites & Requirements

Understanding of distributed databases and their operational challenges
Familiarity with Apache Cassandra and its architecture (optional)

Key Questions Answered

How does Uber manage its Cassandra operations at scale?

Uber runs Cassandra as a managed service that provides 99.99% availability and 24/7 support. The service integrates with Uber’s ecosystem for configuration management and observability, and it handles tens of millions of queries per second across tens of thousands of nodes.

What challenges does Uber face with node replacements in Cassandra?

Node replacements are triggered by hardware failures and fleet optimization, and at Uber's scale they were unreliable. Decommissioning could get stuck and data could become inconsistent during a replacement, which added operational overhead and called for more robust handling.

What improvements have been made to Cassandra’s Lightweight Transactions?

Uber improved error handling within the Gossip protocol, which significantly reduced the error rate of Lightweight Transactions (LWTs). The article reports no LWT errors in the twelve months before it was written.

How does Uber handle data inconsistencies in Cassandra?

Uber addresses data inconsistencies through an automated repair scheduler integrated within Cassandra. This scheduler orchestrates repairs across the cluster, ensuring that data is consistent and reducing the operational burden of manual repairs.
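To make the idea of orchestrating repairs across a cluster more concrete, here is a minimal sketch of one possible scheduling rule. It is not Uber's scheduler: the `NodeRepairState` type, the `replicas` field, and the `pick_next_repair` function are hypothetical names introduced for illustration, and the policy (repair the least recently repaired node, never two nodes that share replicas at once) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class NodeRepairState:
    """Hypothetical per-node bookkeeping for a repair scheduler."""
    node_id: str
    last_repaired: float  # epoch seconds; 0.0 means never repaired
    # Assumption: IDs of nodes that hold replicas of the same token ranges.
    replicas: Set[str] = field(default_factory=set)


def pick_next_repair(
    nodes: List[NodeRepairState],
    in_progress: Set[str],
    max_parallel: int = 1,
) -> Optional[str]:
    """Return the node to repair next, or None if nothing should start now."""
    if len(in_progress) >= max_parallel:
        return None

    # A node is busy if it is repairing or shares replicas with a repairing node.
    busy = set(in_progress)
    for node in nodes:
        if node.node_id in in_progress:
            busy |= node.replicas

    candidates = [n for n in nodes if n.node_id not in busy]
    if not candidates:
        return None

    # Oldest repair first; node_id breaks ties deterministically.
    return min(candidates, key=lambda n: (n.last_repaired, n.node_id)).node_id
```

A caller would invoke `pick_next_repair` on a timer, start a repair on the returned node, and update `last_repaired` when it completes. Keeping the rule a pure function makes the policy easy to test separately from whatever actually runs the repair.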

Key Statistics & Figures

Availability of Cassandra service

99.99%

This availability is ensured through a managed service that supports Uber’s application teams.

Queries handled per second

Tens of millions

This high query volume is supported by Uber's extensive Cassandra deployment.

Nodes in the Cassandra fleet

Tens of thousands

The fleet spans multiple regions and supports various critical workloads.

Node replacement reliability

99.99%

This reliability was achieved after implementing several improvements to the node replacement process.

Technologies & Tools

Database

Apache Cassandra

Used as a managed service to handle Uber’s core services and workloads.

Key Actionable Insights

1. Implement a proactive approach to managing orphan hint files in Cassandra to avoid performance degradation.
   Purging orphan hint files regularly prevents terabytes of unnecessary data from accumulating, which would otherwise slow down node decommissioning and increase operational overhead.

2. Use a repair scheduler integrated within Cassandra to automate data consistency checks.
   This reduces the need for manual intervention and ensures that data inconsistencies are addressed promptly, improving overall system reliability.

3. Enhance your monitoring of node replacements to quickly identify and resolve issues.
   Exposing JMX metrics for decommissioning and bootstrapping nodes gives you visibility into node state, so you can act before a replacement causes prolonged downtime.
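The orphan hint file insight can be sketched as a dry-run scan. The sketch assumes that an "orphan" is a hint file whose target host ID no longer belongs to the ring, and that hint files follow Cassandra's `<host-id>-<timestamp>-<version>.hints` naming. The `find_orphan_hints` function and the way live host IDs are obtained are hypothetical; the sketch only lists candidates and deletes nothing.

```python
import uuid
from pathlib import Path
from typing import Iterable, List


def find_orphan_hints(hints_dir: Path, live_host_ids: Iterable[str]) -> List[Path]:
    """List hint files whose target host ID is not in the set of live hosts.

    Assumption: file names start with the 36-character host UUID.
    """
    live = {uuid.UUID(h) for h in live_host_ids}
    orphans: List[Path] = []
    for path in sorted(hints_dir.glob("*.hints")):
        try:
            target = uuid.UUID(path.name[:36])
        except ValueError:
            # Not a recognizable hint file name; leave it alone.
            continue
        if target not in live:
            orphans.append(path)
    return orphans
```

In practice the live host IDs would come from the cluster's own view of the ring, and a real purge would need to coordinate with the running node and handle any companion files rather than delete files directly. Starting with a dry run keeps the risky step separate and reviewable.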

Common Pitfalls

1. Failing to address orphan hint files can lead to significant performance issues during node replacements.
   Orphan hint files accumulate over time, which delays node decommissioning and increases operational overhead.

2. Not monitoring the decommissioning process can result in prolonged downtimes.
   If the control plane cannot probe whether a node is still decommissioning or has finished, a stuck decommission goes unnoticed until someone intervenes manually.
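As an illustration of probing decommission state, the sketch below classifies a node from the `Mode:` line that `nodetool netstats` prints. It swaps text parsing in for the JMX metrics the article mentions. The function names, the status labels, and the six-hour timeout are all hypothetical choices for this example.

```python
import re
from typing import Optional

# Matches the "Mode: <STATE>" line at the start of `nodetool netstats` output.
_MODE_RE = re.compile(r"^Mode:\s*(\w+)", re.MULTILINE)


def parse_mode(netstats_output: str) -> Optional[str]:
    """Extract the operation mode (e.g. NORMAL, LEAVING), or None if absent."""
    match = _MODE_RE.search(netstats_output)
    return match.group(1).upper() if match else None


def decommission_status(
    mode: Optional[str],
    started_at: float,
    now: float,
    timeout_s: float = 6 * 3600,  # hypothetical threshold, tune per cluster
) -> str:
    """Classify a decommission as done, in_progress, stuck, or unknown."""
    if mode == "DECOMMISSIONED":
        return "done"
    if mode == "LEAVING":
        return "stuck" if now - started_at > timeout_s else "in_progress"
    # Any other mode (or no mode at all) means the probe cannot tell.
    return "unknown"
```

A control plane could call these on each poll and alert on `stuck` or repeated `unknown` results. Treating an unreadable state as `unknown` rather than as success is the point: the pitfall is a probe that cannot tell the two apart.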

Related Concepts

Distributed Databases

Operational Challenges In Large-scale Systems

Data Consistency In Distributed Systems