How Uber Conquered Database Overload: The Journey from Static Rate-Limiting to Intelligent Load Management

Dhyanam Vaidya, Prathamesh Deshpande, Mike Ma, Chaitanya Yalamanchili

Overview

This article details Uber's multi-year evolution from static, quota-based rate limiting to an intelligent, priority-aware load management system for its distributed databases (Docstore and Schemaless). The journey progresses through three phases: initial CoDel-based queuing, priority-aware Cinnamon load shedding, and finally a unified load shedding engine with pluggable signals, which achieved an 80% throughput improvement and a ~70% P99 latency reduction under overload conditions.

What You'll Learn

1. Why concurrency-based signals are more reliable than QPS-based rate limiting for detecting database overload
2. How to implement priority-aware load shedding using CoDel queues and Cinnamon's PID-based control
3. How to design a pluggable overload detection framework using a Bring Your Own Signal (BYOS) architecture
4. When to use per-tenant concurrency limits (Scorecard) versus system-wide load shedding for multitenant fairness
5. Why placing overload management close to the storage layer is more effective than managing it at the stateless routing layer

Prerequisites & Requirements

  • Understanding of distributed database architectures (sharding, replication, leader-follower patterns)
  • Familiarity with rate limiting concepts (token bucket, quota-based systems, 429 responses)
  • Basic understanding of queuing theory and Little's Law (Concurrency = Throughput × Latency)
  • Knowledge of PID controllers and feedback-based control systems (optional)
  • Experience with multitenant system design and noisy neighbor problems (optional)

Key Questions Answered

Why did Uber's quota-based rate limiting approach fail for database overload protection?
Uber's quota-based rate limiting failed for three key reasons: it added unnecessary complexity by requiring a Redis call for every request (introducing a new failure point), the cost model was too imprecise because MySQL full table scans and single-row reads were assigned identical capacity costs, and static quotas required constant manual adjustment by stakeholders, making them ineffective in multitenant environments.
How does CoDel (Controlled Delay) improve database queuing under overload compared to FIFO?
CoDel improves queuing by switching from FIFO to adaptive LIFO under pressure. Pure FIFO creates a trap during overload where old requests accumulate, get abandoned or retried by clients, wasting work while fresh requests wait idle. CoDel monitors how long requests wait in the queue rather than queue length, and under pressure favors newer requests that still have a chance to succeed, shedding stale work and improving responsiveness.
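The adaptive-LIFO behavior described in this answer can be sketched as follows. This is a minimal illustration, not Uber's implementation: the class name `AdaptiveQueue` and the `target_wait` and `max_wait` thresholds are assumptions, and real CoDel tracks the minimum wait over an interval rather than checking only the oldest entry.

```python
import time
from collections import deque


class AdaptiveQueue:
    """FIFO when healthy; LIFO plus stale-request shedding under pressure."""

    def __init__(self, target_wait=0.005, max_wait=0.100, clock=time.monotonic):
        self.target_wait = target_wait  # assumed: oldest wait above this means pressure
        self.max_wait = max_wait        # assumed: requests older than this are shed
        self.clock = clock
        self._items = deque()           # (enqueue_time, request), oldest on the left

    def put(self, request):
        self._items.append((self.clock(), request))

    def get(self):
        """Return (request, shed). request is None when nothing is left to serve."""
        now = self.clock()
        shed = []
        # Shed work that has waited so long the client has likely given up.
        while self._items and now - self._items[0][0] > self.max_wait:
            shed.append(self._items.popleft()[1])
        if not self._items:
            return None, shed
        # The signal is waiting time, not queue length.
        under_pressure = now - self._items[0][0] > self.target_wait
        if under_pressure:
            _, request = self._items.pop()      # LIFO: newest request first
        else:
            _, request = self._items.popleft()  # FIFO: oldest request first
        return request, shed
```

A worker loop would call `get()`, reject everything in `shed` with an overload error, and process the returned request.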
What is Uber's Cinnamon load shedder and how does it differ from CoDel?
Cinnamon is Uber's priority-aware load shedder that replaced CoDel. Unlike CoDel, which treated all requests equally, Cinnamon considers request rank (priority tiers from t0 for critical infrastructure to t5 for least important), dynamic system state, and relative workload importance. Instead of CoDel's fixed, static timeouts, it uses a PID-based controller to dynamically adjust queue timeouts and inflight limits based on real-time latency and error signals.
How does Uber ensure fairness in a multitenant database environment?
Uber uses the Scorecard engine, a rule-based admission control component that enforces per-tenant concurrency limits independently of system load. It isolates and caps misbehaving tenants without disrupting others, balancing stability during normal load with strict limits under stress. Scorecard operates in parallel with dynamic overload detectors, providing predictable fairness and blast radius control when multiple tenants compete for shared resources.
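A per-tenant concurrency cap of the kind this answer describes might look like the sketch below. The class `TenantLimiter`, its `default_limit`, and the `overrides` map are hypothetical names for illustration; the source does not describe Scorecard's actual rule format.

```python
import threading
from collections import defaultdict


class TenantLimiter:
    """Caps in-flight requests per tenant, independent of overall system load."""

    def __init__(self, default_limit, overrides=None):
        self.default_limit = default_limit
        self.overrides = dict(overrides or {})  # assumed: tenant -> custom cap
        self._inflight = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, tenant):
        """Admit the request only if the tenant is below its own cap."""
        limit = self.overrides.get(tenant, self.default_limit)
        with self._lock:
            if self._inflight[tenant] >= limit:
                return False
            self._inflight[tenant] += 1
            return True

    def release(self, tenant):
        """Call when a previously admitted request finishes."""
        with self._lock:
            if self._inflight[tenant] > 0:
                self._inflight[tenant] -= 1
```

Because each tenant has its own counter, one tenant hitting its cap does not affect admission for any other tenant.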
Why is concurrency a better signal than QPS for detecting database overload?
Concurrency (number of operations currently in flight) directly reflects system load following Little's Law: Concurrency = Throughput × Latency. In stateful systems, concurrency maps closely to resource usage, making it a more dependable indicator than QPS. Simple QPS-based rate limiting is too coarse because it fails to account for workload variability, often shedding too late or too early regardless of actual system pressure.
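A small worked example of Little's Law shows why two workloads with identical QPS can put very different load on a node. The figures are illustrative assumptions, not measurements from Uber.

```python
def concurrency(throughput_qps, latency_ms):
    """Little's Law: average operations in flight = throughput x latency."""
    return throughput_qps * latency_ms / 1000


# Same request rate, very different pressure on the node (hypothetical numbers).
point_lookups = concurrency(throughput_qps=1000, latency_ms=2)
table_scans = concurrency(throughput_qps=1000, latency_ms=500)

print(point_lookups)  # 2.0 operations in flight
print(table_scans)    # 500.0 operations in flight
```

A QPS limiter sees both workloads as 1,000 requests per second; a concurrency limiter sees a 250x difference.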
What performance improvements did Uber achieve with the unified Cinnamon load shedding engine?
Uber's unified Cinnamon engine achieved an 80% increase in throughput under overload (5,400 vs 3,000 QPS average), approximately 70% reduction in P99 latency (1.0s vs 3.1s for upserts), approximately 93% fewer goroutines during overload (peak 10,000 vs 150,000), and approximately 60% lower heap usage (1 GB max vs 5-6 GB spikes). The PID-based controller also produced smoother, more predictable shedding behavior compared to token bucket limiters.
What are Uber's plug-in regulators and what overload scenarios do they address?
Plug-in regulators are node-local overload detectors that enforce invariants the system must not violate, addressing subtle skews that concurrency saturation alone misses. Uber uses four regulators: a write bytes regulator limiting concurrent write volume to prevent I/O saturation, a partition key regulator throttling hot partition key traffic, a memory regulator tracking free process memory, and a goroutines regulator tracking total goroutine count against thresholds.
How does Uber handle overload signals from remote nodes in a distributed database?
Uber enhanced Cinnamon to support pluggable external signals like follower commit index lag, enabling globally informed, priority-aware shedding decisions within the same admission control path. Previously, external token-bucket-based rate limiters handled remote shedding but proved ineffective at scale, causing split-brain behaviors. The unified approach consolidates local and remote overload logic into a single control loop using a BYOS (Bring Your Own Signal) framework.
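One way to picture a Bring Your Own Signal design is a single shedder that accepts any number of registered signal callbacks and makes one priority-aware decision from them. The sketch below is an assumption about how such a framework could be shaped: the `LoadShedder` class, the normalization of every signal to a pressure value in [0, 1], taking the maximum across signals, and the linear mapping from pressure to shed tiers are all hypothetical.

```python
from typing import Callable, Dict

Signal = Callable[[], float]  # assumed contract: returns pressure in [0, 1]


class LoadShedder:
    """Combines pluggable signals into one priority-aware admit decision."""

    LOWEST_TIER = 5  # tiers run from t0 (most critical) to t5 (least important)

    def __init__(self):
        self._signals: Dict[str, Signal] = {}

    def register(self, name: str, signal: Signal) -> None:
        self._signals[name] = signal

    def pressure(self) -> float:
        # The worst signal wins, whether it is local or remote.
        readings = [min(1.0, max(0.0, s())) for s in self._signals.values()]
        return max(readings, default=0.0)

    def admit(self, tier: int) -> bool:
        # Higher pressure sheds more tiers, lowest priority first.
        # In this sketch t0 is always admitted.
        tiers_shed = int(self.pressure() * self.LOWEST_TIER)
        return tier <= self.LOWEST_TIER - tiers_shed


shedder = LoadShedder()
shedder.register("local_concurrency", lambda: 0.2)    # hypothetical local signal
shedder.register("follower_commit_lag", lambda: 0.5)  # hypothetical remote signal
print(shedder.admit(3), shedder.admit(4))  # True False
```

Adding a new overload signal then means registering one more callback rather than deploying a separate rate limiter with its own decision path.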

Key Statistics & Figures

  • Monthly active users served: 170 million+ (riders, Uber Eats users, drivers, and couriers across all Uber services)
  • Throughput increase under overload: 80% (average of 5,400 QPS with the unified Cinnamon engine versus 3,000 QPS before)
  • P99 latency reduction: ~70% (upsert average of 1.0 seconds versus 3.1 seconds)
  • Goroutine reduction during overload: ~93% (peak of 10,000 versus 150,000 goroutines)
  • Heap usage reduction: ~60% (1 GB max versus 5-6 GB spikes)
  • Database storage scale: tens of petabytes (operational data across thousands of clusters)
  • Request throughput: tens of millions per second (requests served by Docstore and Schemaless, with billions of rows read or updated)

Technologies & Tools

  • Docstore (Database): Uber's in-house distributed database built on MySQL supporting transactions with full CRUD operations
  • Schemaless (Database): Uber's in-house distributed database built on MySQL optimized for append-only workloads
  • MySQL (Database): Underlying storage engine for Docstore and Schemaless, with locally attached NVMe SSDs
  • Redis (Cache): Used in the initial quota-based rate limiting approach to store quota usage across stateless routing nodes
  • Raft (Consensus Protocol): Coordinates replication across partitions (one leader, two followers) to ensure strong consistency
  • CoDel (Controlled Delay) (Algorithm): Initial queuing mechanism adapted from networking to combat bufferbloat, using adaptive LIFO under pressure
  • Cinnamon (Load Shedding Framework): Priority-aware load shedder with PID-based control and Auto Tuner for dynamic inflight limit adjustment
  • PID Controller (Control System): Dynamically adjusts queue timeouts and inflight limits based on real-time latency and error rate signals
  • NVMe SSD (Storage Hardware): Locally attached storage for MySQL nodes to support high-throughput, low-latency workloads

Key Actionable Insights

1. Place overload protection as close to the storage layer as possible rather than at the routing or gateway layer. Uber's initial approach of rate limiting in the stateless query engine failed because it couldn't accurately track real-time health across thousands of partitions, requiring expensive Redis calls and introducing a new failure point.
   In stateful systems, the storage layer has the most accurate picture of actual resource usage, making it the ideal location for admission control decisions. This eliminates the need to propagate health information across stateless nodes.
2. Use concurrency (in-flight operations) rather than QPS as your primary overload signal for database systems. Following Little's Law (Concurrency = Throughput × Latency), concurrency directly reflects actual system load and maps closely to resource usage in stateful systems, while QPS-based limiting fails to account for workload variability.
   A full table scan query that returns one row and a simple point lookup may have the same QPS cost but vastly different resource consumption. Concurrency captures this difference naturally.
3. Implement priority-aware load shedding that drops low-priority traffic (batch jobs, pipelines, garbage collection) before user-facing requests. At Uber, many overloads stemmed from asynchronous jobs that shouldn't have the same survivability as ride requests or real-time pricing queries.
   Use a tiering model (e.g., t0 for critical infrastructure, t1 for user-facing online traffic, t5 for least important) and assign default priorities based on calling service when no explicit priority is present.
4. Replace static rate limiting configurations with PID-based control loops that dynamically adjust queue timeouts and inflight limits. CoDel's fixed, static wait times led to thundering herd problems where simultaneously rejected requests all retried at once, triggering repeated cycles of overload and rejection.
   PID regulation acts like a dimmer switch rather than a hammer, incorporating system history and directional trends for smooth, sustained overload control. Uber saw significantly reduced premature shedding after adopting this approach.
5. Combine system-wide overload protection with per-tenant fairness enforcement operating in parallel. During system-wide stress, shed by priority; when a single noisy actor hogs resources without triggering global overload, use per-tenant concurrency limits (like Uber's Scorecard) that work independently of system load.
   This dual approach ensures that both systemic overload and localized noisy neighbor issues are handled with precision, reducing blast radius during incidents.
6. Design your load management system as a pluggable platform with a Bring Your Own Signal (BYOS) architecture rather than building it for a single overload signal. Uber's initial design was excellent for concurrency-based shedding but wasn't extensible to new signals like follower commit index lag or memory pressure.
   As systems grow, new overload patterns will inevitably emerge. A pluggable framework lets teams embed new overload signals and route them to the right control path without redesigning the core engine.
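The PID-based control loop recommended in insight 4 can be illustrated with a textbook PID controller driving an inflight limit. This is a simplified sketch under stated assumptions: the gains, the 10% step scaling, and the `InflightLimiter` class are hypothetical, and it reacts to latency only, whereas the source says Cinnamon also uses error signals and adjusts queue timeouts.

```python
class PIDController:
    """Textbook PID: output = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self._integral = 0.0
        self._prev_error = None

    def update(self, error, dt=1.0):
        self._integral += error * dt  # history of past error
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt  # directional trend
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


class InflightLimiter:
    """Nudges the inflight limit so observed latency tracks a target."""

    def __init__(self, target_latency_ms, limit, min_limit=1.0, max_limit=1000.0):
        self.target_latency_ms = target_latency_ms
        self.limit = float(limit)
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.pid = PIDController(kp=0.5, ki=0.1, kd=0.1)  # assumed gains

    def observe(self, latency_ms):
        # Positive error means latency is under target, so there is headroom.
        error = (self.target_latency_ms - latency_ms) / self.target_latency_ms
        output = self.pid.update(error)
        scaled = self.limit * (1.0 + 0.1 * output)  # assumed 10% step scaling
        self.limit = max(self.min_limit, min(self.max_limit, scaled))
        return self.limit
```

Because the integral and derivative terms carry history and trend, the limit moves gradually instead of snapping between fully open and fully closed, which is the "dimmer switch" behavior described above.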

Common Pitfalls

1. Using QPS-based or bytes-processed rate limiting as the primary overload signal for databases. Uber found that a query performing a full table scan but returning a single row was assigned the same capacity cost as a query reading only one row, making quota enforcement unreliable and failing to differentiate lightweight from heavyweight operations.
   Concurrency (in-flight operations) is a far more reliable signal because it directly reflects actual system load following Little's Law, naturally capturing the difference in resource consumption between different query types.
2. Placing rate limiting logic in the stateless routing layer where it must maintain real-time health information for thousands of storage partitions. This introduces massive tracking overhead, requires an external dependency (Redis) for every request, and undermines the scalability benefits of a stateless architecture.
   Move admission control to the stateful storage layer where the state lives and where the system has full context about actual resource pressure.
3. Using static, fixed queue timeouts and concurrency limits (as in CoDel) that require frequent manual tuning. These static parameters led to thundering herd problems at Uber where simultaneously rejected requests all retried at once, triggering repeated cycles of overload and rejection that amplified the blast radius.
   PID-based dynamic control adapts to real-time conditions automatically, absorbing pressure without overreacting and preventing the vicious cycle of premature shedding followed by retry storms.
4. Treating all requests equally during load shedding without priority awareness. CoDel dropped low-priority batch jobs and user-facing ride requests alike, leading to bad customer experiences and increased on-call load during overload events.
   Implement a tiering model for request priority and ensure the load shedder drops lower-priority traffic first. This protects critical user-facing flows even during severe overload.
5. Using separate, disconnected rate limiters for different overload signals (e.g., token bucket for commit lag, concurrency limiter for local load). Uber found that these external token-bucket-based limiters caused split-brain behaviors and globally suboptimal shedding decisions when not coordinated.
   Unify all overload signals into a single control loop with a pluggable architecture so that shedding decisions are holistic, consistent, and consider all sources of pressure simultaneously.

Related Concepts

Little's Law
CoDel (Controlled Delay) Algorithm
PID Controller Theory
Raft Consensus Protocol
Load Shedding and Admission Control
Token Bucket Rate Limiting
Adaptive LIFO Queuing
Bufferbloat
Thundering Herd Problem
Noisy Neighbor Problem in Multitenant Systems
Database Sharding and Partitioning
Leader-Follower Replication
Commit Index Lag
Auto-Tuning Concurrency Limits
FIFO vs LIFO Queue Strategies
Microservices Overload Protection