Overview
This article details Uber's multi-year evolution from static, quota-based rate limiting to an intelligent, priority-aware load management system for their distributed databases (Docstore and Schemaless). The journey progresses through three phases: initial CoDel-based queuing, priority-aware Cinnamon load shedding, and finally a unified load shedding engine with pluggable signals, achieving 80% throughput improvement and ~70% P99 latency reduction under overload conditions.
What You'll Learn
Why concurrency-based signals are more reliable than QPS-based rate limiting for detecting database overload
How to implement priority-aware load shedding using CoDel queues and Cinnamon's PID-based control
How to design a pluggable overload detection framework using a Bring Your Own Signal (BYOS) architecture
When to use per-tenant concurrency limits (Scorecard) versus system-wide load shedding for multitenant fairness
Why placing overload management close to the storage layer is more effective than managing it at the stateless routing layer
Prerequisites & Requirements
- Understanding of distributed database architectures (sharding, replication, leader-follower patterns)
- Familiarity with rate limiting concepts (token bucket, quota-based systems, 429 responses)
- Basic understanding of queuing theory and Little's Law (Concurrency = Throughput × Latency)
- Knowledge of PID controllers and feedback-based control systems(optional)
- Experience with multitenant system design and noisy neighbor problems(optional)
Key Questions Answered
Why did Uber's quota-based rate limiting approach fail for database overload protection?
How does CoDel (Controlled Delay) improve database queuing under overload compared to FIFO?
What is Uber's Cinnamon load shedder and how does it differ from CoDel?
How does Uber ensure fairness in a multitenant database environment?
Why is concurrency a better signal than QPS for detecting database overload?
What performance improvements did Uber achieve with the unified Cinnamon load shedding engine?
What are Uber's plug-in regulators and what overload scenarios do they address?
How does Uber handle overload signals from remote nodes in a distributed database?
Key Statistics & Figures
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Key Actionable Insights
1Place overload protection as close to the storage layer as possible rather than at the routing or gateway layer. Uber's initial approach of rate limiting in the stateless query engine failed because it couldn't accurately track real-time health across thousands of partitions, requiring expensive Redis calls and introducing a new failure point.In stateful systems, the storage layer has the most accurate picture of actual resource usage, making it the ideal location for admission control decisions. This eliminates the need to propagate health information across stateless nodes.
2Use concurrency (in-flight operations) rather than QPS as your primary overload signal for database systems. Following Little's Law (Concurrency = Throughput × Latency), concurrency directly reflects actual system load and maps closely to resource usage in stateful systems, while QPS-based limiting fails to account for workload variability.A full table scan query that returns one row and a simple point lookup may have the same QPS cost but vastly different resource consumption. Concurrency captures this difference naturally.
3Implement priority-aware load shedding that drops low-priority traffic (batch jobs, pipelines, garbage collection) before user-facing requests. At Uber, many overloads stemmed from asynchronous jobs that shouldn't have the same survivability as ride requests or real-time pricing queries.Use a tiering model (e.g., t0 for critical infrastructure, t1 for user-facing online traffic, t5 for least important) and assign default priorities based on calling service when no explicit priority is present.
4Replace static rate limiting configurations with PID-based control loops that dynamically adjust queue timeouts and inflight limits. CoDel's fixed, static wait times led to thundering herd problems where simultaneously rejected requests all retried at once, triggering repeated cycles of overload and rejection.PID regulation acts like a dimmer switch rather than a hammer, incorporating system history and directional trends for smooth, sustained overload control. Uber saw significantly reduced premature shedding after adopting this approach.
5Combine system-wide overload protection with per-tenant fairness enforcement operating in parallel. During system-wide stress, shed by priority; when a single noisy actor hogs resources without triggering global overload, use per-tenant concurrency limits (like Uber's Scorecard) that work independently of system load.This dual approach ensures that both systemic overload and localized noisy neighbor issues are handled with precision, reducing blast radius during incidents.
6Design your load management system as a pluggable platform with a Bring Your Own Signal (BYOS) architecture rather than building it for a single overload signal. Uber's initial design was excellent for concurrency-based shedding but wasn't extensible to new signals like follower commit index lag or memory pressure.As systems grow, new overload patterns will inevitably emerge. A pluggable framework lets teams embed new overload signals and route them to the right control path without redesigning the core engine.