MySQL At Uber

Banty Kumar, Debadarsini Nayak, Raja Sriram Ganesan, Amit Jain
15 min read · advanced

Overview

The article discusses the MySQL fleet at Uber, which consists of over 2,300 independent clusters that support critical operations for the platform. It highlights the architecture, control plane operations, and improvements made to enhance MySQL availability from 99.9% to 99.99%.

What You'll Learn

1. How to manage MySQL clusters effectively at scale
2. Why MySQL control plane architecture is crucial for high availability
3. How to implement primary failover processes in MySQL
4. When to use automated schema changes in MySQL

Prerequisites & Requirements

  • Understanding of MySQL architecture and operations
  • Familiarity with Kubernetes and Docker (optional)

Key Questions Answered

How does Uber ensure high availability of its MySQL fleet?
Uber has improved MySQL fleet availability from 99.9% to 99.99% through various optimizations and a re-architecture of the control plane. This includes automated workflows for primary failover and node management that keep both downtime and data loss to a minimum.
What are the main components of the MySQL architecture at Uber?
The MySQL platform at Uber consists of several components: the control plane, the data plane, the discovery plane, observability tools, and backup/restore mechanisms. Each component plays a role in managing the lifecycle and health of MySQL clusters.
What is the primary failover process in Uber's MySQL architecture?
The primary failover process automatically moves the primary role of a cluster from one host to another to maintain write availability. It is critical for high availability, and the primary node's health is monitored continuously so that any degradation is detected.
How does Uber handle schema changes in its MySQL databases?
Uber automates schema changes through a self-serve workflow that utilizes MySQL's instant alter or Percona's pt-online-schema-change. This ensures safe, non-blocking updates while allowing for dry-run capabilities to verify compatibility before applying changes.
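
The choice between the two schema-change tools named above can be sketched as a small planning function. This is a minimal illustration, not Uber's workflow: `plan_schema_change`, `SchemaChangePlan`, and the `instant_ok` flag are hypothetical names, and deciding whether a given change is eligible for an instant alter is left to the caller. The sketch relies only on MySQL's `ALGORITHM=INSTANT` clause and on pt-online-schema-change's `--alter`, `--dry-run`, and `--execute` options with a `D=...,t=...` DSN.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaChangePlan:
    method: str                # "instant" or "pt-osc"
    sql: Optional[str]         # statement to run on the instant path
    argv: Optional[List[str]]  # command line to run on the pt-osc path


def plan_schema_change(database: str, table: str, alter_clause: str,
                       instant_ok: bool, dry_run: bool = True) -> SchemaChangePlan:
    """Build (but do not run) a schema change. All names here are hypothetical."""
    if instant_ok:
        # Requesting ALGORITHM=INSTANT explicitly makes MySQL return an error
        # instead of silently using a slower algorithm when the change
        # cannot be applied instantly.
        sql = f"ALTER TABLE `{database}`.`{table}` {alter_clause}, ALGORITHM=INSTANT"
        return SchemaChangePlan("instant", sql, None)
    # --dry-run and --execute are mutually exclusive in pt-online-schema-change;
    # the dry run creates and alters the shadow table without copying rows.
    mode = "--dry-run" if dry_run else "--execute"
    argv = ["pt-online-schema-change", "--alter", alter_clause, mode,
            f"D={database},t={table}"]
    return SchemaChangePlan("pt-osc", None, argv)
```

Defaulting `dry_run` to `True` means a caller has to opt in to the real change, which mirrors the verify-before-apply idea in the answer above.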

Key Statistics & Figures

Number of MySQL clusters at Uber
over 2,300
This extensive fleet supports a vast array of operations critical to Uber's platform.
Improvement in MySQL availability
from 99.9% to 99.99%
This enhancement was achieved through various optimizations and a re-architecture of the control plane.
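
To put the two availability figures in concrete terms, the following short calculation converts an availability percentage into the downtime it permits per year. It is plain arithmetic rather than anything taken from the article; the function name and the 365-day year are assumptions.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
# 99.9% -> 525.6 min/year
# 99.99% -> 52.6 min/year
```

Moving from 99.9% to 99.99% shrinks the yearly budget from roughly 8.8 hours to under an hour, which leaves little room for manual failover steps.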

Key Actionable Insights

1. Implementing a robust control plane for MySQL can significantly enhance operational efficiency and reliability.
   By automating workflows for cluster management and failover, teams can reduce manual intervention and improve system resilience, which is crucial for high-availability applications.
2. Using a discovery plane simplifies client interactions with MySQL clusters.
   By hiding changes in the underlying hosts, the discovery plane lets services connect to their MySQL clusters without knowing which machine currently serves them, which reduces downtime during maintenance.
3. Regularly review and optimize your MySQL backup and restore processes.
   Automated backups that maintain a low recovery point objective (RPO) and recovery time objective (RTO) protect against data loss and shorten recovery after failures.
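
The first two insights can be sketched together as a toy in-memory model: a failover routine promotes a replica when the primary is unhealthy, and a discovery mapping is updated so that clients keep resolving the cluster by name. This is only an illustration under stated assumptions. The class and function names are hypothetical, and choosing the healthy replica with the lowest replication lag is a common heuristic assumed here, not a rule described in the article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    host: str
    healthy: bool = True
    lag_seconds: float = 0.0  # replication lag behind the primary


class Cluster:
    def __init__(self, name: str, primary: Node, replicas: List[Node]):
        self.name = name
        self.primary = primary
        self.replicas = replicas


class Discovery:
    """Maps a cluster name to the host currently accepting writes."""

    def __init__(self) -> None:
        self._primary_host: Dict[str, str] = {}

    def publish(self, cluster: Cluster) -> None:
        self._primary_host[cluster.name] = cluster.primary.host

    def resolve_primary(self, cluster_name: str) -> str:
        return self._primary_host[cluster_name]


def failover_if_needed(cluster: Cluster, discovery: Discovery) -> bool:
    """Promote a replica if the primary is unhealthy; return True on failover."""
    if cluster.primary.healthy:
        return False
    candidates = [r for r in cluster.replicas if r.healthy]
    if not candidates:
        raise RuntimeError(f"no healthy replica to promote in {cluster.name}")
    # Assumed heuristic: the least-lagged replica loses the least data.
    new_primary = min(candidates, key=lambda r: r.lag_seconds)
    cluster.replicas.remove(new_primary)
    cluster.replicas.append(cluster.primary)  # old primary is demoted
    cluster.primary = new_primary
    discovery.publish(cluster)  # clients now resolve to the new host
    return True


cluster = Cluster("orders", Node("host-a"),
                  [Node("host-b", lag_seconds=2.0), Node("host-c", lag_seconds=0.5)])
discovery = Discovery()
discovery.publish(cluster)

cluster.primary.healthy = False          # simulate a degraded primary
failover_if_needed(cluster, discovery)
print(discovery.resolve_primary("orders"))  # host-c
```

Because clients only ever ask for the cluster by name, the host change stays invisible to them, which is the decoupling the second insight describes.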

Common Pitfalls

1. Tightly coupling the MySQL control plane with underlying infrastructure processes can lead to operational reliability issues.
   As the MySQL fleet grows, this coupling can block infrastructure placement operations, making it difficult to manage workflows effectively.

Related Concepts

MySQL Architecture and Operations
High Availability Strategies
Automated Workflows in Database Management