Improving MySQL® Cluster Uptime: Making MGR Viable at Scale

Siddharth Singh, Raja Sriram Ganesan, Amit Jain, Debadarsini Nayak

Uber



Overview

This article discusses how Uber improved MySQL cluster uptime by adopting MySQL Group Replication (MGR) at scale. It details the automated operations for onboarding, offboarding, and rebalancing clusters, along with the failover logic and benchmarking results that demonstrate the system's reliability and efficiency.

What You'll Learn

1. How to automate the onboarding process for MySQL clusters using MySQL Group Replication
2. Why maintaining a minimum number of nodes in a consensus group is critical for high availability
3. How to implement effective rebalancing workflows for MySQL clusters
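
The minimum-node point comes down to majority arithmetic: a consensus group can only make progress while more than half of its members are reachable. The sketch below is a minimal illustration of that arithmetic, not code from the article; the function names are hypothetical.

```python
def majority(group_size: int) -> int:
    """Smallest number of members that forms a majority."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return group_size // 2 + 1


def tolerated_failures(group_size: int) -> int:
    """How many members can fail while a majority still remains."""
    return group_size - majority(group_size)


if __name__ == "__main__":
    for size in (1, 2, 3, 5):
        print(size, majority(size), tolerated_failures(size))
```

A two-node group tolerates no failures at all, which is why three members is the usual practical floor for a group that must survive the loss of one node.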

Prerequisites & Requirements

Understanding of MySQL Group Replication concepts
Familiarity with database management and operations (optional)

Key Questions Answered

How does Uber automate the onboarding of MySQL clusters?

Uber automates the onboarding of MySQL clusters by developing a control plane that orchestrates the process. This includes selecting a healthy bootstrap node, adding other nodes to the group, and ensuring they sync data before turning off the original replication process, making the new setup the primary source of truth.
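
The onboarding sequence described above can be sketched as an ordered plan. This is an illustration only, not Uber's control plane: the `Node` type, the action names, and the choice to prefer the current primary as the bootstrap node are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    healthy: bool
    is_primary: bool = False  # role in the existing asynchronous setup


def plan_onboarding(nodes):
    """Return ordered (node_name, action) steps for moving a cluster into a group."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise ValueError("no healthy node available to bootstrap the group")

    # Assumption: prefer the current primary as the bootstrap node,
    # otherwise fall back to the first healthy node.
    bootstrap = next((n for n in healthy if n.is_primary), healthy[0])
    joiners = [n for n in healthy if n is not bootstrap]

    steps = [(bootstrap.name, "bootstrap_group")]
    for n in joiners:
        steps.append((n.name, "join_group"))
        steps.append((n.name, "wait_until_online"))  # node has synced its data
    # Only after every joiner has synced is the old replication turned off.
    for n in joiners:
        steps.append((n.name, "stop_async_replication"))
    return steps
```

Keeping the plan as plain data makes it easy to log, review, or dry-run before anything touches a live cluster.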

What steps are involved in offboarding a cluster from consensus?

The offboarding process involves gracefully removing nodes from the consensus group, starting with secondary nodes. Once removed, they are reconfigured to operate in a standard asynchronous replication setup, ensuring minimal disruption and maintaining data integrity.
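
A rough sketch of that ordering follows. The source only states that secondaries leave first and are then reconfigured for asynchronous replication; having the primary leave the group last, and the action names themselves, are assumptions for this example.

```python
def plan_offboarding(members):
    """members: list of (name, role) pairs, role being 'PRIMARY' or 'SECONDARY'.

    Returns ordered (node_name, action) steps.
    """
    primaries = [name for name, role in members if role == "PRIMARY"]
    secondaries = [name for name, role in members if role == "SECONDARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary in the group")
    primary = primaries[0]

    steps = []
    for name in secondaries:
        steps.append((name, "leave_group"))
        # Hypothetical action: point the node at the primary as an async replica.
        steps.append((name, f"replicate_async_from:{primary}"))
    # Assumption: the primary leaves last so writes stay on one node throughout.
    steps.append((primary, "leave_group"))
    return steps
```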

What are the consistency guarantees during primary failover in MySQL?

During primary failover, the new primary can either be made available to application traffic immediately or access can be restricted until the replication backlog is applied. This ensures either quick recovery or strict consistency for read operations, depending on the chosen approach.
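
The two behaviours can be modelled in a few lines. The summary does not name the setting involved; mapping them onto MySQL's `group_replication_consistency` values `EVENTUAL` and `BEFORE_ON_PRIMARY_FAILOVER` is an assumption of this sketch, and the function names are hypothetical.

```python
def consistency_statement(strict: bool) -> str:
    """Build the SQL that selects the failover consistency level."""
    level = "BEFORE_ON_PRIMARY_FAILOVER" if strict else "EVENTUAL"
    return f"SET GLOBAL group_replication_consistency = '{level}'"


def new_primary_accepts_traffic(backlog_transactions: int, strict: bool) -> bool:
    """Model whether a freshly elected primary serves application traffic.

    Without strictness it serves immediately; with strictness it waits
    until the replication backlog has been fully applied.
    """
    if backlog_transactions < 0:
        raise ValueError("backlog_transactions cannot be negative")
    return (not strict) or backlog_transactions == 0
```

The trade-off is visible in the second function: the relaxed mode recovers faster but may briefly serve reads that miss transactions still in the backlog.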

What are the key learnings from implementing MySQL Group Replication at scale?

Key learnings include the importance of monitoring memory utilization due to increased overhead with MGR, and the need for a cautious approach to bootstrapping a new consensus group to prevent split-brain scenarios, ensuring data consistency and system integrity.
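
For the memory point, a simple headroom check illustrates the kind of guard that could run before a cluster is onboarded. The parameters, the 20% reserve, and the idea of passing MGR overhead as a single estimate are all assumptions for this sketch; the source gives no figures.

```python
def memory_headroom_ok(total_bytes: int,
                       peak_used_bytes: int,
                       mgr_overhead_bytes: int,
                       reserve_fraction: float = 0.2) -> bool:
    """Return True if peak usage plus estimated MGR overhead leaves the reserve free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    projected = peak_used_bytes + mgr_overhead_bytes
    return projected <= total_bytes * (1 - reserve_fraction)
```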

Technologies & Tools


Database

MySQL

Used for implementing Group Replication to enhance uptime and reliability.

Key Actionable Insights

1. Implement automated workflows for onboarding and offboarding MySQL clusters to enhance operational efficiency. Automating these processes minimizes manual intervention and reduces the risk of human error, leading to more reliable cluster management.
2. Regularly monitor memory usage of MySQL Group Replication to ensure nodes can handle peak workloads. Proactive monitoring helps prevent performance degradation and ensures that the infrastructure can support growing demands.
3. Adopt a structured approach to handling node replacements in MySQL clusters. This ensures minimal disruption and maintains the stability of the consensus group, which is crucial for high availability.
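
One way to structure a node replacement is to add the new member before removing the old one, so the group never shrinks during the swap. The summary does not spell out the exact ordering used, so treat the sketch below as an assumption; `min_size` and the action names are hypothetical.

```python
def plan_replacement(group, old, new, min_size=3):
    """Return ordered (action, node) steps that swap `old` for `new`.

    The new node joins and catches up before the old one leaves, so the
    group size never drops below its starting value.
    """
    if old not in group:
        raise ValueError(f"{old} is not a member of the group")
    if new in group:
        raise ValueError(f"{new} is already a member of the group")
    if len(group) < min_size:
        raise ValueError("group is already below the minimum size")
    return [
        ("join_group", new),
        ("wait_until_online", new),
        ("leave_group", old),
    ]
```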

Common Pitfalls

1. Misusing the group_replication_bootstrap_group system variable can lead to split-brain scenarios.

This happens when a group is bootstrapped more than once: separate groups form, multiple nodes believe they are the primary, and the data diverges. To avoid this, use a controlled bootstrapping process that requires confirmation from both the node and the routing layer.
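
A guarded bootstrap might look like the sketch below. The three SQL statements follow the standard MySQL bootstrap sequence; what each confirmation actually checks is an assumption here (the node sees no existing group, and the routing layer knows of no primary), and the function name is hypothetical.

```python
BOOTSTRAP_SQL = (
    "SET GLOBAL group_replication_bootstrap_group = ON",
    "START GROUP_REPLICATION",
    # Switch the flag off again so a later restart cannot bootstrap a second group.
    "SET GLOBAL group_replication_bootstrap_group = OFF",
)


def bootstrap_statements(node_confirms_no_group: bool,
                         routing_confirms_no_primary: bool) -> list:
    """Return the bootstrap SQL only when both confirmations are present."""
    if not (node_confirms_no_group and routing_confirms_no_primary):
        return []  # refuse to bootstrap: a live group may already exist
    return list(BOOTSTRAP_SQL)
```

Returning an empty list rather than raising keeps the refusal explicit and easy for a calling workflow to log and retry later.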

Related Concepts

MySQL Group Replication

High Availability In Databases

Database Management Best Practices