MySQL High Availability at GitHub

Shlomi Noach

GitHub uses MySQL as its main datastore for all things non-git, and its availability is critical to GitHub’s operation. The site itself, GitHub’s API, authentication and more, all require database…

GitHub

•

Shlomi Noach

•17 min read•advanced•

--

•View Original

ConsulHAProxyMySQL

Overview

The article discusses GitHub's MySQL high availability strategy, detailing the transition from a VIP and DNS-based discovery system to a more robust solution utilizing orchestrator, Consul, and GLB. It emphasizes the importance of quick failover detection and the ability to maintain service availability across data centers.

What You'll Learn

1

How to implement a high availability solution using orchestrator, Consul, and GLB

2

Why semi-synchronous replication can lead to lossless failovers in MySQL

3

When to use anycast for network routing in a distributed database setup

Prerequisites & Requirements

Understanding of MySQL replication and high availability concepts
Familiarity with orchestrator and Consul(optional)

Key Questions Answered

How does GitHub ensure high availability for MySQL?

GitHub employs a high availability strategy that includes orchestrator for failover detection, Consul for service discovery, and GLB for load balancing. This setup allows for quick promotion of replicas to primary status and minimizes downtime during failovers, achieving typically between 10 and 13 seconds of total outage time.

What are the limitations of GitHub's MySQL high availability solution?

The solution masks the identity of applications from the primary database, which can lead to issues like split-brain scenarios during data center isolation. Additionally, there are unhandled scenarios such as outages of Consul or partial data center isolation that can complicate failover processes.

What improvements were made to the MySQL high availability strategy at GitHub?

The new strategy eliminated the reliance on VIPs and DNS for primary discovery, instead using anycast and a decoupled architecture with orchestrator, Consul, and GLB. This change enhances reliability, reduces outage times, and simplifies failover processes.

Key Statistics & Figures

Total outage time during failovers

Typically between 10 and 13 seconds

This applies to most failover scenarios, with up to 20 seconds in less frequent cases and up to 25 seconds in extreme cases.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database

Mysql

Used as the main datastore for GitHub's non-git operations.

Backend

Orchestrator

Handles failover detection and management.

Service Discovery

Consul

Provides a highly available key-value store for primary identities.

Load Balancer

Glb/Haproxy

Acts as a proxy layer between clients and MySQL writer nodes.

Networking

Anycast

Used for routing client requests to the nearest available database node.

Key Actionable Insights

1
Implementing anycast in your database architecture can significantly enhance routing efficiency and reduce failover times.
By using anycast, you ensure that clients are always routed to the nearest available database node, which can improve performance and reliability in multi-data center setups.

2
Utilizing semi-synchronous replication can help achieve lossless failovers, ensuring that no data is lost during primary node transitions.
This replication method requires that changes be acknowledged by at least one replica before the primary confirms a transaction, which is crucial for maintaining data integrity during failovers.

3
Regularly evaluate and test your high availability solutions to identify potential bottlenecks or failure points.
Continuous testing can help ensure that your failover processes are efficient and that your systems can handle unexpected outages without significant downtime.

Common Pitfalls

1

Failing to account for split-brain scenarios during data center isolation can lead to data inconsistency.

This occurs when both the primary and a replica in an isolated data center accept writes independently. Implementing a reliable STONITH mechanism can help mitigate this risk.

Related Concepts

High Availability Strategies In Distributed Systems

Mysql Replication Techniques

Service Discovery Mechanisms