Mitigating replication lag and reducing read load with freno

At GitHub, we use MySQL as the main database technology backing our services. We run classic MySQL master-replica setups, where writes go to the master and replicas replay the master's changes asynchronously. To be able to serve our traffic, we read data from the MySQL replicas.

GitHub Engineering

Overview

The article discusses how GitHub mitigates replication lag and reduces read load on MySQL databases using a service called Freno. It highlights the challenges of asynchronous replication, the importance of maintaining low replication lag, and the implementation of throttling mechanisms to improve database performance.

What You'll Learn

1. How to implement batching for large database updates to minimize replication lag
2. Why throttling is essential for managing database write loads during heavy operations
3. How to utilize Freno to improve read routing from MySQL replicas

Prerequisites & Requirements

  • Understanding of MySQL replication and database architecture
  • Familiarity with Freno and its integration in applications (optional)

Key Questions Answered

What is replication lag and why is it important?
Replication lag is the delay between when changes are made on the MySQL master and when they are reflected on replicas. It is crucial to minimize this lag to ensure users see the most up-to-date data and to maintain a good user experience.
How does Freno help in managing database write loads?
Freno continuously monitors replication lag and tells applications when they should throttle writes. Applications can then pace their write operations, which reduces the risk of overwhelming the database and keeps replicas from falling far behind the master.
How can applications reduce read load on MySQL masters?
Applications can reduce read load on MySQL masters by routing read requests to replicas when the replication lag is within acceptable limits. Freno provides the necessary lag information to determine when it is safe to read from replicas, allowing for better load distribution.
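The routing decision described in the last answer can be sketched in a few lines of Python. This is a minimal illustration, not GitHub's implementation: the function name `choose_read_target`, the 1-second default threshold, and the convention that `None` means "lag unknown" are all assumptions made for the example. In practice the lag value would come from the throttling service.

```python
from typing import Optional


def choose_read_target(
    replica_lag_seconds: Optional[float],
    max_lag_seconds: float = 1.0,  # assumed threshold, not taken from the article
) -> str:
    """Return "replica" when the reported lag is acceptable, else "master"."""
    # If the lag is unknown, fall back to the master so reads are never stale.
    if replica_lag_seconds is None:
        return "master"
    # Lag within the acceptable limit: the replica is safe to read from.
    if replica_lag_seconds <= max_lag_seconds:
        return "replica"
    # Replicas are too far behind: read from the master instead.
    return "master"
```

Falling back to the master when the lag is unknown trades some extra master load for correctness, which seems the safer default for a sketch like this.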

Key Statistics & Figures

Expected replication lag
sub-second
GitHub aims for sub-second replication lag to ensure timely data visibility on replicas.
Percentage of requests routed to replicas
30%
By using Freno, GitHub managed to route to replicas approximately 30% of the requests that were previously directed to the master.
Wait for replication before indexing jobs
less than 600ms
95% of the time, the system waits less than 600ms for data to be replicated before executing indexing jobs.
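The summary does not spell out how the wait before an indexing job is computed. One plausible approach, shown here purely as a hypothetical sketch, is to delay the job only by the part of the replication lag that has not already elapsed since the write. The function name `replication_wait_seconds` and both of its parameters are assumptions introduced for this example.

```python
def replication_wait_seconds(
    seconds_since_write: float,
    replica_lag_seconds: float,
) -> float:
    """Return how long to wait before reading the written row from a replica.

    Hypothetical rule: if the write is already older than the current
    replication lag, the replicas should have it and no wait is needed.
    Otherwise wait for the remaining difference.
    """
    return max(0.0, replica_lag_seconds - seconds_since_write)
```

Under this rule a job scheduled right after a write waits at most one lag interval, and a job that starts late does not wait at all.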

Technologies & Tools

Database
MySQL
Used as the main database technology for GitHub's services.
Backend Service
Freno
Central throttling service to manage replication lag and write operations.
Tool
pt-archiver
Used for archiving or purging old data with built-in replication lag throttling.
Tool
gh-ost
Schema migration tool that integrates with Freno for throttling during massive operations.

Key Actionable Insights

1. Implement batching for large database updates to minimize replication lag.
By breaking large updates into smaller subtasks, you can ensure that replicas can keep up with changes, thereby reducing the chances of stale data being served to users.
2. Utilize Freno to manage write operations effectively.
Freno's ability to monitor replication lag allows applications to throttle writes during heavy operations, ensuring that the database remains responsive and that users receive timely updates.
3. Adopt a proactive approach to monitoring replication lag.
Regularly polling for replication lag can help identify potential issues before they affect user experience, allowing for timely adjustments in application behavior.
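The first two insights combine naturally: split a large update into small batches and ask the throttler before each one. The sketch below shows that loop under stated assumptions. `run_throttled_batches`, `apply_batch`, and `ok_to_write` are hypothetical names; `ok_to_write` stands in for whatever check the application makes against the throttling service, and the batch size and backoff values are arbitrary.

```python
import time
from typing import Callable, Iterator, Sequence


def chunked(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def run_throttled_batches(
    ids: Sequence[int],
    apply_batch: Callable[[Sequence[int]], None],  # e.g. runs one small UPDATE
    ok_to_write: Callable[[], bool],               # stand-in for the throttler check
    batch_size: int = 100,                         # assumed value
    backoff_seconds: float = 0.25,                 # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply `apply_batch` to each chunk, pausing while writes should be throttled.

    Returns the number of batches applied.
    """
    batches = 0
    for batch in chunked(ids, batch_size):
        # Hold off while the throttler reports that replicas are lagging.
        while not ok_to_write():
            sleep(backoff_seconds)
        apply_batch(batch)
        batches += 1
    return batches
```

Passing `sleep` and `ok_to_write` in as parameters keeps the loop easy to test without a database or a running throttler.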

Common Pitfalls

1. Failing to manage replication lag can lead to stale data being served to users.
Without proper monitoring and throttling, applications may overwhelm MySQL replicas, causing delays in data visibility and a poor user experience.
2. Letting each application implement its own throttling can lead to inefficiencies.
When every application measures replication lag and throttles in its own way, the replicas are polled repeatedly and the applications can disagree about the current lag, which results in resource wastage and operational issues. A central throttling service avoids this.

Related Concepts

MySQL Replication
Database Performance Optimization
Throttling Mechanisms in Distributed Systems