Addressing the last mile problem with MySQL high availability

Sundar Raman Ganesh

•

Sundar Raman Ganesh

•8 min read•advanced•

--

•View Original

MariaDBMySQL

Overview

The article discusses how LinkedIn addresses the last mile problem in MySQL high availability by implementing Openark Orchestrator for seamless failover and improved service reliability. It outlines the challenges faced and the solutions adopted to enhance the availability of MySQL clusters.

What You'll Learn

1

How to implement Openark Orchestrator for MySQL high availability

2

Why monitoring transaction sizes is crucial for database performance

3

How to measure absolute replication lag in MySQL clusters

4

When to apply candidate selection strategies in failover scenarios

Prerequisites & Requirements

Understanding of MySQL replication and high availability concepts
Familiarity with Openark Orchestrator(optional)

Key Questions Answered

What challenges does LinkedIn face with MySQL high availability?

LinkedIn faces challenges such as managing huge transactions that can stall the database, measuring replication lag accurately, selecting the right candidate for failover, and preventing multiple failovers during network issues. These challenges impact the overall availability of MySQL services.

How does Openark Orchestrator improve MySQL failover processes?

Openark Orchestrator automates the failover process by promoting the most consistent replica to primary without manual intervention. It incorporates mechanisms to detect failures, manage network partitions, and handle both GTID and non-GTID MySQL hosts, enhancing overall system resilience.

What is the significance of monitoring transaction sizes in MySQL?

Monitoring transaction sizes helps prevent performance degradation caused by rogue transactions that modify millions of rows. By setting thresholds for modified rows, LinkedIn can proactively manage database health and avoid unnecessary failovers.

How does LinkedIn measure replication lag in its MySQL clusters?

LinkedIn measures replication lag using a heartbeat mechanism that records timestamps in the primary servers. This allows for accurate calculation of the time difference between the primary and replicas, ensuring reliable failover decisions.

Key Statistics & Figures

Uptime maintained by MySQL SRE team

99.99%

This statistic highlights the reliability of the MySQL service provided to LinkedIn's internal customers.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database

Mysql

Used as the primary relational database for applications at LinkedIn.

Tool

Openark Orchestrator

Handles seamless failover of primary nodes in MySQL clusters.

Key Actionable Insights

1
Implement a monitoring system to track transaction sizes and replication lag in MySQL clusters.
This proactive approach allows teams to identify potential performance issues before they escalate, ensuring higher availability and reliability of database services.

2
Adopt Openark Orchestrator for automated failover management in MySQL environments.
This tool simplifies the failover process, reduces downtime, and enhances the resilience of your database infrastructure, making it easier to maintain high availability.

3
Establish clear thresholds for transaction sizes to prevent performance bottlenecks.
By defining these thresholds, teams can mitigate the impact of rogue transactions and maintain optimal database performance.

4
Utilize heartbeat mechanisms to accurately measure replication lag.
This ensures that failover decisions are based on reliable data, reducing the risk of data loss during primary node failures.

Common Pitfalls

1

Failing to monitor transaction sizes can lead to performance degradation.

When large transactions are not tracked, they can overwhelm the database, causing stalls and unnecessary failovers. Implementing monitoring can prevent these issues.

2

Relying solely on Seconds_behind_master for replication lag measurement can lead to incorrect failover decisions.

This metric can report misleading values if the I/O thread is not connected, making it essential to implement additional checks like heartbeat mechanisms.

Related Concepts

Mysql Replication

High Availability Strategies

Database Performance Monitoring

Failover Mechanisms