Debugging Systems in the Cloud: MySQL, Kubernetes, and Cgroups

An overview of how we investigated and solved the issue of some Kubernetes Pods running MySQL starting up and shutting down slower than other similar Pods with the same data set.

Rodrigo Saito
9 min read, intermediate

Overview

The article discusses the challenges faced by Shopify's KateSQL, a Database-as-a-Service platform, in managing MySQL instances on Kubernetes, particularly focusing on a performance issue caused by a bug in the Linux kernel memory cgroup controller. It details the investigation process, immediate mitigations, and eventual solutions that improved MySQL Pod startup times and overall performance.

What You'll Learn

1. How to identify performance issues in Kubernetes Pods running MySQL
2. Why upgrading Kubernetes and Linux kernel versions can resolve underlying performance issues
3. How to implement strategies for mitigating slow MySQL Pod startup times

Prerequisites & Requirements

  • Understanding of Kubernetes and MySQL operations
  • Familiarity with Google Cloud Platform and Google Kubernetes Engine (optional)

Key Questions Answered

What was the root cause of slow MySQL Pod startup times in KateSQL?
The root cause was identified as a bug in the Linux kernel memory cgroup controller, which led to slow memory allocation during MySQL Pod initialization. This issue was exacerbated by the number of memory cgroups present on affected nodes, indicating a potential cgroup leak.
How did Shopify mitigate the performance issues with MySQL Pods?
Shopify implemented a strategy to replace older Kubernetes cluster nodes with new ones, which improved performance. They also explored additional fixes, including upgrading to a newer version of Google Kubernetes Engine that contained relevant bug fixes.
What immediate performance improvements were observed after specific actions?
After dropping the dentry cache and replacing older Kubernetes nodes, performance improvements were noted, with MySQL Pods showing significantly faster initialization times, such as an 80G InnoDB buffer pool being initialized in just five seconds.
What are the potential workarounds for the cgroup-related performance issues?
Potential workarounds include rebooting or cordoning the cluster node VM, setting up cronjobs to drop SLAB and page caches, and isolating short-lived Pods to dedicated node pools to prevent interference with MySQL Pods.
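
The cache-dropping workaround mentioned above could be scripted roughly as follows. This is a minimal sketch, not the tooling Shopify used: the function name and constants are hypothetical, and it relies only on the documented Linux interface /proc/sys/vm/drop_caches (1 frees the page cache, 2 frees reclaimable slab objects such as dentries and inodes, 3 frees both). It must run as root on the node, and dropping caches can temporarily slow other workloads, so treat it as a stopgap.

```python
import os
from pathlib import Path

# Values accepted by /proc/sys/vm/drop_caches (Linux kernel sysctl docs):
DROP_PAGE_CACHE = 1   # free the page cache
DROP_SLAB = 2         # free reclaimable slab objects (dentries and inodes)
DROP_BOTH = 3         # free both

def drop_caches(mode: int = DROP_BOTH,
                path: str = "/proc/sys/vm/drop_caches") -> None:
    """Ask the kernel to drop clean caches (hypothetical helper).

    `path` is a parameter only so the function can be exercised against
    a scratch file; on a real node it should stay at its default.
    """
    if mode not in (DROP_PAGE_CACHE, DROP_SLAB, DROP_BOTH):
        raise ValueError(f"unsupported drop_caches mode: {mode}")
    # Only clean objects can be freed, so flush dirty pages first.
    if hasattr(os, "sync"):
        os.sync()
    Path(path).write_text(f"{mode}\n")

# Example (requires root on the node): drop_caches(DROP_SLAB)
```

A cronjob or a privileged DaemonSet could call such a helper on a schedule, although replacing or upgrading nodes addresses the underlying bug rather than its symptoms.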

Key Statistics & Figures

Slowest MySQL Pod initialization time
2120 seconds
This was observed for the katesql-n4sx0 instance compared to a faster Pod which initialized in 74 seconds.
Improved MySQL Pod initialization time
5 seconds
This was achieved after implementing fixes and testing a MySQL Pod restart.

Key Actionable Insights

1. Regularly monitor the number of memory cgroups on your Kubernetes nodes to identify potential leaks early.
   Monitoring can help prevent performance degradation over time, especially in systems with many short-lived tasks that may contribute to cgroup bloat.
2. Consider upgrading your Kubernetes and Linux kernel versions to benefit from performance improvements and bug fixes.
   Upgrading can resolve underlying issues that may not be immediately apparent and can lead to significant performance enhancements.
3. Implement a strategy for replacing older Kubernetes nodes periodically to maintain optimal performance.
   This proactive approach can help ensure that your infrastructure remains efficient and responsive, especially in production environments.
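
One way to monitor memory cgroups on a node is to compare the count the kernel reports in /proc/cgroups with the number of cgroup directories that are actually visible. The sketch below assumes a cgroup v1 layout with the memory controller mounted at /sys/fs/cgroup/memory. The function names and the ratio threshold are hypothetical and would need tuning per cluster. A reported count far above the visible count suggests cgroups that were removed but not yet freed by the kernel.

```python
import os

def reported_memory_cgroups(proc_cgroups_text: str) -> int:
    """Return num_cgroups for the memory controller.

    Expects the contents of /proc/cgroups, whose columns are:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "memory":
            return int(fields[2])
    raise ValueError("memory controller not found in /proc/cgroups")

def live_cgroup_dirs(root: str) -> int:
    """Count cgroup directories under `root`, including `root` itself."""
    return sum(1 for _ in os.walk(root))

def leak_suspected(reported: int, live: int, max_ratio: float = 10.0) -> bool:
    """Flag a node when the kernel tracks far more cgroups than are visible.

    `max_ratio` is an arbitrary illustrative threshold.
    """
    return reported > live * max_ratio

# Example on a node (paths assume cgroup v1):
#   with open("/proc/cgroups") as f:
#       reported = reported_memory_cgroups(f.read())
#   live = live_cgroup_dirs("/sys/fs/cgroup/memory")
#   print(reported, live, leak_suspected(reported, live))
```

Exporting both numbers as node metrics, rather than only the boolean, makes it easier to see the count grow over time on nodes that run many short-lived Pods.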

Common Pitfalls

1. Failing to monitor and manage memory cgroups can lead to performance degradation over time.
   As seen in the article, a buildup of memory cgroups can slow down processes significantly, making it crucial to implement monitoring strategies.
2. Neglecting to upgrade Kubernetes and Linux kernel versions can leave systems vulnerable to unresolved bugs.
   Upgrades often contain important fixes that can enhance performance and stability, so staying current is essential.

Related Concepts

Kubernetes Performance Optimization
MySQL Initialization Processes
Linux Kernel Memory Management
Cloud Infrastructure Management