Overview
The article follows ClickHouse engineers as they investigate unexplained CPU spikes in ClickHouse Cloud on GCP. Using eBPF tracing, perf profiling, and gdb stack traces, they traced the spikes to a hidden livelock in the Linux kernel’s memory management. The resulting fix was later challenged by a new kernel bug.
What You'll Learn
1. How to use eBPF tools for kernel debugging
2. Why understanding memory management in Linux is crucial for performance
3. How to create reproducible test cases for kernel-related issues
4. When to escalate kernel issues to cloud provider support
Prerequisites & Requirements
- Understanding of Linux kernel internals and memory management concepts
- Familiarity with debugging tools like eBPF and perf (optional)
Key Questions Answered
What caused the CPU spikes in ClickHouse Cloud on GCP?
The CPU spikes were caused by a hidden livelock in the Linux kernel's memory management system, specifically related to the handling of page faults and memory reclamation processes. This issue was exacerbated by high thread contention and the way the kernel managed memory under pressure.
How can eBPF be used to diagnose kernel issues?
eBPF can be used to trace system calls and monitor kernel behavior in real-time, allowing engineers to gather insights into performance bottlenecks and identify problematic areas in the kernel. This was crucial in diagnosing the livelock issue and understanding its impact on CPU usage.
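As a rough illustration of this kind of tracing, the sketch below builds bpftrace command lines from two widely documented one-liners: counting page faults per process and sampling kernel stacks at 99 Hz. The summary does not say which probes the engineers used, so the probe choices, the `PROGRAMS` table, and the `bpftrace_command` helper are assumptions made for illustration. The script only prints the command; running it requires root and an installed bpftrace.

```python
from __future__ import annotations

import shlex

# Common bpftrace one-liners. Which probes the ClickHouse engineers actually
# used is NOT stated in the summary; these are illustrative choices only.
PROGRAMS = {
    "page_faults_by_process": "software:faults:1 { @[comm] = count(); }",
    "kernel_stacks_99hz": "profile:hz:99 { @[kstack] = count(); }",
}


def bpftrace_command(name: str, seconds: int = 10) -> list[str]:
    """Return an argv list that runs one program and exits after `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # An interval probe that calls exit() bounds the tracing window.
    program = PROGRAMS[name] + f" interval:s:{seconds} {{ exit(); }}"
    return ["bpftrace", "-e", program]


if __name__ == "__main__":
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(bpftrace_command("page_faults_by_process")))
```

Bounding the trace with an interval probe keeps the overhead predictable on a host that is already under CPU pressure.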
Why did the issue only occur on GCP and not on AWS or Azure?
The issue was specific to GCP because of differences in workload patterns and in how GCP's Container-Optimized OS managed memory and kernel operations. The kernel version and its configuration played a significant role in the manifestation of the bug.
What steps were taken to investigate the kernel bug?
The investigation involved using tools like gdb for stack traces, perf for profiling CPU usage, and bpftrace for tracing system calls. Engineers compiled a runbook to standardize diagnostic methods and gathered extensive data to identify the root cause of the performance issues.
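One possible way to capture such a runbook in a reviewable form is as data: each step names the tool, its purpose, and the exact command. The sketch below is not the engineers' actual runbook. The `Step` and `build_runbook` names, the step order, and the specific options are assumptions, although the gdb, perf, and bpftrace invocations themselves are standard ones.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    purpose: str
    argv: list[str]


def build_runbook(pid: int, seconds: int = 10) -> list[Step]:
    """Return diagnostic steps for a hung server process (hypothetical layout)."""
    if pid <= 0 or seconds <= 0:
        raise ValueError("pid and seconds must be positive")
    return [
        Step("gdb", "Dump stack traces of every thread in the process",
             ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]),
        Step("perf", "Sample stacks system-wide to see where CPU time goes",
             ["perf", "record", "-F", "99", "-a", "-g", "--", "sleep", str(seconds)]),
        Step("bpftrace", "Count system calls per process (stop with Ctrl-C)",
             ["bpftrace", "-e",
              "tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }"]),
    ]


if __name__ == "__main__":
    # 12345 is a placeholder PID, not a value from the article.
    for number, step in enumerate(build_runbook(pid=12345), start=1):
        print(f"{number}. [{step.tool}] {step.purpose}")
        print(f"   {shlex.join(step.argv)}")
```

Keeping the commands as argv lists avoids quoting mistakes when an on-call engineer copies them under pressure.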
Key Statistics & Figures
Number of threads waiting to enter critical sections
More than a thousand
This high number of threads contributed to the performance degradation and unresponsiveness of ClickHouse instances.
CPU usage during the issue
30 CPUs fully occupied
This indicates that all available CPU resources were consumed by kernel processes handling page faults.
Page faults observed during testing
Only 90 in 10 seconds
Despite high CPU usage, the page faults were occurring at an unusually slow rate, indicating inefficiency in memory management.
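The figures above can be combined into a back-of-the-envelope cost per page fault. The sketch assumes that all 30 CPUs were busy with page-fault handling for the whole 10-second window, which is a simplification of what the summary reports; the function name is made up for this example.

```python
def cpu_seconds_per_fault(busy_cpus: int, window_s: float, faults: int) -> float:
    """CPU time consumed per completed page fault over an observation window."""
    if faults <= 0:
        raise ValueError("faults must be positive")
    return busy_cpus * window_s / faults


if __name__ == "__main__":
    busy_cpus, window_s, faults = 30, 10, 90  # figures quoted in the summary
    rate = faults / window_s
    cost = cpu_seconds_per_fault(busy_cpus, window_s, faults)
    print(f"{rate:.1f} faults/s, {cost:.2f} CPU-seconds per fault")
    # prints: 9.0 faults/s, 3.33 CPU-seconds per fault
```

A page fault normally completes in a small fraction of a second, so several CPU-seconds per fault points to CPUs spinning rather than doing useful work, which is consistent with a livelock.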
Technologies & Tools
Tools
eBPF
Used for tracing and monitoring kernel behavior to diagnose performance issues.
Tools
perf
Utilized for profiling CPU usage and identifying where CPU time is spent during the performance issues.
Tools
gdb
Used for collecting stack traces from the ClickHouse server process during hangs.
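With more than a thousand threads in a dump, reading stack traces by hand is impractical. A minimal sketch of one way to summarize them is to count threads by their innermost frame (`#0`) in the output of gdb's `thread apply all bt`. The sample text below is synthetic and not taken from the article, and the regular expression is a simplification that may not handle every frame format gdb can print.

```python
import re
from collections import Counter

# Matches the innermost frame of each thread, with or without an address:
#   "#0  0x00007f... in func () from lib"   or   "#0  func () at file.c:1"
FRAME0 = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)", re.MULTILINE)

# Synthetic example output, invented for illustration only.
SAMPLE = """\
Thread 3 (Thread 0x7f0000000003 (LWP 103)):
#0  0x00007f0000001000 in __lll_lock_wait () from /lib/libc.so.6
#1  0x00007f0000002000 in worker () at worker.cpp:10

Thread 2 (Thread 0x7f0000000002 (LWP 102)):
#0  0x00007f0000001000 in __lll_lock_wait () from /lib/libc.so.6

Thread 1 (Thread 0x7f0000000001 (LWP 101)):
#0  main () at main.cpp:5
"""


def top_frames(gdb_output: str) -> Counter:
    """Count threads by the function name in their innermost frame."""
    return Counter(FRAME0.findall(gdb_output))


if __name__ == "__main__":
    for function, threads in top_frames(SAMPLE).most_common():
        print(f"{threads:5d}  {function}")
```

A frame that dominates the counts, such as a lock-wait function, shows where threads are piling up before any deeper analysis.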
Key Actionable Insights
1. Utilize eBPF for real-time tracing of kernel behavior to identify performance bottlenecks.
eBPF provides powerful capabilities for monitoring and debugging kernel operations, making it an essential tool for engineers dealing with performance issues in production environments.
2. Create a comprehensive runbook for on-call engineers to standardize troubleshooting steps.
Having a well-documented runbook allows for quicker response times during incidents and ensures that all engineers follow the same diagnostic procedures, reducing downtime.
3. Investigate kernel-related issues with a focus on memory management and reclaiming processes.
Understanding how the kernel handles memory can provide insights into performance issues, especially under high load or contention scenarios.
Common Pitfalls
1. Failing to recognize the difference between symptoms and root causes during debugging.
Many engineers may focus on immediate symptoms like high CPU usage without investigating the underlying issues, which can lead to misdiagnosis and ineffective solutions.
2. Over-reliance on a single debugging tool.
Using only one tool like gdb or perf can limit visibility into the problem. A combination of tools is often necessary to get a complete picture of the issue.
Related Concepts
Linux Kernel Memory Management
eBPF And Its Applications
Performance Optimization Techniques
Container Orchestration And Management