Overview
This article is the second part of a series on debugging deadlocks in the PininfoService during an upgrade to Ubuntu 18 (U18). It describes how a deadlock caused by the GlobalCPUExecutor was identified and the steps taken to stabilize the service, including runtime parameter tuning and disabling a problematic feature.
What You'll Learn
1. How to analyze deadlock issues in a multi-threaded service
2. Why runtime parameter tuning is crucial for system stability
3. How to use GDB for debugging stuck processes
4. When to apply the Five Whys analysis for root cause investigation
Prerequisites & Requirements
- Understanding of multi-threaded programming concepts
- Familiarity with debugging tools like GDB and tcpdump
Key Questions Answered
What caused the QPS drop to zero in the PininfoService?
QPS dropped to zero because of a deadlock: the GlobalCPUExecutor was waiting on the ThriftClientPool, while the ThriftClientPool was in turn waiting on the GlobalCPUExecutor. With each side blocked on the other, no requests could be processed.
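The circular wait described above can be reproduced in miniature. The following is a minimal sketch, not the service's actual code: it assumes a single-worker executor standing in for the GlobalCPUExecutor, and the names `remove_client_status` and `pool_cleanup` are hypothetical. A task running on the executor hands follow-up work to the same executor and then blocks on it, so the only worker is waiting on work that only it could run. A timeout is used so the demonstration terminates instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def demo_circular_wait(timeout_s=0.2):
    """Return "stuck" when a task blocks on work queued behind itself."""
    # One worker stands in for a fully occupied CPU executor (assumption).
    cpu_executor = ThreadPoolExecutor(max_workers=1)

    def pool_cleanup():
        # Hypothetical client-pool work that needs an executor thread.
        return "cleaned"

    def remove_client_status():
        # Runs on the executor's only worker, then waits for more work
        # that has been queued on the very same executor.
        inner = cpu_executor.submit(pool_cleanup)
        try:
            return inner.result(timeout=timeout_s)
        except TimeoutError:
            # Without the timeout this wait would never return.
            return "stuck"

    outcome = cpu_executor.submit(remove_client_status).result()
    # Once the outer task gives up, the worker is free to drain the queue.
    cpu_executor.shutdown(wait=True)
    return outcome


if __name__ == "__main__":
    print(demo_circular_wait())
```

In the sketch the timeout breaks the cycle. In a real deadlock neither side gives up, which is consistent with the service processing no requests at all.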
How was the deadlock in the GlobalCPUExecutor identified?
The deadlock was identified by attaching GDB to the PininfoService and inspecting its threads. This revealed that the GlobalCPUExecutor (GCPU) thread was blocked while waiting to remove a ClientStatus, an operation that depended on the ThriftClientPool, which was itself blocked.
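One common way to inspect every thread of a stuck process is to run GDB in batch mode and print all backtraces. The sketch below is an illustration under assumptions, not the exact procedure used in the article: it wraps the standard `gdb -p <pid> -batch -ex "thread apply all bt"` invocation, the helper names are hypothetical, and it assumes GDB is installed and the caller has permission to attach to the target process.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a GDB command line that dumps backtraces for all threads."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # run the commands, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def dump_all_threads(pid):
    """Run GDB against `pid` and return whatever it printed."""
    completed = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect the output even on failure
    )
    return completed.stdout
```

In the resulting dump, threads parked in lock or condition-variable waits are the first candidates to examine, along with the frames that show what each one is waiting for.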
What runtime configurations were optimized during the U18 rollout?
The optimized runtime configurations included disabling the dynamic CPUThreadPoolExecutor, which resolved the deadlock issues and stabilized the memory usage, allowing the service to handle requests without blocking.
What tools were used for memory usage debugging?
Memory usage debugging was performed using BPF tools, jemalloc, and tcmalloc to obtain heap dumps and analyze memory consumption patterns, which indicated that the GlobalCPUExecutor was consuming excessive heap memory.
Key Statistics & Figures
Active requests
130K
The number of active requests exceeded the maximum request threshold of 65K, leading to load shedding.
Memory usage increase
200GB
After disabling the dynamic CPUThreadPoolExecutor, memory usage stabilized at approximately 200GB.
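The load shedding behind the active-request figure above can be pictured as a guard on the number of in-flight requests. This is a minimal sketch under assumptions, not the service's implementation: the class and method names are hypothetical, and the default limit of 65,000 only approximates the "65K" threshold quoted above.

```python
import threading


class LoadShedder:
    """Rejects new requests once too many are already in flight."""

    def __init__(self, max_active=65_000):
        # Default approximates the 65K threshold; exact value is assumed.
        self._max_active = max_active
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Admit a request if under the limit; return False to shed it."""
        with self._lock:
            if self._active >= self._max_active:
                return False
            self._active += 1
            return True

    def release(self):
        """Mark one admitted request as finished."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self):
        with self._lock:
            return self._active
```

If admitted requests never finish, as would happen when their handlers are deadlocked, `release` is never called and every later request is shed. That is one plausible way a stuck service ends up shedding all traffic.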
Technologies & Tools
Debugging Tool
GDB
Used to attach to the stuck PininfoService process and analyze thread states.
Network Analysis Tool
tcpdump
Used to capture outgoing packet traces to analyze load shedding.
Memory Management
jemalloc
Used for heap profiling to analyze memory usage patterns.
Memory Management
tcmalloc
Also used for heap profiling, providing similar functionalities to jemalloc.
Key Actionable Insights
1. Utilize the Five Whys framework for root cause analysis in system issues. This method helps to systematically uncover the underlying causes of problems, ensuring that solutions address the root rather than just the symptoms.
2. Regularly monitor thread states and resource usage in multi-threaded applications. Using tools like GDB and tcpdump can help identify performance bottlenecks and deadlocks early, allowing for proactive resolution before they impact service availability.
3. Consider disabling dynamic features in production if they introduce instability. In this case, disabling the dynamic CPUThreadPoolExecutor resolved deadlock issues, highlighting the importance of evaluating new features against system stability.
4. Document and analyze heap usage patterns to identify potential memory leaks. Using tools like jemalloc and tcmalloc can provide insights into memory allocation, helping to optimize resource usage and prevent performance degradation.
Common Pitfalls
1. Failing to monitor thread states can lead to undetected deadlocks. Without regular monitoring, issues may escalate, causing significant service disruptions. Implementing proactive monitoring can help catch these issues early.
2. Overlooking the impact of new features on system stability. New features may introduce unforeseen complexities. It's crucial to evaluate their effects in a controlled environment before rolling them out to production.
Related Concepts
Root Cause Analysis Techniques
Multi-threaded Programming Best Practices
Memory Management Strategies in High-Load Systems