Debugging Deadlock in PininfoService Ubuntu18 Upgrade: Part 2 of 2

Pinterest Engineering
8 min read · intermediate

Overview

This article is the second of a two-part series on debugging deadlocks in PininfoService during its upgrade to Ubuntu 18. It describes how a deadlock involving the GlobalCPUExecutor was identified and how the service was stabilized by tuning runtime parameters and disabling a problematic feature.

What You'll Learn

1
How to analyze deadlock issues in a multi-threaded service
2
Why runtime parameter tuning is crucial for system stability
3
How to use GDB for debugging stuck processes
4
When to apply the Five Whys analysis for root cause investigation

Prerequisites & Requirements

  • Understanding of multi-threaded programming concepts
  • Familiarity with debugging tools like GDB and tcpdump

Key Questions Answered

What caused the QPS drop to zero in the PininfoService?
QPS dropped to zero because of a deadlock: the GlobalCPUExecutor was waiting on the ThriftClientPool while the ThriftClientPool was waiting on the GlobalCPUExecutor. With each side blocked on the other, no requests could be processed.
How was the deadlock in the GlobalCPUExecutor identified?
The deadlock was identified by attaching GDB to PininfoService and inspecting its threads. This showed that the GlobalCPUExecutor (GCPU) thread was blocked while waiting to remove a ClientStatus, an operation that depended on the ThriftClientPool, which was itself blocked.
What runtime configurations were optimized during the Ubuntu 18 (U18) rollout?
The runtime configuration changes included disabling the dynamic CPUThreadPoolExecutor, which resolved the deadlock and stabilized memory usage so that the service could handle requests without blocking.
What tools were used for memory usage debugging?
Memory usage was debugged with BPF tools and with jemalloc and tcmalloc, which were used to obtain heap dumps and analyze memory consumption patterns. The dumps indicated that the GlobalCPUExecutor was consuming excessive heap memory.
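
The mutual wait described in the first answer above can be reproduced in miniature. What follows is a minimal Python sketch, not the PininfoService code: a standard-library ThreadPoolExecutor stands in for the GlobalCPUExecutor, and the hypothetical handler and cleanup functions stand in for a request handler and the work it waits on. With a single worker, the handler holds the only thread while waiting for work that can run only on that same thread. A timeout is used so that the demonstration reports the stall instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_demo(workers: int, timeout_s: float = 0.5) -> str:
    """Run a task that waits on a second task queued on the same pool.

    Returns "cleaned" if the inner task ran in time, or "stuck" if the
    outer task timed out while waiting for it.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def cleanup() -> str:
            # Hypothetical stand-in for work scheduled back onto the executor.
            return "cleaned"

        def handler() -> str:
            # The handler holds a worker thread while it waits for cleanup(),
            # which needs a free worker from the same pool to make progress.
            inner = pool.submit(cleanup)
            try:
                return inner.result(timeout=timeout_s)
            except FutureTimeout:
                return "stuck"

        return pool.submit(handler).result()
```

Under these assumptions, run_demo(1) returns "stuck" and run_demo(2) returns "cleaned". Without the timeout, the single-worker case would wait forever, which is the shape of the circular wait described above: neither side can proceed until the other does.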

Key Statistics & Figures

Active requests
130K
Active requests reached twice the maximum request threshold of 65K, which triggered load shedding.
Memory usage increase
200GB
After disabling the dynamic CPUThreadPoolExecutor, memory usage stabilized at approximately 200GB.
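
The load-shedding threshold in the first figure above can be illustrated with a small admission counter. This is a sketch of the general idea only; the source does not describe how PininfoService implements its limit, and the class and method names here are hypothetical. The default of 65,000 mirrors the threshold quoted above.

```python
import threading


class AdmissionController:
    """Reject new requests once too many are already in flight."""

    def __init__(self, max_active: int = 65_000) -> None:
        self._max_active = max_active
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Count the request and return True, or return False to shed it."""
        with self._lock:
            if self._active >= self._max_active:
                return False
            self._active += 1
            return True

    def finish(self) -> None:
        """Mark one admitted request as completed."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self) -> int:
        with self._lock:
            return self._active
```

In a design like this, a deadlock is especially damaging: if admitted requests never complete, finish() is never called, the counter stays at the limit, and every new request is shed.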

Technologies & Tools

Debugging Tool
GDB
Used to attach to the stuck PininfoService process and analyze thread states.
Network Analysis Tool
tcpdump
Used to capture outgoing packet traces to analyze load shedding.
Memory Management
jemalloc
Used for heap profiling to analyze memory usage patterns.
Memory Management
tcmalloc
Also used for heap profiling, with functionality similar to jemalloc.
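
As an illustration of the GDB step listed above, the sketch below builds a non-interactive GDB invocation that attaches to a process and prints a backtrace for every thread. The GDB options used (-p, -batch, -ex) and the "thread apply all bt" command are standard, but the Python helper names are hypothetical and the exact commands the authors ran are not given in the source.

```python
import shutil
import subprocess
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a batch-mode GDB command that prints every thread's stack."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_thread_stacks(pid: int) -> str:
    """Attach to pid, print all backtraces, then exit GDB.

    Requires gdb on PATH and permission to ptrace the target process.
    The target is paused while GDB is attached.
    """
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb is not installed or not on PATH")
    done = subprocess.run(
        gdb_backtrace_command(pid), capture_output=True, text=True, check=False
    )
    return done.stdout
```

In the resulting dump, threads that are all parked in the same wait or lock call are the usual starting point when looking for a circular wait.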

Key Actionable Insights

1
Use the Five Whys framework for root cause analysis of system issues.
This method helps to systematically uncover the underlying causes of problems, ensuring that solutions address the root rather than just the symptoms.
2
Regularly monitor thread states and resource usage in multi-threaded applications.
Tools like GDB and tcpdump can help identify performance bottlenecks and deadlocks early, allowing for proactive resolution before they impact service availability.
3
Consider disabling dynamic features in production if they introduce instability.
In this case, disabling the dynamic CPUThreadPoolExecutor resolved deadlock issues, highlighting the importance of evaluating new features against system stability.
4
Document and analyze heap usage patterns to identify potential memory leaks.
Heap profiling with tools like jemalloc and tcmalloc shows where memory is being allocated, which helps optimize resource usage and prevent performance degradation.
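
The heap-analysis insight above rests on comparing heap snapshots to see which call sites grew. The sketch below shows that idea using Python's built-in tracemalloc module as a stand-in; it is not jemalloc or tcmalloc, and the helper name is hypothetical.

```python
import tracemalloc
from typing import Callable, List, Tuple


def top_allocation_growth(
    workload: Callable[[], object], limit: int = 3
) -> List[Tuple[str, int]]:
    """Return (source location, bytes grown) for the fastest-growing call sites."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        retained = workload()  # keep the result alive until the second snapshot
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    del retained
    # compare_to() sorts entries by the size of the change, largest first.
    diffs = after.compare_to(before, "lineno")
    return [(str(d.traceback[0]), d.size_diff) for d in diffs[:limit]]
```

For example, passing a workload that builds and returns a large list reports the line that performed the allocations at the top of the result. Native heap profilers answer the same question for C and C++ services.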

Common Pitfalls

1
Failing to monitor thread states can lead to undetected deadlocks.
Without regular monitoring, issues may escalate, causing significant service disruptions. Implementing proactive monitoring can help catch these issues early.
2
Overlooking the impact of new features on system stability.
New features may introduce unforeseen complexities. It's crucial to evaluate their effects in a controlled environment before rolling them out to production.
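
One lightweight form of the monitoring recommended in the first pitfall is a heartbeat probe: submit a trivial task to the executor and raise an alert if it does not complete within a deadline. The sketch below uses Python's standard-library ThreadPoolExecutor; the function names and timeouts are hypothetical and are not taken from the source.

```python
import threading
from concurrent.futures import Executor, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Tuple


def executor_is_responsive(pool: Executor, timeout_s: float = 1.0) -> bool:
    """Return True if the pool runs a trivial task within timeout_s seconds."""
    probe = pool.submit(lambda: True)
    try:
        return probe.result(timeout=timeout_s)
    except FutureTimeout:
        probe.cancel()  # drop the queued probe so it does not pile up
        return False


def demo() -> Tuple[bool, bool]:
    """Probe a one-worker pool while its worker is blocked, then after release."""
    gate = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        blocked = pool.submit(gate.wait)  # occupy the only worker
        while_blocked = executor_is_responsive(pool, timeout_s=0.2)
        gate.set()  # let the blocked task finish
        blocked.result()
        after_release = executor_is_responsive(pool, timeout_s=0.2)
    return while_blocked, after_release
```

A probe like this detects that an executor has stopped making progress, but not why; a stack dump from a debugger is still needed to find the threads involved.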

Related Concepts

Root Cause Analysis Techniques
Multi-threaded Programming Best Practices
Memory Management Strategies in High-Load Systems