Overview
This article describes how a deadlock was debugged during the upgrade of PininfoService from Ubuntu 14 to Ubuntu 18. It covers the setup of test and canary environments, the design of tests with control variables, and the separation of root and leaf layers to locate the source of the problem.
What You'll Learn
1. How to set up a test environment for debugging service upgrades
2. Why separating root and leaf layers can help identify issues in service architecture
3. How to design tests using control variables to isolate issues
Prerequisites & Requirements
- Understanding of service architecture concepts, particularly root and leaf layers
- Familiarity with debugging tools like GDB (optional)
Key Questions Answered
What steps were taken to upgrade PininfoService from Ubuntu 14 to Ubuntu 18?
The upgrade process involved changing the service code for Ubuntu 18 compatibility, deploying the new service build with dark traffic, and performing an in-place upgrade of the existing instances. This method minimized disruption to service availability.
What issues were observed during the testing of the new service build?
During testing, two main issues were observed: a drop in queries per second (QPS) to 0 and inconsistent memory usage, with some hosts using less than 50GB while others spiked to around 500GB, leading to out-of-memory (OOM) errors.
How did the team isolate the source of the deadlock issue?
The team conducted experiments by isolating root-only and leaf-only nodes to determine which layer was responsible for the issues. Results indicated that the leaf logic was likely causing the process crashes and memory inconsistencies.
What is the significance of using control variables in testing?
Control variables are the factors held constant across test runs. Keeping them fixed while changing one factor at a time lets the team isolate the specific factor that might be causing an issue, and makes it easier to correlate a configuration change with an observed problem.
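The control-variable approach described above can be sketched as a small experiment generator. This is a minimal illustration, not the team's actual tooling: the factor names (`os`, `build`, `layer`) and their values are hypothetical placeholders chosen for the example.

```python
from typing import Dict, List

# Hypothetical known-good configuration (the control). Factor names and
# values are illustrative assumptions, not taken from the original work.
BASELINE: Dict[str, str] = {
    "os": "ubuntu14",
    "build": "old",
    "layer": "root+leaf",
}

# Hypothetical alternative value to try for each factor.
ALTERNATIVES: Dict[str, str] = {
    "os": "ubuntu18",
    "build": "new",
    "layer": "leaf-only",
}


def one_factor_experiments(
    baseline: Dict[str, str], alternatives: Dict[str, str]
) -> List[Dict[str, str]]:
    """Return the control config plus one variant per factor.

    Each variant differs from the baseline in exactly one factor, so any
    change in behavior can be attributed to that factor.
    """
    experiments = [dict(baseline)]  # first entry is the unchanged control
    for factor, value in alternatives.items():
        variant = dict(baseline)
        variant[factor] = value
        experiments.append(variant)
    return experiments


def changed_factors(baseline: Dict[str, str], variant: Dict[str, str]) -> List[str]:
    """List the factors whose value differs from the baseline."""
    return [name for name in baseline if baseline[name] != variant[name]]


if __name__ == "__main__":
    for config in one_factor_experiments(BASELINE, ALTERNATIVES):
        label = changed_factors(BASELINE, config) or ["control"]
        print(label, config)
```

Changing one factor per run costs more runs than changing everything at once, but each failing run then points at a single candidate cause.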
Key Statistics & Figures
Memory usage on test hosts
Some hosts were stuck at <50GB while others increased to ~500GB
This inconsistency in memory usage was a significant factor leading to out-of-memory errors during testing.
QPS drop
QPS dropped to 0
This occurred during the testing phase of the new service build, indicating a critical issue that needed to be addressed.
Instances upgraded
>10,000 instances
The team successfully upgraded more than 10,000 instances of their stateful services from Ubuntu 14 to Ubuntu 18.
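The memory and QPS symptoms listed above could be flagged per host with a check along these lines. It is only a sketch: the `HostSample` type, the sample values, and the 450 GB upper threshold are assumptions made for illustration. Only the "<50GB", "~500GB", and "QPS dropped to 0" figures come from the article.

```python
from typing import List, NamedTuple


class HostSample(NamedTuple):
    """One hypothetical metrics reading for a test host."""
    host: str
    memory_gb: float
    qps: float


def flag_host(
    sample: HostSample, low_gb: float = 50.0, high_gb: float = 450.0
) -> List[str]:
    """Return symptom flags for a host.

    low_gb mirrors the "<50GB" figure from the article; high_gb is an
    assumed cutoff placed below the reported ~500GB peak.
    """
    flags = []
    if sample.memory_gb < low_gb:
        flags.append("memory_low")
    elif sample.memory_gb >= high_gb:
        flags.append("memory_near_oom")
    if sample.qps == 0:
        flags.append("qps_zero")
    return flags


if __name__ == "__main__":
    # Made-up readings, for demonstration only.
    samples = [
        HostSample("test-host-a", 40.0, 0.0),
        HostSample("test-host-b", 500.0, 120.0),
        HostSample("test-host-c", 200.0, 300.0),
    ]
    for s in samples:
        print(s.host, flag_host(s))
```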
Technologies & Tools
Debugging Tool
GDB
Used for debugging running processes to identify issues during the upgrade.
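The article does not say which GDB commands the team ran. As one common way to inspect a suspected deadlock in a running process, the sketch below builds a batch-mode GDB invocation that attaches to a process and prints a backtrace for every thread. The helper name `gdb_backtrace_command` and the example PID are hypothetical; `-p`, `-batch`, and `-ex` are standard GDB options.

```python
import shlex
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a GDB command that dumps all thread backtraces for a process.

    -p attaches to the running process, -ex runs the given GDB command,
    and -batch makes GDB exit (and detach) once the commands finish.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


if __name__ == "__main__":
    # 12345 is a placeholder PID; this only prints the command line
    # instead of running it, so GDB does not need to be installed.
    print(shlex.join(gdb_backtrace_command(12345)))
```

Attaching pauses the target process while GDB is connected, so this is better suited to a test or canary host than to an instance serving production traffic. In a deadlock, the backtraces typically show several threads blocked in lock-acquisition calls.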
Key Actionable Insights
1. Implementing a test environment can significantly reduce risks during service upgrades. By using a test environment, developers can validate new builds under controlled conditions before full deployment, minimizing potential disruptions to live services.
2. Separating root and leaf layers in service architecture can help pinpoint issues more effectively. This approach allows engineers to test each layer independently, making it easier to identify where problems originate, thus streamlining the debugging process.
3. Utilizing control variables in experiments can enhance the reliability of test results. By controlling for external factors, teams can better understand the impact of specific changes, leading to more accurate conclusions about system behavior.
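The layer-separation insight above can be illustrated by tallying failures per isolated group. This is a sketch under stated assumptions: the group labels and the observation data are made up for the example, and the article does not describe how the team recorded its results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by_layer(
    observations: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Compute the fraction of failed runs for each isolated layer group.

    Each observation is (group_label, failed), where failed is True if the
    run reproduced the problem (for example a crash or OOM).
    """
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for group, failed in observations:
        totals[group] += 1
        if failed:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Hypothetical results: failures reproduce only on leaf-only nodes.
    runs = [
        ("root-only", False),
        ("root-only", False),
        ("leaf-only", True),
        ("leaf-only", True),
    ]
    for group, rate in failure_rate_by_layer(runs).items():
        print(f"{group}: {rate:.0%} of runs failed")
```

A clear gap between the two groups, as in the made-up data here, suggests which layer to investigate first. It does not prove the cause on its own.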
Common Pitfalls
1. Failing to isolate different layers of service architecture can lead to confusion during debugging.
Without clear separation, it becomes challenging to determine which part of the system is causing issues, potentially leading to wasted time and resources in troubleshooting.
2. Not using control variables in tests can result in inconclusive results.
When multiple factors are altered simultaneously, it becomes difficult to ascertain which change led to a specific outcome, complicating the debugging process.
Related Concepts
Service Architecture Patterns
Debugging Techniques
Memory Management In Distributed Systems