Overview
This article details the challenges faced by Pinterest during the migration of its search infrastructure, Manas, to Kubernetes. It highlights a specific performance issue where one in every million search requests took significantly longer due to an interaction between the memory-intensive search system and a monitoring process.
What You'll Learn
1. How to identify and resolve performance bottlenecks in Kubernetes environments
2. Why memory management is critical in high-performance applications
3. How to effectively debug latency issues in distributed systems
Prerequisites & Requirements
- Understanding of Kubernetes and distributed systems
- Familiarity with performance profiling tools such as perf (optional)
Key Questions Answered
What performance issue did Pinterest encounter during the migration to Kubernetes?
Pinterest faced a significant performance issue where one in every million search requests took 100 times longer than usual, leading to timeouts in their search infrastructure, Manas. This was traced back to an interaction with cAdvisor, a monitoring tool that caused memory contention.
How did Pinterest identify the root cause of the latency spikes?
The team used a combination of clearbox and blackbox debugging techniques, including profiling CPU and memory usage and isolating the Manas pod from other processes. They eventually identified cAdvisor as the culprit after disabling it eliminated the latency spikes.
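The blackbox part of that process can be sketched as an ablation loop: disable one co-located process at a time, rerun the workload, and see whether the spikes disappear. The sketch below is illustrative only. The component names other than cAdvisor, the `simulated_trial` function, and the 1000 ms threshold are hypothetical stand-ins, not details from Pinterest's investigation.

```python
from typing import Callable, Iterable, List


def tail_spike_rate(latencies_ms: Iterable[float], threshold_ms: float) -> float:
    """Return the fraction of requests slower than threshold_ms."""
    samples = list(latencies_ms)
    if not samples:
        return 0.0
    return sum(1 for x in samples if x > threshold_ms) / len(samples)


def find_culprits(
    suspects: List[str],
    run_trial: Callable[[str], List[float]],
    threshold_ms: float,
) -> List[str]:
    """Disable one suspect per trial; keep those whose removal eliminates spikes."""
    culprits = []
    for name in suspects:
        latencies = run_trial(name)  # latencies observed with `name` disabled
        if tail_spike_rate(latencies, threshold_ms) == 0.0:
            culprits.append(name)
    return culprits


def simulated_trial(disabled: str) -> List[float]:
    """Hypothetical workload: one 5 s spike unless 'cadvisor' is disabled."""
    baseline = [20.0] * 999
    spike = [] if disabled == "cadvisor" else [5000.0]
    return baseline + spike


suspects = ["log-shipper", "cadvisor", "node-exporter"]  # hypothetical names
print(find_culprits(suspects, simulated_trial, threshold_ms=1000.0))
```

With a spike rate near one in a million, each real trial needs to run long enough to expect several spikes in the baseline; otherwise a quiet trial proves nothing.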
What changes were made to resolve the performance issue?
Pinterest disabled cAdvisor's working set size estimation feature across all PinCompute nodes. This single configuration change removed the latency spikes observed during the migration.
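For readers unfamiliar with working set size estimation, the sketch below shows one way such an estimate can be derived on Linux: summing the `Referenced:` fields in `/proc/<pid>/smaps`. The article summary does not describe cAdvisor's internals, so treating this as the mechanism involved is an assumption. The sample text and the function name `referenced_bytes` are made up for illustration.

```python
import re

# Matches lines such as "Referenced:          64 kB" in smaps output.
_REFERENCED = re.compile(r"^Referenced:\s+(\d+)\s+kB$", re.MULTILINE)


def referenced_bytes(smaps_text: str) -> int:
    """Sum the Referenced fields of every mapping and return bytes."""
    return sum(int(kb) for kb in _REFERENCED.findall(smaps_text)) * 1024


# Hypothetical excerpt in the format of /proc/<pid>/smaps.
SAMPLE = """\
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                 128 kB
Referenced:           64 kB
7f0000021000-7f0000040000 r--p 00000000 08:01 1234 /usr/lib/libexample.so
Size:                124 kB
Rss:                  40 kB
Referenced:            8 kB
"""

print(referenced_bytes(SAMPLE))
```

To measure what was touched during an interval, a tool also has to reset the referenced bits, which on Linux is done by writing to `/proc/<pid>/clear_refs`. Both the read and the reset walk the process's page tables, so the cost grows with the size of the address space. That is a plausible reason a memory-intensive service would be hit hardest, though the source does not confirm the exact mechanism.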
Why is memory management important in high-performance applications?
Memory management is crucial in high-performance applications because inefficient memory handling can lead to significant latency spikes and performance degradation. In this case, the memory-intensive nature of the search system exacerbated the issue caused by cAdvisor's monitoring.
Key Statistics & Figures
Search request latency
100x longer than usual for one in every million requests
This significant increase in latency was a critical issue during the migration of Pinterest's search infrastructure.
Max serving latencies
up to 5 seconds
Expected normal max latencies were under 60ms, indicating a severe performance regression.
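A quick calculation puts these figures in context. The sketch below uses the 5 second and 60 ms values quoted above; the traffic rate of 10,000 requests per second is a hypothetical number chosen only to show how a one-in-a-million event adds up.

```python
def slowdown_factor(observed_ms: float, baseline_ms: float) -> float:
    """Ratio between the observed worst case and the expected worst case."""
    return observed_ms / baseline_ms


def expected_slow_requests(qps: int, seconds: int, one_in: int) -> float:
    """Expected number of slow requests when 1 in `one_in` requests is slow."""
    return qps * seconds / one_in


# 5 s observed versus a normal maximum under 60 ms.
factor = slowdown_factor(5000.0, 60.0)

# Hypothetical traffic: 10,000 requests per second for one day.
slow_per_day = expected_slow_requests(qps=10_000, seconds=86_400, one_in=1_000_000)

print(round(factor, 1), slow_per_day)
```

Because 60 ms is an upper bound on normal maximum latency, the factor of about 83 is a lower bound on the slowdown, which is consistent with the "100x" figure. At the assumed traffic level, a one-in-a-million failure would still affect hundreds of requests per day.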
Technologies & Tools
Orchestration
Kubernetes
Used to manage the deployment and scaling of the Manas search infrastructure.
Monitoring
cAdvisor
Used for exporting container-level metrics, but its working set size estimation caused memory contention with the memory-intensive Manas process.
Profiling
perf
Utilized for CPU and memory profiling to identify performance bottlenecks.
Key Actionable Insights
1. Implement rigorous performance testing before and after migrating to new infrastructure. This ensures that potential issues are identified early and can be resolved before they affect users.
2. Use profiling tools to monitor CPU and memory usage during high-load scenarios. Profiling can help pinpoint bottlenecks and identify processes that interfere with application performance, as cAdvisor did in this case.
3. Consider the impact of monitoring tools on application performance. Monitoring tools can introduce overhead, especially alongside memory-intensive applications, so assess their configuration and impact during deployment.
Common Pitfalls
1. Relying too heavily on monitoring tools without understanding their impact on performance.
This can lead to significant performance regressions, as seen with cAdvisor, whose working set size estimation caused memory contention and latency spikes.
2. Failing to conduct thorough performance testing during infrastructure migrations.
Without proper testing, critical performance issues may go unnoticed until they affect end users and degrade service quality.
Related Concepts
Kubernetes Architecture and Best Practices
Performance Profiling Techniques
Memory Management in Distributed Systems