Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes

Pinterest Engineering


10 min read • advanced


Envoy, Kubernetes

Overview

This article details the challenges Pinterest faced while migrating its search infrastructure, Manas, to Kubernetes. It focuses on one performance issue: one in every million search requests took roughly 100 times longer than usual, caused by an interaction between the memory-intensive search system and a monitoring process.

What You'll Learn

1. How to identify and resolve performance bottlenecks in Kubernetes environments

2. Why memory management is critical in high-performance applications

3. How to effectively debug latency issues in distributed systems

Prerequisites & Requirements

Understanding of Kubernetes and distributed systems
Familiarity with performance profiling tools like perf (optional)

Key Questions Answered

What performance issue did Pinterest encounter during the migration to Kubernetes?

Pinterest faced a significant performance issue where one in every million search requests took 100 times longer than usual, leading to timeouts in their search infrastructure, Manas. This was traced back to an interaction with cAdvisor, a monitoring tool that caused memory contention.

How did Pinterest identify the root cause of the latency spikes?

The team combined clearbox and blackbox debugging techniques, including profiling CPU and memory usage and isolating the Manas pod from other processes. They identified cAdvisor as the culprit when disabling it eliminated the latency spikes.
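The blackbox part of this process can be thought of as an elimination search: disable one co-located process at a time and check whether the worst-case latency returns to normal. The sketch below illustrates only that idea. The process names, the simulated measurements, and the 60 ms budget used as a threshold are assumptions for illustration, not Pinterest's actual tooling.

```python
from typing import Callable, Iterable, Optional


def find_culprit(
    candidates: Iterable[str],
    max_latency_without: Callable[[str], float],
    budget_ms: float,
) -> Optional[str]:
    """Return the first candidate whose removal brings max latency within budget."""
    for name in candidates:
        # Measure worst-case latency with this one process disabled.
        if max_latency_without(name) <= budget_ms:
            return name
    return None


# Hypothetical data: max latency (ms) observed with each process disabled.
simulated_max_latency_ms = {
    "log-shipper": 4800.0,
    "metrics-agent": 5000.0,
    "cadvisor": 45.0,
}

culprit = find_culprit(
    candidates=simulated_max_latency_ms,
    max_latency_without=simulated_max_latency_ms.__getitem__,
    budget_ms=60.0,
)
print(culprit)
```

In a real experiment each measurement would need to cover many millions of requests, since a one-in-a-million event is unlikely to show up in a short run.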

What changes were made to resolve the performance issue?

To resolve the performance issue, Pinterest disabled cAdvisor's working set size estimation feature across all PinCompute nodes. This single change removed the latency spikes observed during the migration.

Why is memory management important in high-performance applications?

Memory management is crucial in high-performance applications because inefficient memory handling can lead to significant latency spikes and performance degradation. In this case, the memory-intensive nature of the search system exacerbated the issue caused by cAdvisor's monitoring.

Key Statistics & Figures

Search request latency

100x longer than usual for one in every million requests

This significant increase in latency was a critical issue during the migration of Pinterest's search infrastructure.

Max serving latencies

up to 5 seconds

Normal maximum latencies were expected to stay under 60 ms, so spikes of up to 5 seconds indicated a severe performance regression.
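To put these figures in perspective, the short calculation below relates the 5 second worst case to the 60 ms expected maximum and estimates how often a one-in-a-million slow request would appear. The request rate of 10,000 per second is a made-up value for illustration; the source does not state Manas's traffic volume.

```python
def slowdown_factor(observed_ms: float, expected_ms: float) -> float:
    """Ratio of the observed worst-case latency to the expected maximum."""
    return observed_ms / expected_ms


def expected_slow_requests(
    requests_per_second: float,
    slow_fraction: float,
    window_seconds: float,
) -> float:
    """Expected number of slow requests in a time window."""
    return requests_per_second * slow_fraction * window_seconds


# Figures from the article: up to 5 s observed, under 60 ms expected.
factor = slowdown_factor(observed_ms=5000.0, expected_ms=60.0)

# Assumed traffic: 10,000 requests per second (hypothetical), over one hour.
per_hour = expected_slow_requests(
    requests_per_second=10_000,
    slow_fraction=1e-6,
    window_seconds=3600,
)

print(f"worst case is about {factor:.0f}x the expected maximum")
print(f"about {per_hour:.0f} slow requests per hour at the assumed rate")
```

Even a failure this rare recurs regularly at high request volumes, which is why it surfaced as timeouts rather than going unnoticed.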

Technologies & Tools

Orchestration

Kubernetes

Used to manage the deployment and scaling of the Manas search infrastructure.

Monitoring

cAdvisor

Used for exporting container-level metrics, but its working set size estimation caused memory contention with the memory-intensive search process and led to latency spikes.

Profiling

perf

Utilized for CPU and memory profiling to identify performance bottlenecks.

Key Actionable Insights

1. Implement rigorous performance testing before and after migrating to new infrastructure.
This ensures that any potential issues are identified early, allowing for timely resolutions before they affect users.

2. Utilize profiling tools to monitor CPU and memory usage during high-load scenarios.
Profiling can help pinpoint bottlenecks and identify processes that may interfere with application performance, as seen with cAdvisor in this case.

3. Consider the impact of monitoring tools on application performance.
Monitoring tools can introduce overhead, especially in memory-intensive applications, so it's essential to assess their configuration and impact during deployment.
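One practical consequence of these insights is that common percentile metrics can hide a failure this rare, so performance tests should also track the maximum and scan for extreme outliers. The sketch below uses synthetic latency samples. The sample values, the helper names, and the 100x threshold are assumptions chosen to mirror the figures in this summary.

```python
import math
import statistics
from typing import List, Tuple


def nearest_rank_percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in the range 0 to 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def find_outliers(samples: List[float], factor: float) -> List[Tuple[int, float]]:
    """Return (index, value) pairs at least `factor` times the median."""
    baseline = statistics.median(samples)
    return [(i, v) for i, v in enumerate(samples) if v >= factor * baseline]


# Synthetic data: 999 normal requests at 10 ms and one slow request.
samples = [10.0] * 999 + [1500.0]

p99 = nearest_rank_percentile(samples, 99)
worst = max(samples)
outliers = find_outliers(samples, factor=100.0)

print("p99:", p99)
print("max:", worst)
print("outliers:", outliers)
```

Here the 99th percentile stays at the baseline value while the maximum and the outlier scan expose the slow request. The same reasoning applies to a one-in-a-million event, just with far more samples.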

Common Pitfalls

1. Relying too heavily on monitoring tools without understanding their impact on performance.

This can lead to significant performance regressions, as seen with cAdvisor, whose working set size estimation introduced latency spikes in a memory-intensive workload.

2. Failing to conduct thorough performance testing during infrastructure migrations.

Without proper testing, critical performance issues may go unnoticed until they affect end users, leading to degraded service quality.

Related Concepts

Kubernetes Architecture and Best Practices

Performance Profiling Techniques

Memory Management in Distributed Systems