99% to 99.9% SLO: High Performance Kubernetes Control Plane at Pinterest

Pinterest Engineering

Overview

This article describes how the Cloud Runtime team at Pinterest improved the performance of its Kubernetes control plane, raising its Service Level Objective (SLO) from 99% to 99.9%. It explains why control plane latency matters, the challenges the team faced, and the optimizations that improved worker pool efficiency and shortened leadership switch times.

What You'll Learn

  1. How to measure control plane performance using SLI and SLO metrics
  2. Why optimizing worker pool efficiency is crucial for Kubernetes control plane performance
  3. How to implement a proactive leadership handover in Kubernetes

Prerequisites & Requirements

  • Understanding of Kubernetes control plane architecture
  • Experience with Kubernetes and cloud infrastructure (optional)

Key Questions Answered

What metrics are used to measure control plane performance at Pinterest?
Pinterest measures control plane performance using Service Level Indicators (SLIs) and Service Level Objectives (SLOs), specifically focusing on reconcile latency, which is the time taken from when a user change is received to when it propagates out of the control plane.
What challenges did Pinterest face in improving control plane performance?
Pinterest faced challenges including poor worker pool efficiency, where spikes in queue depth caused head-of-line blocking, and long leadership switch times, which degraded control plane performance during deployments or pod evictions.
How did Pinterest reduce leadership switch time in their control plane?
Pinterest reduced leadership switch time by making the backoff interval for leader election configurable, preloading informer caches in standby controller pods, and implementing readiness probes to ensure graceful rolling upgrades, resulting in an average switch time decrease from 64 seconds to 10 seconds.
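The effect of releasing the leader lock before exit can be illustrated with a toy simulation. This is a minimal sketch, not Pinterest's implementation and not the Kubernetes Lease API: the `Lease` class, the `switch_time` function, and the 15-second lease and 2-second retry values are all hypothetical, and informer cache preloading is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Toy stand-in for a leader-election lock record (hypothetical)."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float, duration: float) -> bool:
        # The lock is free if nobody holds it or the holder's lease has expired.
        if self.holder is None or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + duration
            return True
        return False

    def release(self, holder: str) -> None:
        # Proactive handover: the leader clears the lock before it exits.
        if self.holder == holder:
            self.holder = None


def switch_time(release_on_exit: bool, lease_duration: float, retry_period: float) -> float:
    """Seconds from the old leader exiting until the standby becomes leader."""
    lease = Lease()
    assert lease.try_acquire("leader-a", now=0.0, duration=lease_duration)
    exit_time = 1.0  # assumed: leader-a shuts down one second after acquiring
    if release_on_exit:
        lease.release("leader-a")
    now = exit_time
    # The standby retries on a fixed period until it acquires the lock.
    while not lease.try_acquire("standby-b", now=now, duration=lease_duration):
        now += retry_period
    return now - exit_time


passive = switch_time(release_on_exit=False, lease_duration=15.0, retry_period=2.0)
proactive = switch_time(release_on_exit=True, lease_duration=15.0, retry_period=2.0)
print(f"wait for expiry: {passive}s, proactive release: {proactive}s")
```

With these illustrative values the standby waits 14 seconds for the lease to expire, but takes over immediately when the old leader releases the lock. A shorter retry period narrows the gap further in the passive case, which is one reason a configurable interval helps.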

Key Statistics & Figures

Control plane SLO
99.9%
This is the current SLO achieved by Pinterest's control plane, improved from an initial 99%.
Average queue depth reduction
97%
The average queue depth during informer periodic resync has been reduced from 1,000 to 30.
Leadership switch time improvement
85%
The average control plane leadership switch time decreased from 64 seconds to 10 seconds.
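The figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names, the 5-second latency threshold, and the sample latencies are assumptions, since this summary does not give the threshold Pinterest uses for its reconcile latency SLI.

```python
from typing import Sequence


def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def allowed_slow_reconciles(slo: float, total: int) -> int:
    """Error budget: how many of `total` reconciles may miss the latency target."""
    return round((1.0 - slo) * total)


def reconcile_sli(latencies_s: Sequence[float], threshold_s: float) -> float:
    """SLI: fraction of reconciles that finished within the latency threshold."""
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return good / len(latencies_s)


# Queue depth 1,000 -> 30 and leadership switch time 64 s -> 10 s.
print(round(percent_reduction(1000, 30), 1))
print(round(percent_reduction(64, 10), 1))

# Moving from 99% to 99.9% shrinks the error budget tenfold.
print(allowed_slow_reconciles(0.99, 100_000), allowed_slow_reconciles(0.999, 100_000))

# Hypothetical sample: 998 fast reconciles and 2 slow ones, 5 s threshold (assumed).
sample = [0.5] * 998 + [12.0, 30.0]
sli = reconcile_sli(sample, threshold_s=5.0)
print(sli, sli >= 0.99, sli >= 0.999)
```

The queue depth change is a 97% reduction, and the switch time change works out to about 84.4%, which is close to the 85% quoted above. The sample SLI of 0.998 would meet a 99% SLO but miss 99.9%, which shows how little slack the tighter objective leaves.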

Technologies & Tools

Orchestration
Kubernetes
Used as the platform for managing containerized applications at Pinterest.
Backend
Controller Framework
Utilized for writing resource controllers within the Kubernetes control plane.
Backend
Workqueue Package
Provides metrics for gaining insights into worker pool efficiency.

Key Actionable Insights

  1. Prioritize user-triggered events in your Kubernetes control plane to enhance performance.
     By categorizing events based on their urgency, you can prevent head-of-line blocking and ensure that critical workloads are processed promptly.
  2. Implement proactive leadership handover to minimize downtime during leadership switches.
     This approach allows the current leader to release locks before exiting, significantly reducing the time spent in leadership transitions.
  3. Monitor leadership switch metrics to identify and address performance bottlenecks.
     Having detailed metrics on leadership transitions helps in understanding and optimizing the control plane's responsiveness.
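The first insight, prioritizing user-triggered events, can be sketched with a small priority queue. This is one possible approach under stated assumptions, not necessarily how Pinterest implemented it: the `PriorityWorkQueue` class, the two priority levels, and the item keys are hypothetical, and a production controller would also need deduplication, rate limiting, and thread safety.

```python
import heapq
import itertools
from typing import List, Tuple

USER_TRIGGERED = 0   # assumed: changes submitted by a user, handled first
PERIODIC_RESYNC = 1  # assumed: background events such as informer resyncs


class PriorityWorkQueue:
    """Minimal single-threaded queue that serves lower priority numbers first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # The counter keeps FIFO order among items with the same priority.
        self._counter = itertools.count()

    def add(self, key: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), key))

    def get(self) -> str:
        _, _, key = heapq.heappop(self._heap)
        return key

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityWorkQueue()
# A resync floods the queue with 1,000 low-urgency items...
for i in range(1000):
    queue.add(f"resync/pod-{i}", PERIODIC_RESYNC)
# ...and a user change arrives afterwards.
queue.add("user/deploy-42", USER_TRIGGERED)

first = queue.get()
print(first, len(queue))
```

In a plain FIFO queue the user change would wait behind all 1,000 resync items. Here it is dequeued first, which is the head-of-line blocking fix the insight describes.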

Common Pitfalls

  1. Failing to prioritize user-triggered events can lead to performance degradation.
     When user-triggered events are not handled promptly, it can cause delays in processing critical workloads, negatively impacting overall system performance.
  2. Ignoring the impact of leadership switch times can result in significant downtime.
     If leadership transitions are not optimized, they can lead to prolonged periods without a leader, causing delays in workload reconciliation.
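Periods without a leader are easier to spot when each transition is recorded and the gap is computed explicitly. The sketch below assumes a hypothetical event log of `(timestamp, kind)` pairs, where `kind` is `"released"` or `"acquired"`. The event names and the sample timestamps are illustrative only and do not come from any real metrics pipeline.

```python
from typing import List, Optional, Tuple

Event = Tuple[float, str]  # (seconds, "released" | "acquired"), hypothetical format


def leaderless_gaps(events: List[Event]) -> List[float]:
    """Return the duration of every interval in which no leader held the lock."""
    gaps: List[float] = []
    lost_at: Optional[float] = None
    for timestamp, kind in sorted(events):
        if kind == "released":
            lost_at = timestamp
        elif kind == "acquired" and lost_at is not None:
            gaps.append(timestamp - lost_at)
            lost_at = None
    return gaps


# Sample log: one slow transition and one fast transition (made-up timestamps).
events = [
    (0.0, "acquired"),
    (100.0, "released"),
    (164.0, "acquired"),
    (500.0, "released"),
    (510.0, "acquired"),
]
gaps = leaderless_gaps(events)
print(gaps, max(gaps), sum(gaps) / len(gaps))
```

Tracking the maximum as well as the average is a deliberate choice here, since a single long gap can stall reconciliation even when the average looks healthy.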

Related Concepts

Kubernetes Control Plane Architecture
Service Level Indicators And Objectives
Leader Election Mechanisms In Distributed Systems