Noisy Neighbor Detection with eBPF

Netflix Technology Blog
10 min read · advanced

Overview

This article describes how Netflix's Compute and Performance Engineering teams used eBPF for continuous, low-overhead instrumentation of the Linux scheduler to detect noisy neighbors in their multi-tenant environment. By measuring run queue latency, they gained deeper insight into performance degradation caused by containers contending for resources.

What You'll Learn

1. How to leverage eBPF for continuous monitoring of system performance
2. Why run queue latency is critical for identifying noisy neighbor issues
3. How to implement rate limiting in eBPF to manage data sampling

Prerequisites & Requirements

  • Understanding of Linux scheduling and containerization concepts
  • Familiarity with eBPF and its programming model (optional)

Key Questions Answered

How does eBPF help in detecting noisy neighbor issues?
eBPF allows for continuous, low-overhead instrumentation of the Linux scheduler, enabling real-time monitoring of run queue latency. This helps identify when a container is being affected by noisy neighbors, as it provides insights into CPU resource contention without significant performance degradation.
What metrics are essential for identifying noisy neighbors?
The two key metrics are run queue latency and sched.switch.out. An increase in both, particularly when it is caused by a different container or a system process, indicates a noisy neighbor. A spike in run queue latency alone does not, because a container that is being throttled at its own CPU limit shows the same symptom.
What challenges exist in detecting noisy neighbor problems?
Detecting noisy neighbor problems is hard for two reasons. Traditional performance analysis tools are complex and can add overhead of their own, and debugging these issues requires low-level expertise and specialized tooling. Together these make it difficult to pinpoint the source of the degradation.

Key Statistics & Figures

  • 99th percentile run queue latency: 83.4µs. This serves as the baseline for a container not contending for CPU on a host.
  • Spike in run queue latency: 131 milliseconds. This spike occurred when launching a container that fully utilized all CPUs on the host.
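
For readers unfamiliar with percentile figures like the ones above, here is a minimal Go sketch of a 99th percentile computed with the nearest-rank method. The summary does not say how the article's percentiles are calculated, so the method, the `percentile` function, and the sample values are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using the
// nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Copy so the caller's slice is not reordered.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Made-up latencies in microseconds: mostly short waits plus two long ones.
	samples := make([]float64, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 10)
	}
	samples = append(samples, 80, 131000)

	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p99:", percentile(samples, 99))
}
```

With these made-up samples, a single extreme value sits above the 99th percentile, which is why a sustained shift in p99 is a stronger signal than one slow scheduling event.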

Technologies & Tools

  • eBPF (Backend): Used for continuous monitoring of the Linux scheduler and performance metrics.
  • Go (Backend): Used for the userspace application that processes events from the eBPF ring buffer.
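
As a rough picture of what "processing events from the ring buffer" involves on the Go side, the sketch below decodes one fixed-size binary record using only the standard library. The `runqEvent` layout, its field names, and the little-endian byte order are assumptions; the real layout is whatever the eBPF program writes, and reading from an actual ring buffer requires an eBPF library that is not shown here.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// runqEvent is a hypothetical fixed-size record, not the article's struct.
type runqEvent struct {
	CgroupID     uint64 // container that was waiting for a CPU
	PrevCgroupID uint64 // container that was running before it
	LatencyNs    uint64 // time spent on the run queue
}

// decodeEvent parses one little-endian record taken from the ring buffer.
func decodeEvent(raw []byte) (runqEvent, error) {
	var ev runqEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

func main() {
	// Build a sample record by hand to stand in for ring buffer bytes.
	var buf bytes.Buffer
	sample := runqEvent{CgroupID: 11, PrevCgroupID: 22, LatencyNs: 83400}
	if err := binary.Write(&buf, binary.LittleEndian, sample); err != nil {
		panic(err)
	}

	ev, err := decodeEvent(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Printf("cgroup %d waited %d ns behind cgroup %d\n",
		ev.CgroupID, ev.LatencyNs, ev.PrevCgroupID)
}
```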

Key Actionable Insights

1. Implement continuous monitoring of run queue latency using eBPF to catch performance issues early. This approach allows for real-time insights into resource contention, enabling teams to address performance degradation proactively before it impacts users.
2. Utilize rate limiting in eBPF to manage the volume of data collected, ensuring that userspace applications do not become CPU-bound. By controlling the data flow, you can maintain application performance while still gathering valuable metrics for analysis.
3. Combine run queue latency metrics with sched.switch.out metrics to accurately diagnose noisy neighbor issues. This dual-metric approach helps differentiate between actual noisy neighbors and performance issues caused by containers hitting their CPU limits.
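
The rate-limiting idea in the second insight can be sketched as a per-key minimum gap between forwarded events. The Go code below is a userspace simulation under stated assumptions, not the article's eBPF code: the `limiter` type, the choice of a cgroup ID as the key, and the one-second gap are hypothetical, and timestamps are assumed to come from a monotonic clock.

```go
package main

import "fmt"

// limiter lets at most one event per key through in each minGapNs interval.
type limiter struct {
	minGapNs uint64
	lastSent map[uint64]uint64 // key (for example a cgroup ID) -> time of last forwarded event
}

func newLimiter(minGapNs uint64) *limiter {
	return &limiter{minGapNs: minGapNs, lastSent: make(map[uint64]uint64)}
}

// allow reports whether an event for key at time nowNs should be forwarded.
// Timestamps are assumed to be monotonic.
func (l *limiter) allow(key, nowNs uint64) bool {
	last, seen := l.lastSent[key]
	if seen && nowNs-last < l.minGapNs {
		return false // too soon after the previous event for this key
	}
	l.lastSent[key] = nowNs
	return true
}

func main() {
	const second = 1000000000 // nanoseconds
	l := newLimiter(second)

	// Five events for one hypothetical cgroup ID arriving within 1.5 seconds.
	arrivals := []uint64{0, 400000000, 900000000, 1000000000, 1500000000}
	forwarded := 0
	for _, ts := range arrivals {
		if l.allow(1, ts) {
			forwarded++
		}
	}
	fmt.Printf("forwarded %d of %d events\n", forwarded, len(arrivals))
}
```

Dropping events at the source, before they reach userspace, is what keeps the consumer from becoming CPU-bound; the cost is that the forwarded events are a sample rather than a complete record.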

Common Pitfalls

1. Relying solely on run queue latency can lead to misdiagnosis. If a container is at or over its CPU limit, the scheduler throttles it, and the resulting spike in run queue latency can be misattributed to noisy neighbors.
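
One way to guard against this pitfall is to look at who took the CPU whenever latency is high. The Go sketch below encodes that check as a simple heuristic. The `window` fields, the `classify` function, the verdict strings, and the threshold factor are hypothetical, and the decision rule is a deliberate simplification rather than the article's logic.

```go
package main

import "fmt"

// window summarizes one observation period for a container.
type window struct {
	P99LatencyUs     float64 // observed p99 run queue latency
	SwitchOutByOther int     // switch-outs where a different container or system process took the CPU
	SwitchOutBySelf  int     // switch-outs where the same container took the CPU
}

const (
	verdictOK    = "ok"
	verdictNoisy = "noisy neighbor"
	verdictSelf  = "likely CPU limit or self-contention"
)

// classify flags a noisy neighbor only when latency is well above the
// baseline AND most switch-outs were caused by someone else.
func classify(w window, baselineUs, factor float64) string {
	if w.P99LatencyUs <= baselineUs*factor {
		return verdictOK
	}
	if w.SwitchOutByOther > w.SwitchOutBySelf {
		return verdictNoisy
	}
	return verdictSelf
}

func main() {
	// Made-up counts paired with the latency figures quoted in this summary.
	w := window{P99LatencyUs: 131000, SwitchOutByOther: 900, SwitchOutBySelf: 50}
	fmt.Println(classify(w, 83.4, 10))
}
```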

Related Concepts

eBPF
Linux Scheduling
Containerization
Performance Monitoring