Optimizing M3: How Uber Halved Our Metrics Ingestion Latency by (Briefly) Forking the Go Compiler

Richard Artoul

Uber

Overview

This article describes how Uber's Observability team halved the metrics ingestion latency of M3 after briefly forking the Go compiler. It covers the investigation into a performance regression, the methods used to identify the root cause, and the new worker pool that improved performance.

What You'll Learn

1. How to diagnose performance regressions in Go applications
2. Why goroutine stack growth can impact performance in Go
3. How to implement a pooled worker pattern to optimize resource usage

Prerequisites & Requirements

Understanding of Go programming language internals
Experience with performance profiling tools (optional)

Key Questions Answered

How did Uber identify the root cause of the metrics ingestion latency issue?

Uber's engineers used CPU profiling and git bisect to trace the performance regression to a specific commit. A change to the Clone method in that commit caused excessive goroutine stack growth, which doubled metrics ingestion latency.
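
The CPU profiling step can be sketched with Go's standard `runtime/pprof` package. This is a minimal illustration, not the team's actual tooling: `busyWork` and `captureCPUProfile` are hypothetical names, and `busyWork` merely stands in for the hot path under investigation.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for the code path being investigated.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// captureCPUProfile runs fn while the CPU profiler is active and returns the
// raw profile bytes. Written to a file, they can be opened with `go tool pprof`.
func captureCPUProfile(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	fn()
	pprof.StopCPUProfile() // flushes the collected profile into buf
	return buf.Bytes(), nil
}

func main() {
	profile, err := captureCPUProfile(func() { busyWork(50_000_000) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of CPU profile data\n", len(profile))
}
```

In a profile of a service affected by this kind of regression, time attributed to `runtime.morestack` is the signal to look for.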

What changes were made to improve the M3 metrics ingestion performance?

To improve performance, Uber implemented a new worker pool that reused goroutines instead of creating new ones for each request. This change significantly reduced the time spent in the runtime.morestack function, leading to lower end-to-end latency.
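
A pool that reuses goroutines might look like the sketch below. It is an assumption-laden illustration rather than M3's implementation: `WorkerPool`, `NewWorkerPool`, `Go`, `Close`, and `sumWithPool` are hypothetical names, and the pool size is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
)

// WorkerPool runs submitted tasks on a fixed set of long-lived goroutines.
// (Hypothetical type for illustration; not M3's actual worker pool.)
type WorkerPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

// NewWorkerPool starts `workers` goroutines that live until Close is called.
func NewWorkerPool(workers int) *WorkerPool {
	p := &WorkerPool{tasks: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			// The same goroutine, and therefore the same stack, serves
			// every task this worker receives.
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// Go hands a task to a worker, blocking until one is free to take it.
func (p *WorkerPool) Go(task func()) { p.tasks <- task }

// Close stops accepting tasks and waits for in-flight tasks to finish.
func (p *WorkerPool) Close() {
	close(p.tasks)
	p.wg.Wait()
}

// sumWithPool is a small usage example: it adds 1..n using pool tasks.
func sumWithPool(workers, n int) int {
	pool := NewWorkerPool(workers)
	var mu sync.Mutex
	total := 0
	for i := 1; i <= n; i++ {
		v := i
		pool.Go(func() {
			mu.Lock()
			total += v
			mu.Unlock()
		})
	}
	pool.Close()
	return total
}

func main() {
	fmt.Println(sumWithPool(4, 100)) // prints 5050
}
```

Because each worker outlives a single request, a stack that has already grown to fit the workload does not need to grow again for the next task.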

What impact did the performance regression have on Uber's metrics ingestion?

The performance regression caused the P99 latency for metrics ingestion to increase from approximately 10 seconds to over 20 seconds, significantly affecting the loading times of Grafana dashboards and the responsiveness of automated alerts.

How did Uber's new worker pool affect the performance of the M3DB ingesters?

The new worker pool reduced the average number of stack growth occurrences in the M3DB ingesters from 15,685 to 171, a drop of roughly 99%.
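
The difference between fresh and reused goroutines can be illustrated with a small experiment. This is a sketch under stated assumptions, not Uber's measurement method: `deepCall`, `runFresh`, and `runReused` are hypothetical names, and the 1 KiB frame and recursion depth are arbitrary values chosen so that a call needs more stack than a new goroutine typically starts with.

```go
package main

import (
	"fmt"
	"time"
)

// deepCall recurses `depth` times with a 1 KiB frame-local buffer, so a deep
// call consumes a large amount of stack. It returns depth+1.
func deepCall(depth int) int {
	var pad [1024]byte
	pad[depth%len(pad)] = 1
	if depth == 0 {
		return int(pad[0])
	}
	return deepCall(depth-1) + int(pad[depth%len(pad)])
}

// runFresh executes each call on a brand-new goroutine.
func runFresh(n, depth int) int {
	total := 0
	done := make(chan int)
	for i := 0; i < n; i++ {
		go func() { done <- deepCall(depth) }()
		total += <-done
	}
	return total
}

// runReused executes every call on one long-lived goroutine.
func runReused(n, depth int) int {
	total := 0
	reqs, done := make(chan int), make(chan int)
	go func() {
		for d := range reqs {
			done <- deepCall(d)
		}
	}()
	for i := 0; i < n; i++ {
		reqs <- depth
		total += <-done
	}
	close(reqs)
	return total
}

func main() {
	const n, depth = 10000, 64
	start := time.Now()
	fresh := runFresh(n, depth)
	freshTime := time.Since(start)

	start = time.Now()
	reused := runReused(n, depth)
	reusedTime := time.Since(start)

	fmt.Println("results match:", fresh == reused)
	fmt.Println("fresh goroutines:", freshTime, "reused goroutine:", reusedTime)
}
```

Timings vary with Go version and hardware, so the printed durations should be read as a rough indication only.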

Key Statistics & Figures

P99 latency before regression

approximately 10 seconds

This was the P99 latency for metrics ingestion before the deployment that introduced the regression.

P99 latency after regression

over 20 seconds

This was the P99 latency observed after the deployment that introduced the regression.

Average occurrences of stack growth with regression

15,685

This was the average number of stack growth occurrences measured during the performance regression.

Average occurrences of stack growth with new worker pool

171

This was the average number of stack growth occurrences after implementing the new worker pool.

Technologies & Tools

Programming Language

Go

Used for developing the M3 metrics ingestion system.

Database

M3DB

Used for storing metrics data in Uber's observability platform.

Key Actionable Insights

1. Implement a pooled worker pattern to manage goroutines effectively in high-load systems.
This pattern can help mitigate the overhead of stack growth by reusing goroutines, which is especially beneficial in environments with fluctuating workloads.

2. Utilize git bisect for efficient troubleshooting of performance regressions in complex codebases.
This method allows engineers to pinpoint problematic commits even in large monorepos, facilitating quicker resolution of issues.

3. Monitor stack growth in Go applications to prevent performance degradation.
Understanding how the Go runtime manages goroutine stacks can help developers optimize their code and avoid unnecessary performance hits.
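
For the git bisect insight, a performance regression can be turned into a pass/fail check that `git bisect run` understands: exit status 0 marks a commit good, and status 1 marks it bad. The program below is a hedged sketch of such a check. `hotPath`, `medianDuration`, `bisectExitCode`, and the 50 ms budget are all hypothetical and would need to be replaced with the real code path and a threshold that separates fast from slow commits.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"time"
)

// hotPath is a hypothetical stand-in for the code path that regressed.
func hotPath() {
	data := make([]int, 100_000)
	for i := range data {
		data[i] = (i * 7919) % 10007
	}
	sort.Ints(data)
}

// medianDuration runs fn `runs` times (runs must be at least 1) and returns
// the median wall-clock time, which is less noisy than a single sample.
func medianDuration(runs int, fn func()) time.Duration {
	samples := make([]time.Duration, runs)
	for i := range samples {
		start := time.Now()
		fn()
		samples[i] = time.Since(start)
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	return samples[runs/2]
}

// bisectExitCode maps a measurement to an exit status for `git bisect run`:
// 0 means the commit is good, 1 means it is bad.
func bisectExitCode(measured, budget time.Duration) int {
	if measured > budget {
		return 1
	}
	return 0
}

func main() {
	const budget = 50 * time.Millisecond // assumed threshold, tune per workload
	measured := medianDuration(5, hotPath)
	fmt.Println("median:", measured, "budget:", budget)
	os.Exit(bisectExitCode(measured, budget))
}
```

With the program saved in a hypothetical directory such as `./bisectcheck`, running `git bisect run go run ./bisectcheck` after marking one good and one bad commit would let git walk the history automatically.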

Common Pitfalls

1. Overlooking the impact of goroutine stack growth on performance.

Developers may not realize that excessive stack growth can lead to significant latency issues, especially in high-load scenarios. It's crucial to monitor and manage goroutine usage effectively.
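
A lightweight way to keep an eye on goroutine usage is to sample the gauges the Go runtime already exposes. The sketch below uses `runtime.NumGoroutine` and the `StackInuse` field of `runtime.MemStats`; the `StackSnapshot` type and `takeSnapshot` function are hypothetical names. These gauges show how many goroutines exist and how much memory their stacks occupy, but they do not count individual stack growth events.

```go
package main

import (
	"fmt"
	"runtime"
)

// StackSnapshot holds two runtime gauges related to goroutine stacks.
// (Hypothetical type for illustration.)
type StackSnapshot struct {
	Goroutines int    // number of goroutines that currently exist
	StackInuse uint64 // bytes in stack spans currently in use
}

// takeSnapshot reads the current values from the Go runtime.
func takeSnapshot() StackSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return StackSnapshot{
		Goroutines: runtime.NumGoroutine(),
		StackInuse: m.StackInuse,
	}
}

func main() {
	before := takeSnapshot()

	// Start goroutines that block until released, so they stay alive
	// while the second snapshot is taken.
	stop := make(chan struct{})
	for i := 0; i < 1000; i++ {
		go func() { <-stop }()
	}

	after := takeSnapshot()
	close(stop)

	fmt.Printf("goroutines: %d -> %d\n", before.Goroutines, after.Goroutines)
	fmt.Printf("stack in use: %d -> %d bytes\n", before.StackInuse, after.StackInuse)
}
```

Exporting these values periodically to a metrics system makes unexpected jumps in goroutine count or stack memory visible alongside latency graphs.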

Related Concepts

Performance Profiling

Concurrency Patterns In Go

Goroutine Management