Overview
The article discusses how Uber's Observability team optimized their M3 metrics ingestion system by forking the Go compiler, ultimately halving the latency of metrics ingestion. It details the investigation into a performance regression, the methods used to identify the root cause, and the implementation of a new worker pool to improve performance.
What You'll Learn
1. How to diagnose performance regressions in Go applications
2. Why goroutine stack growth can impact performance in Go
3. How to implement a pooled worker pattern to optimize resource usage
Prerequisites & Requirements
- Understanding of Go programming language internals
- Experience with performance profiling tools (optional)
Key Questions Answered
How did Uber identify the root cause of the metrics ingestion latency issue?
Uber identified the root cause by using CPU profiling and git bisect to trace the performance regression to a specific commit. They discovered that a change in the Clone method was causing excessive stack growth, leading to doubled latency in metrics ingestion.
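As a rough illustration of the CPU profiling step, the sketch below captures a profile around a piece of work using the standard `runtime/pprof` package. It is not taken from the article: `busyWork` and `captureCPUProfile` are hypothetical names, and the workload is a placeholder for whatever code path is under suspicion.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for the code path being investigated.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// captureCPUProfile (hypothetical helper) runs fn while the CPU profiler is
// active and returns the raw, gzip-compressed profile data.
func captureCPUProfile(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	fn()
	// StopCPUProfile returns only after the profile has been fully written.
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	profile, err := captureCPUProfile(func() { busyWork(50000000) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of CPU profile data\n", len(profile))
}
```

In practice the bytes would be written to a file and inspected with `go tool pprof`, where a large share of time under `runtime.morestack` would point toward stack growth.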
What changes were made to improve the M3 metrics ingestion performance?
To improve performance, Uber implemented a new worker pool that reused goroutines instead of creating new ones for each request. This change significantly reduced the time spent in the runtime.morestack function, leading to lower end-to-end latency.
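The goroutine-reuse idea can be sketched as follows. This is a minimal illustration under stated assumptions, not Uber's actual worker pool: the names `WorkerPool`, `NewWorkerPool`, `Submit`, `Close`, and `sumSquares` are hypothetical, and the pool size and workload are arbitrary.

```go
package main

import (
	"fmt"
	"sync"
)

// WorkerPool is a hypothetical fixed-size pool. Its goroutines are started
// once and reused for every task instead of being created per request.
type WorkerPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

// NewWorkerPool starts `workers` long-lived goroutines that pull tasks from
// a shared channel.
func NewWorkerPool(workers int) *WorkerPool {
	p := &WorkerPool{tasks: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// Submit blocks until one of the workers is free to take the task.
func (p *WorkerPool) Submit(task func()) { p.tasks <- task }

// Close stops accepting tasks and waits for in-flight tasks to finish.
func (p *WorkerPool) Close() {
	close(p.tasks)
	p.wg.Wait()
}

// sumSquares runs n small tasks on the pool and returns 1*1 + 2*2 + ... + n*n.
func sumSquares(workers, n int) int {
	pool := NewWorkerPool(workers)
	var mu sync.Mutex
	total := 0
	for i := 1; i <= n; i++ {
		i := i // capture the loop variable for older Go versions
		pool.Submit(func() {
			mu.Lock()
			total += i * i
			mu.Unlock()
		})
	}
	pool.Close()
	return total
}

func main() {
	fmt.Println(sumSquares(4, 10)) // prints 385
}
```

Because the workers are long-lived, a stack that has already grown to fit a deep call chain can be reused by later tasks, which is the effect the summary attributes to the reduced time in `runtime.morestack`.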
What impact did the performance regression have on Uber's metrics ingestion?
The performance regression caused the P99 latency for metrics ingestion to increase from approximately 10 seconds to over 20 seconds, significantly affecting the loading times of Grafana dashboards and the responsiveness of automated alerts.
How did Uber's new worker pool affect the performance of the M3DB ingesters?
The new worker pool implementation reduced the average number of stack growth occurrences from 15,685 to just 171, demonstrating a substantial improvement in resource management and performance efficiency in the M3DB ingesters.
Key Statistics & Figures
P99 latency before regression
approximately 10 seconds
This was the P99 latency for metrics ingestion before the deployment that introduced the regression.
P99 latency after regression
over 20 seconds
This was the latency observed after the deployment that caused the performance regression.
Average occurrences of stack growth with regression
15,685
This was the average number of stack growth occurrences measured during the performance regression.
Average occurrences of stack growth with new worker pool
171
This was the average number of stack growth occurrences after implementing the new worker pool.
Technologies & Tools
Programming Language
Go
Used for developing the M3 metrics ingestion system.
Database
M3DB
Used for storing metrics data in Uber's observability platform.
Key Actionable Insights
1. Implement a pooled worker pattern to manage goroutines effectively in high-load systems. This pattern can help mitigate the overhead of stack growth by reusing goroutines, which is especially beneficial in environments with fluctuating workloads.
2. Utilize git bisect for efficient troubleshooting of performance regressions in complex codebases. This method allows engineers to pinpoint problematic commits even in large monorepos, facilitating quicker resolution of issues.
3. Monitor stack growth in Go applications to prevent performance degradation. Understanding how the Go runtime manages goroutine stacks can help developers optimize their code and avoid unnecessary performance hits.
Common Pitfalls
1. Overlooking the impact of goroutine stack growth on performance.
Developers may not realize that excessive stack growth can lead to significant latency issues, especially in high-load scenarios. It's crucial to monitor and manage goroutine usage effectively.
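To make the pitfall concrete, the sketch below compares running a deep call chain on a fresh goroutine each time against running it on one long-lived goroutine. It is an illustrative experiment, not a measurement from the article: `deepCall`, `runFresh`, and `runReused` are hypothetical names, the frame size and depth are arbitrary, and timings will vary by Go version and machine (the runtime may also shrink an idle goroutine's stack during garbage collection).

```go
package main

import (
	"fmt"
	"time"
)

// deepCall recurses `depth` times. The local array is intended to make each
// frame roughly 1 KiB so that a small initial stack has to grow.
// It returns depth + 1.
func deepCall(depth int) int {
	var pad [128]int
	pad[depth%len(pad)] = 1
	if depth == 0 {
		return pad[0]
	}
	return pad[depth%len(pad)] + deepCall(depth-1)
}

// runFresh starts a new goroutine for every call, so each call typically
// begins on a small stack that must grow again.
func runFresh(calls, depth int) int {
	results := make(chan int)
	total := 0
	for i := 0; i < calls; i++ {
		go func() { results <- deepCall(depth) }()
		total += <-results
	}
	return total
}

// runReused sends every call to one long-lived goroutine, whose stack can
// stay grown after the first call.
func runReused(calls, depth int) int {
	requests := make(chan int)
	results := make(chan int)
	go func() {
		for d := range requests {
			results <- deepCall(d)
		}
	}()
	total := 0
	for i := 0; i < calls; i++ {
		requests <- depth
		total += <-results
	}
	close(requests)
	return total
}

func main() {
	const calls, depth = 2000, 200

	start := time.Now()
	fresh := runFresh(calls, depth)
	freshTime := time.Since(start)

	start = time.Now()
	reused := runReused(calls, depth)
	reusedTime := time.Since(start)

	fmt.Println("fresh goroutines:", fresh, freshTime)
	fmt.Println("reused goroutine:", reused, reusedTime)
}
```

Both variants compute the same total; only the elapsed times differ, and a CPU profile of the fresh-goroutine variant is where time under `runtime.morestack` would be expected to show up.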
Related Concepts
Performance Profiling
Concurrency Patterns In Go
Goroutine Management