Using Machine Learning to Ensure the Capacity Safety of Individual Microservices

Ranjib Dey, Shrey Desai, Ruogu Du
13 min read, advanced

Overview

The article discusses how Uber employs machine learning to ensure the capacity safety of individual microservices, addressing challenges related to predicting service-level capacity requirements. It highlights the importance of accurate forecasting and the development of in-house tooling to monitor and test microservices effectively.

What You'll Learn

1. How to implement machine learning for capacity safety in microservices
2. Why accurate forecasting is essential for preventing outages
3. How to conduct capacity safety tests using machine learning forecasts

Prerequisites & Requirements

  • Understanding of microservices architecture and reliability engineering concepts
  • Familiarity with machine learning frameworks and API usage (optional)

Key Questions Answered

How does Uber ensure the capacity safety of its microservices?
Uber ensures capacity safety by using machine learning to forecast service metrics like requests per second, latency, and CPU usage. This allows teams to conduct capacity safety tests that help prevent outages by ensuring that microservices can handle historical peaks and forecasts of concurrent users without resource starvation.
What role does machine learning play in Uber's capacity safety measures?
Machine learning at Uber is used to forecast core service metrics, enabling accurate capacity safety tests. By analyzing historical data and trends, the system can predict potential capacity issues and help engineers ensure that microservices are provisioned correctly to handle expected loads.
What challenges does Uber face in predicting service-level capacity requirements?
Predicting service-level capacity is hard for Uber because of its global scale and the complexity of its microservice call patterns. Inaccurate predictions can lead to capacity-related outages, which significantly impact service reliability and user experience.
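
The idea of testing a service against both its historical peak and its forecast can be sketched as below. This is a minimal illustration, not Uber's tooling: the function names, the `headroom` multiplier, and the use of requests per second as the only metric are all assumptions made for the example.

```python
def target_rps(historical_rps, forecast_rps, headroom=1.1):
    """Return the request rate a capacity safety test should reach.

    The target is the larger of the historical peak and the forecast peak,
    scaled by `headroom` (a hypothetical safety multiplier).
    """
    if not historical_rps and not forecast_rps:
        raise ValueError("need at least one historical or forecast sample")
    peak = max(max(historical_rps, default=0.0), max(forecast_rps, default=0.0))
    return peak * headroom


def is_capacity_safe(sustained_rps, historical_rps, forecast_rps, headroom=1.1):
    """True if the load the service sustained in a test meets the target."""
    return sustained_rps >= target_rps(historical_rps, forecast_rps, headroom)


if __name__ == "__main__":
    history = [100, 250, 180]   # observed requests per second
    forecast = [200, 300]       # forecast requests per second
    print(target_rps(history, forecast, headroom=1.5))
    print(is_capacity_safe(500, history, forecast, headroom=1.5))
```

A real test would also watch latency and CPU usage while the load is applied, since the article names those as forecast metrics too.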

Key Statistics & Figures

wMAPE (weighted mean absolute percentage error) score for forecasts
10-15 percent
Service owners have found forecasts scoring within this range to be informative and reliable.
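
For readers unfamiliar with the metric, here is a small sketch of one common definition of wMAPE: the sum of absolute errors divided by the sum of absolute actual values. The article summary does not spell out the exact formula used, so treat this definition and the function name `wmape` as assumptions.

```python
def wmape(actual, forecast):
    """Weighted mean absolute percentage error, as a fraction (0.1 == 10%).

    Computed as sum(|actual - forecast|) / sum(|actual|), so large-traffic
    periods weigh more than quiet ones.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("wMAPE is undefined when all actual values are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / total


if __name__ == "__main__":
    observed = [100, 200, 100]
    predicted = [90, 220, 110]
    print(f"wMAPE: {wmape(observed, predicted):.1%}")
```

Under this definition, a score of 10-15 percent means the total absolute error is roughly a tenth to a seventh of the total observed traffic.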

Technologies & Tools

Backend
M3
Used for storing and querying service metrics to support forecasting.
Storage
Dosa
An open-source storage solution built on Apache Cassandra for time series data.

Key Actionable Insights

1. Implement localized, automated capacity safety tests for individual microservices to improve reliability.
   This approach allows for more accurate testing tailored to the specific usage patterns of each service, reducing the impact on ongoing deployments and enhancing developer experience.
2. Utilize machine learning forecasts to inform capacity planning and resource allocation.
   By leveraging historical data and trends, teams can proactively adjust resources to meet anticipated demand, minimizing the risk of outages.
3. Conduct regular backtesting of forecasting models to ensure their accuracy and reliability.
   This practice helps identify any discrepancies between expected and actual performance, allowing for timely adjustments to forecasting strategies.
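
Backtesting can be illustrated with a rolling-origin loop: train on data up to a cutoff, forecast the next window, compare against what actually happened, then move the cutoff forward. The sketch below is only an illustration. The seasonal-naive model stands in for whatever forecasting model is actually used, and `backtest`, `seasonal_naive`, and their parameters are hypothetical names.

```python
def seasonal_naive(history, horizon, season):
    """Stand-in model: repeat the last full season of the history."""
    last = history[-season:]
    return [last[i % season] for i in range(horizon)]


def backtest(series, horizon, season, min_train):
    """Return one error score per backtest window.

    Each score is sum(|actual - predicted|) / sum(|actual|) over the window,
    so it assumes every window contains some non-zero traffic.
    """
    if min_train < season:
        raise ValueError("min_train must cover at least one season")
    scores = []
    for cutoff in range(min_train, len(series) - horizon + 1, horizon):
        train = series[:cutoff]                      # data the model may see
        actual = series[cutoff:cutoff + horizon]     # held-out window
        predicted = seasonal_naive(train, horizon, season)
        abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
        scores.append(abs_err / sum(abs(a) for a in actual))
    return scores


if __name__ == "__main__":
    rps = [10, 20, 10, 20, 12, 24, 12, 24]
    print(backtest(rps, horizon=2, season=2, min_train=2))
```

A score that drifts upward across windows is the kind of discrepancy that would prompt a change to the forecasting strategy.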

Common Pitfalls

1. Failing to account for data center failovers can lead to inaccurate forecasts.
   Data center failovers introduce anomalies in the data, which must be identified and removed to maintain the integrity of the forecasting models.
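
One simple way to remove failover anomalies, assuming the failover windows are already known (for example from an incident log), is to drop the samples that fall inside them before training. The summary does not say how anomalies are identified in practice, so the function name, the `(start, end)` window format, and the integer timestamps here are assumptions for illustration only.

```python
def drop_failover_windows(points, windows):
    """Remove samples recorded during known failover windows.

    points:  list of (timestamp, value) pairs
    windows: list of (start, end) pairs; start is inclusive, end is exclusive
    """
    return [
        (t, v)
        for t, v in points
        if not any(start <= t < end for start, end in windows)
    ]


if __name__ == "__main__":
    # Traffic roughly quadruples at t=2 and t=3 while another
    # data center's load is shifted onto this one.
    samples = [(0, 100), (1, 105), (2, 400), (3, 390), (4, 98)]
    failovers = [(2, 4)]
    print(drop_failover_windows(samples, failovers))
```

Dropping points leaves gaps in the series, so a model that expects evenly spaced data would need those gaps filled or handled explicitly.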

Related Concepts

Reliability Engineering
Microservices Architecture
Machine Learning Forecasting Techniques