Overview
The article discusses how Uber employs machine learning to ensure the capacity safety of individual microservices, addressing challenges related to predicting service-level capacity requirements. It highlights the importance of accurate forecasting and the development of in-house tooling to monitor and test microservices effectively.
What You'll Learn
1. How to implement machine learning for capacity safety in microservices
2. Why accurate forecasting is essential for preventing outages
3. How to conduct capacity safety tests using machine learning forecasts
Prerequisites & Requirements
- Understanding of microservices architecture and reliability engineering concepts
- Familiarity with machine learning frameworks and API usage (optional)
Key Questions Answered
How does Uber ensure the capacity safety of its microservices?
Uber ensures capacity safety by using machine learning to forecast service metrics like requests per second, latency, and CPU usage. This allows teams to conduct capacity safety tests that help prevent outages by ensuring that microservices can handle historical peaks and forecasts of concurrent users without resource starvation.
What role does machine learning play in Uber's capacity safety measures?
Machine learning at Uber is used to forecast core service metrics, enabling accurate capacity safety tests. By analyzing historical data and trends, the system can predict potential capacity issues and help engineers ensure that microservices are provisioned correctly to handle expected loads.
What challenges does Uber face in predicting service-level capacity requirements?
Uber faces challenges in predicting service-level capacity due to the complexity of its global scale and microservice call patterns. Inaccurate predictions can lead to capacity-related outages, which significantly impact service reliability and user experience.
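The capacity check described in the answers above can be sketched as follows. This is a minimal illustration, not Uber's tooling: the function names, the `buffer` parameter, and the idea of comparing one load-tested requests-per-second figure against the target are all assumptions made for the example.

```python
def target_rps(historical_rps, forecast_rps, buffer=0.0):
    """Return the load a service should be able to sustain.

    The target is the larger of the historical peak and the forecast peak,
    optionally inflated by a safety buffer (hypothetical parameter).
    """
    if not historical_rps or not forecast_rps:
        raise ValueError("both series must be non-empty")
    peak = max(max(historical_rps), max(forecast_rps))
    return peak * (1.0 + buffer)


def is_capacity_safe(tested_rps, historical_rps, forecast_rps, buffer=0.0):
    """True if the load level the service passed covers the target."""
    return tested_rps >= target_rps(historical_rps, forecast_rps, buffer)


# Example with made-up numbers: the forecast peak (170) exceeds the
# historical peak (150), so the forecast sets the target.
history = [100, 150, 120]
forecast = [130, 170]
print(is_capacity_safe(200, history, forecast))  # service passed a 200 RPS test
```

Taking the maximum of the two peaks means a service is never tested below a load it has already seen in production, even when the forecast is lower.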
Key Statistics & Figures
wMAPE (weighted mean absolute percentage error) score for forecasts
10-15 percent
Service owners have found forecasts with a wMAPE in this range to be informative and reliable.
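For reference, wMAPE is commonly defined as the sum of absolute forecast errors divided by the sum of absolute actual values. The sketch below uses that standard definition; the summary does not state the exact variant Uber computes, so treat the function as an assumption.

```python
def wmape(actual, forecast):
    """Weighted mean absolute percentage error.

    Computed as sum(|actual - forecast|) / sum(|actual|), returned as a
    fraction (0.10 means 10 percent).
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    denominator = sum(abs(a) for a in actual)
    if denominator == 0:
        raise ValueError("wMAPE is undefined when all actual values are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / denominator


# Made-up requests-per-second values: the absolute errors are 10, 20 and 30,
# and the actuals sum to 600, so the score is 60 / 600.
score = wmape([100, 200, 300], [110, 180, 330])
print(f"{score:.1%}")
```

Because errors are weighted by the size of the actual values, high-traffic periods dominate the score, which suits capacity work where peaks matter most.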
Technologies & Tools
Backend
M3
Used for storing and querying service metrics to support forecasting.
Storage
Dosa
An open-source storage solution built on Apache Cassandra for time series data.
Key Actionable Insights
1. Implement localized, automated capacity safety tests for individual microservices to improve reliability. This approach allows for more accurate testing tailored to the specific usage patterns of each service, reducing the impact on ongoing deployments and enhancing developer experience.
2. Utilize machine learning forecasts to inform capacity planning and resource allocation. By leveraging historical data and trends, teams can proactively adjust resources to meet anticipated demand, minimizing the risk of outages.
3. Conduct regular backtesting of forecasting models to ensure their accuracy and reliability. This practice helps identify any discrepancies between expected and actual performance, allowing for timely adjustments to forecasting strategies.
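Backtesting can be sketched as a rolling-origin evaluation: hold out the most recent windows, forecast each from the data before it, and score the result. The example below is illustrative only. The seasonal-naive forecaster is a stand-in for whatever model is actually used, and the `horizon`, `season`, and `folds` parameters are hypothetical names.

```python
def wmape(actual, forecast):
    """sum(|actual - forecast|) / sum(|actual|), as a fraction."""
    denominator = sum(abs(a) for a in actual)
    if denominator == 0:
        raise ValueError("wMAPE is undefined when all actual values are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / denominator


def seasonal_naive(history, horizon, season):
    """Placeholder model: repeat the last full season of observations."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


def backtest(series, horizon, season, folds):
    """Score the model on the last `folds` windows of length `horizon`.

    Each fold trains only on data before its cutoff, so no future values
    leak into the forecast being scored.
    """
    scores = []
    for k in range(folds, 0, -1):
        cutoff = len(series) - k * horizon
        if cutoff < season:
            raise ValueError("not enough history for the requested folds")
        train, actual = series[:cutoff], series[cutoff:cutoff + horizon]
        scores.append(wmape(actual, seasonal_naive(train, horizon, season)))
    return scores


# Made-up hourly load with a repeating 4-step pattern and a late increase.
series = [10, 20, 30, 20] * 4 + [12, 22, 33, 22]
print(backtest(series, horizon=4, season=4, folds=2))
```

A score that worsens on the most recent fold, as it does here, is the kind of discrepancy that would prompt a closer look at the model.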
Common Pitfalls
1. Failing to account for data center failovers can lead to inaccurate forecasts. Data center failovers introduce anomalies in the data, which must be identified and removed to maintain the integrity of the forecasting models.
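One simple way to flag such anomalies is a median absolute deviation (MAD) filter, sketched below. The summary does not say how Uber detects failover periods, so this technique, the `threshold` value, and the function name are assumptions chosen for illustration. A global median is also crude for strongly seasonal traffic, where a per-season or rolling baseline would be more appropriate.

```python
from statistics import median


def drop_anomalies(values, threshold=5.0):
    """Split a series into kept values and indices of flagged anomalies.

    A point is flagged when its distance from the median exceeds
    `threshold` times the median absolute deviation (MAD).
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread to scale against; keep everything rather than guess.
        return list(values), []
    flagged = [i for i, v in enumerate(values)
               if abs(v - center) > threshold * mad]
    flagged_set = set(flagged)
    kept = [v for i, v in enumerate(values) if i not in flagged_set]
    return kept, flagged


# Made-up per-region traffic: the 250 mimics load shifted in by a failover.
traffic = [100, 102, 98, 101, 99, 250, 100, 103]
cleaned, anomalies = drop_anomalies(traffic)
print(anomalies)  # index of the flagged point
```

Median and MAD are used instead of mean and standard deviation because a large failover spike would otherwise inflate the very statistics used to detect it.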
Related Concepts
Reliability Engineering
Microservices Architecture
Machine Learning Forecasting Techniques