Kurtis Nusbaum, Tim Miller, Brandon Bercovich, Bharath Siravara • 16 min read • advanced
Overview
The article discusses Uber's next-generation infrastructure stack, named Crane, which aims to modernize and scale Uber's infrastructure for a hybrid, multi-cloud environment. It covers the challenges faced with the legacy system, the design principles behind Crane, and the key features that enhance operational efficiency and automation.
What You'll Learn
1. How to automate zone turn-up processes for infrastructure
2. Why host homogeneity is crucial for operational efficiency
3. How to implement centralized bad host detection and remediation
4. When to use Infrastructure-as-Code for configuration management
Prerequisites & Requirements
- Understanding of cloud infrastructure concepts
- Familiarity with Infrastructure-as-Code tools (optional)
Key Questions Answered
What challenges did Uber face with its legacy infrastructure?
Uber faced a rapidly growing server fleet, manual operations that led to outages, and limited use of cloud capacity. These problems prompted a redesign of its infrastructure for scalability and efficiency.
How does Crane automate the zone turn-up process?
Crane automates the zone turn-up process by utilizing Infrastructure-as-Code components that can be executed from an engineer's laptop. This allows for rapid provisioning of new zones, reducing the time taken from months to just a few days.
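The idea of turning up a zone from declarative components can be sketched as follows. This is a minimal illustration in Python, not Crane's Starlark code: the component names, the `Component` type, and the `turn_up` function are all hypothetical, and the provisioning step is replaced by a log entry.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Component:
    """One declarative piece of zonal infrastructure (hypothetical)."""
    name: str
    depends_on: tuple[str, ...] = ()


def plan_turn_up(components: list[Component]) -> list[str]:
    """Return component names in an order that respects their dependencies."""
    graph = {c.name: set(c.depends_on) for c in components}
    return list(TopologicalSorter(graph).static_order())


def turn_up(zone: str, components: list[Component]) -> list[str]:
    """Apply each component in dependency order and return a log of the steps."""
    log = []
    for name in plan_turn_up(components):
        # A real system would call a provisioning API here; this sketch only records the step.
        log.append(f"{zone}: applied {name}")
    return log


# Hypothetical components for a new zone; the names are assumptions for illustration.
ZONE_COMPONENTS = [
    Component("network"),
    Component("dns", depends_on=("network",)),
    Component("host-provisioning", depends_on=("network", "dns")),
    Component("scheduler", depends_on=("host-provisioning",)),
]

if __name__ == "__main__":
    for line in turn_up("zone-a", ZONE_COMPONENTS):
        print(line)
```

Declaring dependencies instead of hand-ordering steps is what lets the same definition be replayed for every new zone.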
What is the role of the Bad Host Detector in Crane?
The Bad Host Detector (BHD) centralizes the detection of hardware issues across all zones. It scans hosts for problems and coordinates remediation actions, ensuring that faulty hardware is quickly addressed without manual intervention.
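A centralized scan-then-remediate loop might look like the sketch below. It is an assumption-laden illustration, not the BHD implementation: the `Host` fields, the two checks, and the remediation strings are hypothetical stand-ins for whatever signals and actions the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Host:
    """Hypothetical host record; the error counters are illustrative signals."""
    name: str
    zone: str
    disk_errors: int = 0
    memory_errors: int = 0


# A check returns a problem description, or None if the host looks healthy.
Check = Callable[[Host], Optional[str]]


def check_disk(host: Host) -> Optional[str]:
    return "disk errors" if host.disk_errors > 0 else None


def check_memory(host: Host) -> Optional[str]:
    return "memory errors" if host.memory_errors > 0 else None


def scan(hosts: list[Host], checks: list[Check]) -> dict[str, list[str]]:
    """Run every check on every host, in any zone; keep only hosts with problems."""
    findings: dict[str, list[str]] = {}
    for host in hosts:
        problems = []
        for check in checks:
            problem = check(host)
            if problem is not None:
                problems.append(problem)
        if problems:
            findings[host.name] = problems
    return findings


def remediate(findings: dict[str, list[str]]) -> list[str]:
    """Turn findings into remediation actions (plain strings in this sketch)."""
    return [f"drain and repair {name}: {', '.join(problems)}"
            for name, problems in findings.items()]


if __name__ == "__main__":
    fleet = [
        Host("h1", "zone-a"),
        Host("h2", "zone-b", disk_errors=3),
        Host("h3", "zone-a", disk_errors=1, memory_errors=2),
    ]
    for action in remediate(scan(fleet, [check_disk, check_memory])):
        print(action)
```

Keeping the checks in one list, applied to every zone, is the point of centralization: a new check is written once rather than by each team.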
How does Crane ensure host homogeneity?
Crane enforces host homogeneity by standardizing the operating system across all servers and including only essential services in it. This reduces complexity and makes fleet-wide updates and troubleshooting easier.
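One way to picture a homogeneity check is as a comparison of each host's OS image against a single expected image. The sketch below assumes a hypothetical image identifier per host (for example `os-v42`); the source does not say how Crane tracks images, so the data shape and function names are illustrative only.

```python
from collections import Counter


def find_drift(host_images: dict[str, str], expected_image: str) -> dict[str, str]:
    """Return {host: image} for every host not running the expected image."""
    return {host: image
            for host, image in sorted(host_images.items())
            if image != expected_image}


def image_summary(host_images: dict[str, str]) -> Counter:
    """Count how many hosts run each image, useful for tracking a rollout."""
    return Counter(host_images.values())


if __name__ == "__main__":
    # Hypothetical fleet inventory: host name -> OS image identifier.
    fleet = {"h1": "os-v42", "h2": "os-v41", "h3": "os-v42"}
    print(find_drift(fleet, "os-v42"))
    print(image_summary(fleet))
```

With a single expected image, an empty drift report means the whole fleet can be updated or debugged the same way.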
Key Statistics & Figures
Time to turn up a new zone: 3 days
This is the current record time for bootstrapping zonal infrastructure using Crane's automated tools.
Server fleet size: 100,000+ servers
Crane's tooling is designed to support a fleet of over 100,000 servers, highlighting its scalability.
Technologies & Tools
Configuration Management: Starlark
Used for developing Infrastructure-as-Code components that automate zone turn-up processes.
Key Actionable Insights
1. Implementing Infrastructure-as-Code can significantly reduce the time required for provisioning new infrastructure. By automating the zone turn-up process, teams can focus on higher-value tasks rather than manual configurations, leading to improved operational efficiency.
2. Centralizing bad host detection can prevent duplicated efforts across teams. With a unified approach to detecting hardware issues, teams can avoid confusion and streamline their remediation processes, enhancing overall system reliability.
3. Standardizing the operating system across all hosts simplifies maintenance and upgrades. This approach minimizes the risk of errors during updates and allows for quicker responses to security vulnerabilities, ensuring a more secure infrastructure.
Common Pitfalls
1. Failing to standardize host configurations can lead to operational inefficiencies. Without a uniform operating system across all servers, teams may struggle with inconsistent environments, making updates and troubleshooting more complex.
Related Concepts
Infrastructure-as-Code
Cloud Infrastructure Management
Automation in DevOps