Metal as a Service (MaaS): DIY server-management at scale

LinkedIn Engineering Team

Overview

This article describes Metal as a Service (MaaS), an internal tool LinkedIn built to manage server lifecycles at scale. It traces how MaaS evolved from a manual process built on Jira tickets into a self-service API that lets Site Reliability Engineers (SREs) run server upgrades and other lifecycle operations themselves.

What You'll Learn

1. How to use Metal as a Service (MaaS) for server lifecycle management
2. Why decoupling overlapping-hosts computation improves server upgrade efficiency
3. How to leverage Kafka for distributed messaging in server management

Prerequisites & Requirements

  • Understanding of server lifecycle management concepts
  • Familiarity with REST APIs and HTTP protocols

Key Questions Answered

What is Metal as a Service (MaaS) and how does it function?
Metal as a Service (MaaS) is an internal tool at LinkedIn that allows Site Reliability Engineers (SREs) to manage server lifecycles through a self-service API. It enables actions such as reimaging, rebooting, and decommissioning servers in batches, streamlining the process that previously relied on manual Jira ticket submissions.
How did MaaS evolve from its minimum viable product (MVP) stage?
Initially, MaaS could only process one request every two minutes due to its design constraints. As it evolved, the architecture was improved to handle multiple requests simultaneously, leveraging Kafka for distributed messaging and enhancing overall throughput and efficiency.
What challenges did MaaS face during its MVP phase?
During its MVP phase, MaaS faced issues such as lack of service redundancy, reliance on outdated local services, and a strict two-minute backoff for submissions. These challenges hindered its performance and user experience, prompting the need for architectural improvements.
What metrics are collected by MaaS to improve performance?
MaaS collects various metrics per submission, including the overall success rate of reimage requests and individual runtimes for different actions. This data is used to create dashboards that help monitor performance and identify areas for improvement.
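
The two metrics named in the last answer, success rate of reimage requests and runtime per action, can be illustrated with a small aggregation. This is only a sketch: the `ActionResult` record shape, the field names, and the sample values are assumptions made for illustration, not the schema MaaS actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class ActionResult:
    """One action run against one host (hypothetical record shape)."""
    action: str        # e.g. "reimage", "reboot", "decommission"
    succeeded: bool
    runtime_s: float   # wall-clock runtime in seconds


def success_rate(results, action):
    """Fraction of runs of `action` that succeeded, or None if there were none."""
    runs = [r for r in results if r.action == action]
    if not runs:
        return None
    return sum(r.succeeded for r in runs) / len(runs)


def mean_runtime_by_action(results):
    """Average runtime in seconds for each action type."""
    by_action = defaultdict(list)
    for r in results:
        by_action[r.action].append(r.runtime_s)
    return {action: mean(times) for action, times in by_action.items()}


# Made-up sample data, for illustration only.
results = [
    ActionResult("reimage", True, 1800.0),
    ActionResult("reimage", False, 2400.0),
    ActionResult("reimage", True, 1500.0),
    ActionResult("reboot", True, 300.0),
]
print(success_rate(results, "reimage"))
print(mean_runtime_by_action(results))
```

Numbers like these, computed per submission, are the kind of input a dashboard would plot over time.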

Key Statistics & Figures

Maximum server submissions per day before improvements
72,000
Initially, MaaS could handle at most 72,000 server submissions per day. That cap has since been removed, and there is no longer any limit on the acceptance rate.
Processing time for requests during MVP phase
2 minutes
During the MVP phase, MaaS could process only one request every two minutes, a hard ceiling on throughput that had to be addressed.
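
The two figures above can be checked against each other with a few lines of arithmetic. One request every two minutes gives 720 requests per day, so a daily figure of 72,000 servers would correspond to 100 servers per request. That batch size is an inference from the two numbers, not something stated in this summary.

```python
MINUTES_PER_DAY = 24 * 60
BACKOFF_MINUTES = 2          # from the text: one request every two minutes
DAILY_SERVER_CAP = 72_000    # from the text

submissions_per_day = MINUTES_PER_DAY // BACKOFF_MINUTES
# Inferred, not stated in the source: servers each request would need to carry.
implied_servers_per_submission = DAILY_SERVER_CAP // submissions_per_day

print(submissions_per_day)             # 720
print(implied_servers_per_submission)  # 100
```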

Key Actionable Insights

1. Implementing a self-service API for server management can significantly reduce operational delays.
   By allowing SREs to directly manage server lifecycles, organizations can minimize the back-and-forth communication typically required in ticketing systems, leading to faster resolution times.
2. Leveraging distributed messaging systems like Kafka can enhance the scalability of server management tools.
   Using Kafka allows for handling a high volume of requests without overwhelming the system, ensuring that server operations can be processed efficiently even as demand increases.
3. Regularly monitoring and analyzing performance metrics can lead to continuous improvement in system efficiency.
   By utilizing telemetry data, teams can identify bottlenecks and optimize processes, ultimately improving the user experience and operational reliability.
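
The second insight, decoupling request acceptance from processing through a message queue, can be sketched without a Kafka cluster. The example below uses Python's standard `queue.Queue` as an in-memory stand-in for a Kafka topic. The `Submission` shape, the host names, and the worker count are hypothetical, and nothing here reflects how MaaS itself is implemented.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Submission:
    """A hypothetical batch request: one action applied to several hosts."""
    action: str
    hosts: list


def accept(topic, submission):
    """API side: enqueue and return immediately, with no backoff between calls."""
    topic.put(submission)


def worker(topic, done, lock):
    """Consumer side: pull submissions until a None sentinel arrives."""
    while True:
        item = topic.get()
        if item is None:
            break
        # Placeholder for the real work (reimage, reboot, and so on).
        with lock:
            done.extend((item.action, host) for host in item.hosts)


topic = queue.Queue()   # in-memory stand-in for a Kafka topic (assumption)
done, lock = [], threading.Lock()
workers = [threading.Thread(target=worker, args=(topic, done, lock)) for _ in range(3)]
for w in workers:
    w.start()

accept(topic, Submission("reboot", ["host-a", "host-b"]))
accept(topic, Submission("reimage", ["host-c"]))

for _ in workers:
    topic.put(None)     # one sentinel per worker so each one exits
for w in workers:
    w.join()

print(sorted(done))
```

Because `accept` only enqueues, callers never wait on the slow work, and throughput can be raised by adding consumers rather than by changing the API.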

Common Pitfalls

1. Relying on a single deployment node can lead to service interruptions during failures.
   If the primary node fails, manual intervention is required to restore service. Moving to an active-active deployment scheme removes that single point of failure.
2. Using plain HTTP for API interactions exposes sensitive data to interception.
   Without HTTPS, credentials are transmitted in plaintext, where anyone on the network path can read them. Implementing HTTPS is crucial for securing API communications.
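
Both pitfalls can be guarded against on the client side. The following is a minimal sketch, assuming a client that knows several interchangeable API endpoints. The endpoint URLs and the health check are placeholders, and the function names are invented for this example.

```python
from urllib.parse import urlsplit


def require_https(url):
    """Return url unchanged if it uses HTTPS; otherwise raise ValueError."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"refusing to send credentials over non-HTTPS URL: {url}")
    return url


def pick_endpoint(endpoints, is_healthy):
    """Return the first HTTPS endpoint that passes the health check."""
    for url in endpoints:
        if is_healthy(require_https(url)):
            return url
    raise RuntimeError("no healthy endpoint available")


# Hypothetical endpoints; the lambda is a stub standing in for a real probe.
endpoints = ["https://maas-1.example.com", "https://maas-2.example.com"]
down = {"https://maas-1.example.com"}
chosen = pick_endpoint(endpoints, lambda url: url not in down)
print(chosen)
```

With more than one active node behind the list, losing the first endpoint no longer requires manual intervention, and the scheme check keeps credentials from ever being sent over an unencrypted connection.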

Related Concepts

Server Lifecycle Management
Distributed Systems
API Design and Implementation
Performance Monitoring and Metrics