Scaling AutoBuild: Our Journey Towards Delivering An Enhanced Customer Experience

LinkedIn Engineering Team

18 min read • advanced

Flask, Grafana, HTTPS, MySQL, Prometheus, RabbitMQ

Overview

The article discusses the evolution of AutoBuild, LinkedIn's automated server lifecycle management system, highlighting the challenges faced with its initial architecture and the improvements made to enhance its scalability and reliability. It details the transition to a new architecture that supports parallel processing and prioritization of requests, ultimately leading to a more resilient infrastructure.

What You'll Learn

1. How to implement a horizontally scaled architecture for server management
2. Why prioritizing requests improves server response times
3. How to leverage MySQL for job processing in a distributed system

Prerequisites & Requirements

Understanding of server lifecycle management concepts
Familiarity with MySQL and RESTful APIs (optional)

Key Questions Answered

How does AutoBuild manage server lifecycle tasks?

AutoBuild operates through a RESTful API that allows for actions like building, rebooting, and decommissioning servers. It uses Active Directory-based authentication and a CRUD model to manage requests, ensuring efficient server lifecycle management.

What were the main issues with AutoBuild's initial architecture?

The initial architecture struggled with sequential processing, leading to message queue crashes and lost requests. It also lacked prioritization for urgent tasks, causing delays in reimaging requests, which were critical for user experience.

What improvements were made in the v2 architecture of AutoBuild?

The v2 architecture introduced parallel processing of requests, reduced reliance on external messaging queues, and implemented job prioritization. This allowed AutoBuild to scale more effectively and handle increased traffic demands.

How does the new architecture ensure high availability?

The new architecture uses clusters of nodes for AutoBuild servers, allowing for redundancy and failover capabilities. This setup minimizes single points of failure and ensures continuous operation even if individual nodes fail.

Key Statistics & Figures

Maximum number of concurrent requests processed

~3K per day per data center

This statistic reflects the current operational capacity of AutoBuild under the new architecture.

Expected reimage count based on internal experimentation

>60K for ~100 hosts in parallel

This demonstrates the system's capability to handle large volumes of requests simultaneously.

Technologies & Tools

Database

MySQL

Used as a relational datastore for persisting incoming data and managing job processing.

Backend

Flask

Serves as the web framework for AutoBuild's API.

Backend

Gunicorn

Acts as the WSGI server for running the Flask application.

Messaging

RabbitMQ

Previously used for message queuing; phased out in the new architecture.

Monitoring

Prometheus

Used for gathering metrics and telemetry data from the AutoBuild system.

Monitoring

Grafana

Provides visualization capabilities for the metrics collected by Prometheus.

Key Actionable Insights

1. Implementing a worker-pool management system can significantly enhance the reliability of server processes.
By ensuring that processes are automatically restarted upon failure, you can maintain consistent performance and reduce downtime in server management tasks.
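
The restart-on-failure idea can be sketched in a few lines of Python. This is a minimal illustration, not AutoBuild's implementation: the `WorkerPool` class, its `restart_dead` method, and the injectable `process_factory` parameter are all hypothetical names introduced here.

```python
import multiprocessing as mp
from typing import Callable, List


class WorkerPool:
    """Keeps a fixed number of worker processes alive (illustrative only)."""

    def __init__(self, size: int, target: Callable[[int], None],
                 process_factory=mp.Process) -> None:
        self.size = size
        self.target = target
        # Assumption: the factory is injectable so the restart logic can be
        # exercised without spawning real processes.
        self.process_factory = process_factory
        self.workers: List = []

    def _spawn(self, slot: int):
        # Start one worker for the given slot and return its handle.
        proc = self.process_factory(target=self.target, args=(slot,))
        proc.start()
        return proc

    def start(self) -> None:
        self.workers = [self._spawn(slot) for slot in range(self.size)]

    def restart_dead(self) -> int:
        """Replace every worker that has exited; return how many were restarted."""
        restarted = 0
        for slot, proc in enumerate(self.workers):
            if not proc.is_alive():
                proc.join()  # reap the exited process before replacing it
                self.workers[slot] = self._spawn(slot)
                restarted += 1
        return restarted
```

A supervising loop would call `restart_dead()` periodically, for example once per second, and could export the returned count as a metric.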

2. Prioritizing requests based on urgency can lead to improved response times for critical operations.
This approach ensures that high-priority tasks, such as reimaging servers, are completed faster, enhancing overall user satisfaction and operational efficiency.
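
One way to picture urgency-based ordering is an in-memory priority queue. The sketch below assumes a hypothetical `PRIORITY` mapping in which reimage requests rank first; the source says reimaging is urgent but gives no actual priority levels, and the `Job` and `JobQueue` names are illustrative.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical priority levels: a lower number means more urgent.
PRIORITY = {"reimage": 0, "reboot": 1, "decommission": 2}


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker that keeps FIFO order within one priority level
    action: str = field(compare=False)
    host: str = field(compare=False)


class JobQueue:
    def __init__(self) -> None:
        self._heap: List[Job] = []
        self._counter = itertools.count()

    def submit(self, action: str, host: str) -> None:
        job = Job(PRIORITY[action], next(self._counter), action, host)
        heapq.heappush(self._heap, job)

    def next_job(self) -> Optional[Job]:
        # Pop the most urgent job, or return None when the queue is empty.
        return heapq.heappop(self._heap) if self._heap else None
```

The sequence counter matters: without it, two jobs of equal priority would have no defined order, and a later request could overtake an earlier one.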

3. Leveraging MySQL's NOWAIT and SKIP LOCKED features can optimize job processing in a concurrent environment.
These features help manage job locks effectively, preventing race conditions and ensuring that multiple processes can operate on different jobs without conflict.
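
A common pattern built on these clauses is for each worker to claim one row with `SELECT ... FOR UPDATE SKIP LOCKED`, so rows already locked by other workers are skipped rather than waited on. The sketch below assumes MySQL 8.0 or later, a PEP 249 (DB-API) connection with autocommit disabled, and a hypothetical `jobs` table with `id`, `action`, `host`, `status`, and `priority` columns; the source does not describe AutoBuild's schema or queries.

```python
from typing import Any, Optional, Tuple

# Hypothetical table and column names; the real schema is not given.
CLAIM_SQL = (
    "SELECT id, action, host FROM jobs "
    "WHERE status = 'queued' "
    "ORDER BY priority, id "
    "LIMIT 1 "
    "FOR UPDATE SKIP LOCKED"
)
MARK_SQL = "UPDATE jobs SET status = 'running' WHERE id = %s"


def claim_next_job(conn: Any) -> Optional[Tuple]:
    """Claim one queued job in a single transaction, or return None."""
    cur = conn.cursor()
    try:
        # Lock the first queued row that no other worker currently holds.
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        if row is None:
            conn.rollback()  # nothing available; end the transaction
            return None
        # Mark the job as taken while the row lock is still held.
        cur.execute(MARK_SQL, (row[0],))
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Replacing `SKIP LOCKED` with `NOWAIT` changes the behavior: instead of skipping a locked row, the statement fails immediately with an error, which suits cases where a worker needs one specific row and should not block on it.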

Common Pitfalls

1. Relying heavily on external messaging systems can create bottlenecks and single points of failure.

In the initial architecture, RabbitMQ caused crashes and lost messages, which hindered overall system performance. Transitioning to a more integrated approach with MySQL helped mitigate these issues.

Related Concepts

Server Lifecycle Management

Distributed Systems

Job Processing And Prioritization

High Availability Architectures