Scaling Salt for Remote Execution to support LinkedIn Infra growth

LinkedIn Engineering Team

Overview

The article discusses how LinkedIn scaled its Salt infrastructure to support its growing needs for remote execution jobs, achieving a tenfold increase in job capacity and improved reliability. It details the architectural changes made, including the introduction of new products and a restructured Salt ecosystem.

What You'll Learn

1. How to scale Salt infrastructure for remote execution jobs
2. Why using a master-minion architecture is beneficial for task automation
3. How to implement custom modules for enhanced functionality in Salt

Prerequisites & Requirements

  • Understanding of Salt architecture and its components
  • Familiarity with Python and REST APIs (optional)

Key Questions Answered

How did LinkedIn scale its Salt infrastructure for remote execution?
LinkedIn scaled its Salt infrastructure by restructuring its architecture, introducing multiple li-salt-master instances, and integrating custom modules to enhance performance. This allowed the system to handle over 15,000 remote execution jobs daily across its server fleet, significantly improving reliability and scalability.
What challenges did LinkedIn face with its previous Salt setup?
The previous Salt setup faced challenges such as high load on a single master handling over 60,000 minions, leading to downtime and operational inefficiencies. Issues included poor code coverage, manual failover management, and complex configurations that hindered performance.
What are the new products developed for Salt at LinkedIn?
LinkedIn developed five new Python multiproducts, including li-salt-master for orchestrating minions and exposing REST APIs, and li-minion, an installable agent that configures itself on hosts. These products enhance the Salt ecosystem's functionality and security.
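
One way to picture spreading minions across several master instances is a stable hash of the minion ID. The sketch below is illustrative only: the source does not describe how LinkedIn assigns minions to li-salt-master instances, and the master names and the `assign_master` and `load_per_master` helpers are hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical master names; the article does not list real hostnames.
MASTERS = ["li-salt-master-1", "li-salt-master-2", "li-salt-master-3"]


def assign_master(minion_id, masters=MASTERS):
    """Map a minion ID to one master with a stable hash.

    This is an assumed scheme for illustration, not LinkedIn's documented one.
    """
    digest = hashlib.sha256(minion_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(masters)
    return masters[index]


def load_per_master(minion_ids, masters=MASTERS):
    """Count how many minions land on each master."""
    return Counter(assign_master(m, masters) for m in minion_ids)
```

A plain modulo hash like this keeps assignments deterministic, but it remaps most minions whenever the number of masters changes; a consistent-hashing ring would be the usual refinement if that matters.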

Key Statistics & Figures

Increase in remote execution jobs
10x
Achieved by scaling the Salt infrastructure to support more jobs with improved reliability.
Number of remote jobs executed daily
15,000
The new architecture supports executing over 15,000 remote jobs across LinkedIn's fleet of servers.
Minions per master in old setup
65,000
The previous single master setup managed nearly 65,000 minions, leading to performance issues.

Key Actionable Insights

1. Integrate Salt with existing CI/CD pipelines to streamline deployment workflows.
By leveraging Salt's capabilities within CI/CD processes, teams can automate configuration management and deployment tasks, reducing manual errors and increasing deployment speed.
2. Implement custom modules to enhance Salt's functionality for specific use cases.
Custom modules can address unique operational challenges and improve the overall performance of the Salt infrastructure, making it more adaptable to LinkedIn's evolving needs.
3. Monitor Salt performance metrics using a centralized logging system.
Utilizing tools like Apache Kafka for log streaming allows for real-time monitoring and analysis of Salt operations, enabling proactive issue resolution.

Common Pitfalls

1. Overloading a single Salt master with too many minions can lead to performance degradation.
This occurs when the master cannot handle the load, resulting in downtime and operational challenges. Distributing the load across multiple masters can mitigate this issue.
2. Neglecting security measures for client modules can expose vulnerabilities.
Without proper security checks and module ownership, there is a risk of executing unsafe code. Implementing strict ACLs and security audits can help ensure safe operations.
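
The idea of a strict ACL can be sketched as a deny-by-default allowlist that maps each user to the module functions they may run. This is a standalone illustration, not LinkedIn's implementation: the user names, the `ACL` table, and the `is_allowed` helper are hypothetical. Salt also provides its own `publisher_acl` master setting for this purpose.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: user -> glob patterns of allowed module functions.
ACL = {
    "deploy-bot": ["test.ping", "disk.*"],
    "oncall": ["test.*", "service.status"],
}


def is_allowed(user, function, acl=ACL):
    """Return True only if a pattern listed for `user` matches `function`.

    Users missing from the table get an empty pattern list, so they are denied.
    """
    return any(fnmatchcase(function, pattern) for pattern in acl.get(user, []))
```

Denying by default means a newly added module function stays unreachable until someone explicitly grants it, which pairs naturally with module ownership and security review.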

Related Concepts

Salt Architecture And Design Patterns
Remote Execution Strategies
CI/CD Integration Best Practices