Building an Auto-Scaling Lambda Based on GitHub's Workflow Job Queue

Misha Shiryaev
3 min read, beginner

Overview

This article discusses the implementation of an auto-scaling Lambda function based on GitHub's workflow job queue using ClickHouse. It highlights the transition from a reactive system to a more proactive approach in managing GitHub runners, resulting in improved job launching responsiveness and resource savings.

What You'll Learn

1. How to implement an auto-scaling Lambda function for GitHub workflow jobs
2. Why proactive scaling improves job responsiveness in CI/CD pipelines
3. When to scale up or down based on job queue metrics

Prerequisites & Requirements

  • Understanding of AWS Lambda and auto-scaling concepts
  • Familiarity with ClickHouse and GitHub Actions (optional)

Key Questions Answered

How does the new auto-scaling system improve job launching?
The new system scales GitHub runners proactively, using real-time job queue data, so it responds to increased demand much sooner than before. Jobs launch faster and resources are used more efficiently, because runners are added or removed according to current workload metrics.

What metrics are used to determine scaling actions?
Scaling actions are based on the current number of runners, the number of jobs in progress, and the queue size. A deficit in runners leads to an increase in capacity, while unnecessary reserves are reduced.

What was the limitation of the old scaling system?
The old system was reactive: it scaled up only after runners had stayed busy for more than 5 minutes, and warming up a large number of runners often took hours. This resulted in slower job launches and inefficient resource usage.
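
The scaling metrics described above (current runners, jobs in progress, queue size) can be combined into a capacity decision. The following is a minimal sketch, not the article's actual Lambda code: the names `QueueSnapshot`, `desired_capacity`, and `scaling_delta`, the `reserve` parameter, and the rule "demand = in progress + queued + reserve" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical container for the three metrics named in the text."""
    runners: int      # current number of runners in the group
    in_progress: int  # jobs currently running
    queued: int       # jobs waiting for a runner


def desired_capacity(snapshot: QueueSnapshot, min_size: int = 0,
                     max_size: int = 100, reserve: int = 0) -> int:
    """Assumed rule: one runner per running or queued job, plus a reserve,
    clamped to the group's size limits."""
    demand = snapshot.in_progress + snapshot.queued + reserve
    return max(min_size, min(max_size, demand))


def scaling_delta(snapshot: QueueSnapshot, **limits: int) -> int:
    """Positive result: runner deficit (scale up). Negative: excess (scale down)."""
    return desired_capacity(snapshot, **limits) - snapshot.runners


# A deficit: 10 runners, all busy, 5 jobs still queued.
deficit = scaling_delta(QueueSnapshot(runners=10, in_progress=10, queued=5))
# An excess: 20 runners, only 5 jobs running, nothing queued.
excess = scaling_delta(QueueSnapshot(runners=20, in_progress=5, queued=0))
```

In a real deployment the snapshot would presumably be filled from the job queue data and the result applied to the runner group; both steps are left out here because the source does not describe them.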

Key Statistics & Figures

Percentage of busy runners for scaling up: more than 97%
This threshold was used in the old system to trigger the addition of more runners.

Percentage of busy runners for scaling down: less than 70%
This threshold was used in the old system to trigger the removal of runners.
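
For comparison, the old threshold rule can be sketched as a simple function of the busy ratio. This is an illustrative sketch only: the function name `reactive_action` and its return labels are hypothetical, and the requirement that the busy state persist for more than 5 minutes before scaling up is not modeled.

```python
def reactive_action(busy: int, total: int,
                    up_threshold: float = 0.97,
                    down_threshold: float = 0.70) -> str:
    """Return the action the old-style threshold rule would take.

    Thresholds follow the figures quoted above: more than 97% busy
    adds runners, less than 70% busy removes them.
    """
    if total <= 0:
        # No runners to measure; a real system would need a separate rule here.
        return "hold"
    ratio = busy / total
    if ratio > up_threshold:
        return "scale_up"
    if ratio < down_threshold:
        return "scale_down"
    return "hold"
```

Note that this rule never looks at the queue: with 90 of 100 runners busy it holds steady no matter how many jobs are waiting, which is the gap the queue-based approach addresses.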

Key Actionable Insights

1. Implement a proactive scaling strategy based on real-time metrics to optimize resource usage.
   By analyzing job queue data, teams can adjust their runner capacity dynamically, leading to faster job execution and reduced costs.
2. Utilize ClickHouse for efficient data storage and querying of GitHub Actions metrics.
   ClickHouse's capabilities allow for quick insights into workflow job statuses, enabling better decision-making for scaling actions.
3. Regularly review and adjust scaling thresholds to align with changing workload patterns.
   As project demands evolve, maintaining optimal scaling thresholds ensures that resources are neither over-provisioned nor under-utilized.

Common Pitfalls

1. Relying solely on reactive scaling can lead to inefficient resource usage.
   Waiting for utilization thresholds to be crossed delays job execution when demand rises and wastes compute resources when it falls. A proactive approach allows timely adjustments based on real-time data.

Related Concepts

Auto-scaling Strategies
CI/CD Optimization
GitHub Actions Management