Building an Auto-Scaling Lambda based on Github's Workflow Job Queue

Misha Shiryaev

ClickHouse

AWS, GitHub Actions

Overview

This article describes an AWS Lambda function that auto-scales GitHub Actions runners based on GitHub's workflow job queue, with the job data stored in ClickHouse. It covers the move from a reactive system to a proactive one for managing GitHub runners, which launches jobs faster and saves resources.

What You'll Learn

1. How to implement an auto-scaling Lambda function for GitHub workflow jobs

2. Why proactive scaling improves job responsiveness in CI/CD pipelines

3. When to scale up or down based on job queue metrics

Prerequisites & Requirements

Understanding of AWS Lambda and auto-scaling concepts
Familiarity with ClickHouse and GitHub Actions (optional)

Key Questions Answered

How does the new auto-scaling system improve job launching?

The new system allows for proactive scaling of GitHub runners based on real-time job queue data, significantly reducing the time needed to respond to increased demand. This results in faster job launches and more efficient resource usage, as runners are scaled up or down based on current workload metrics.

What metrics are used to determine scaling actions?

The scaling actions are based on the current number of runners, jobs in progress, and the queue size. A deficit in runners leads to an increase in capacity, while unnecessary reserves are reduced, optimizing resource allocation.
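
The summary does not give the exact formula, so the following is only a minimal sketch of how the three metrics could be combined. The names `QueueSnapshot`, `desired_capacity`, `scaling_delta`, and the `min_size`, `max_size`, and `reserve` parameters are assumptions made for illustration, not part of the original system.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    runners: int      # current number of runners
    in_progress: int  # jobs currently running
    queued: int       # jobs waiting for a runner


def desired_capacity(snap: QueueSnapshot, min_size: int = 0,
                     max_size: int = 100, reserve: int = 0) -> int:
    """Assumed rule: one runner per running or queued job, plus a reserve."""
    needed = snap.in_progress + snap.queued + reserve
    # Clamp to the allowed range (hypothetical limits).
    return max(min_size, min(max_size, needed))


def scaling_delta(snap: QueueSnapshot, **limits: int) -> int:
    """Positive means a runner deficit, negative means an unneeded reserve."""
    return desired_capacity(snap, **limits) - snap.runners


deficit = scaling_delta(QueueSnapshot(runners=10, in_progress=8, queued=5))
surplus = scaling_delta(QueueSnapshot(runners=20, in_progress=4, queued=0))
```

Here `deficit` is 3 (add capacity) and `surplus` is -16 (remove the unneeded reserve). A real system would likely pick a non-zero reserve and limits that match its runner groups.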

What was the limitation of the old scaling system?

The old system was reactive: it scaled up only after runners had been busy for more than 5 minutes, and warming up a large number of runners often took hours. This resulted in slower job launches and inefficient resource usage.

Key Statistics & Figures

Percentage of busy runners for scaling up

More than 97%

This threshold was used in the old system to trigger the addition of more runners.

Percentage of busy runners for scaling down

Less than 70%

This threshold was used in the old system to trigger the removal of runners.
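
For comparison, the old threshold rule can be sketched roughly as below. Only the two percentages come from the summary; the function name `reactive_decision`, its return values, and the single-sample check are assumptions. The requirement that runners stay busy for more than 5 minutes before scaling up is deliberately not modeled here.

```python
def reactive_decision(busy: int, total: int,
                      up_threshold: float = 0.97,
                      down_threshold: float = 0.70) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' from the busy-runner share."""
    if total <= 0:
        # No runners to measure; this sketch chooses to do nothing.
        return "hold"
    busy_ratio = busy / total
    if busy_ratio > up_threshold:
        return "scale_up"    # more than 97% of runners are busy
    if busy_ratio < down_threshold:
        return "scale_down"  # less than 70% of runners are busy
    return "hold"
```

Note that this rule never looks at the queue: it reacts only once existing runners are saturated, which is the delay the proactive approach avoids.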

Technologies & Tools

Database

ClickHouse

Used to store and query workflow job data from GitHub Actions.

Backend

AWS Lambda

Used for implementing the auto-scaling functionality based on job queue metrics.
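
A Lambda entry point for this kind of scaling might be shaped like the sketch below. It is an illustration under assumptions, not the original implementation: `fetch_queue_metrics` and `set_capacity` are hypothetical placeholders for a ClickHouse query and an AWS Auto Scaling call (for example, boto3's `set_desired_capacity`), and the metric keys and the one-runner-per-job target are invented for the example.

```python
from typing import Callable, Dict

Metrics = Dict[str, Dict[str, int]]


def fetch_queue_metrics() -> Metrics:
    """Hypothetical placeholder: would query ClickHouse for per-group job counts."""
    raise NotImplementedError("wire this to a ClickHouse client")


def set_capacity(group: str, size: int) -> None:
    """Hypothetical placeholder: would call the AWS Auto Scaling API."""
    raise NotImplementedError("wire this to an AWS client")


def handler(event, context,
            fetch: Callable[[], Metrics] = fetch_queue_metrics,
            apply: Callable[[str, int], None] = set_capacity) -> Dict[str, int]:
    """Lambda-style entry point: resize each runner group to match its workload."""
    changes: Dict[str, int] = {}
    for group, m in fetch().items():
        # Assumed target: one runner per running or queued job.
        target = m["in_progress"] + m["queued"]
        if target != m["runners"]:
            apply(group, target)
            changes[group] = target
    return changes
```

Passing `fetch` and `apply` as parameters keeps the decision logic testable without AWS or ClickHouse access; Lambda itself calls `handler(event, context)` and the defaults take over.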

Key Actionable Insights

1. Implement a proactive scaling strategy based on real-time metrics to optimize resource usage.
By analyzing job queue data, teams can adjust their runner capacity dynamically, leading to faster job execution and reduced costs.

2. Utilize ClickHouse for efficient data storage and querying of GitHub Actions metrics.
ClickHouse's capabilities allow for quick insights into workflow job statuses, enabling better decision-making for scaling actions.

3. Regularly review and adjust scaling thresholds to align with changing workload patterns.
As project demands evolve, maintaining optimal scaling thresholds ensures that resources are neither over-provisioned nor under-utilized.

Common Pitfalls

1. Relying solely on reactive scaling can lead to inefficient resource usage.

A reactive system acts only after a utilization threshold has been crossed, which delays job execution on the way up and wastes compute on the way down. A proactive approach adjusts capacity as soon as real-time queue data shows a change in demand.

Related Concepts

Auto-scaling Strategies

CI/CD Optimization

GitHub Actions Management
