Infrastructure Observability for Changing the Spend Curve

Slack is an integral part of where work happens for teams across the world, and our work in the Core Development Engineering department supports engineers throughout Slack who develop, build, test, and release high-quality services to Slack’s customers. In this article, we share how teams at Slack evolved our internal tooling and made infrastructure bets.…

Frank Chen

Overview

The article discusses how Slack achieved a significant reduction in infrastructure spending through improved observability and changes in their Continuous Integration (CI) infrastructure. It highlights the strategies implemented to enhance efficiency and reduce costs by an order of magnitude over two years.

What You'll Learn

1
How to leverage observability to improve CI infrastructure efficiency
2
Why adaptive capacity can reduce CI costs significantly
3
How to implement circuit breakers in CI workflows to enhance stability

Prerequisites & Requirements

  • Understanding of Continuous Integration concepts
  • Familiarity with CI/CD tools like Jenkins (optional)

Key Questions Answered

How did Slack achieve a 10x reduction in infrastructure spending?
Slack reduced its CI infrastructure spending by roughly 10x relative to baseline growth over two years. It did so by instrumenting the CI infrastructure for observability and using that data to drive iterative changes: adaptive capacity, circuit breakers, and pipeline changes that optimize resource usage.
What role do circuit breakers play in CI workflows?
Circuit breakers in CI workflows help prevent cascading failures by stopping calls to faulty systems and deferring work until the system recovers. This approach stabilizes throughput and reduces developer frustration by minimizing flaky test executions.
What are the challenges faced by Slack's CI infrastructure?
Slack's CI infrastructure faced challenges such as increased complexity, fuzzy service boundaries, and systems stretched to their limits due to rapid growth. These challenges necessitated a reevaluation of internal tools and infrastructure strategies to maintain efficiency.
What is the significance of adaptive capacity in CI?
Adaptive capacity raises throughput and lowers cost by controlling how much infrastructure runs and for how long. By oversubscribing executors and updating instance types, Slack improved performance while decreasing error rates.
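The effect of oversubscribing executors on fleet size can be sketched with a little arithmetic. This is a minimal illustration, not Slack's implementation: it assumes that "oversubscription" means running more executors on an instance than it has vCPUs, and the function names, the ratio parameter, and the instance limits are all hypothetical.

```python
import math


def executors_per_instance(vcpus: int, oversubscription: float) -> int:
    """Executors to run on one instance (hypothetical model).

    An oversubscription ratio of 1.0 means one executor per vCPU;
    1.5 means 50% more executors than vCPUs.
    """
    return max(1, math.floor(vcpus * oversubscription))


def instances_needed(
    pending_jobs: int,
    vcpus: int,
    oversubscription: float,
    min_instances: int = 1,
    max_instances: int = 100,
) -> int:
    """Instances required to give every pending job an executor.

    The result is clamped to [min_instances, max_instances], both of
    which are illustrative limits rather than values from the article.
    """
    per_instance = executors_per_instance(vcpus, oversubscription)
    wanted = math.ceil(pending_jobs / per_instance)
    return min(max(wanted, min_instances), max_instances)


if __name__ == "__main__":
    # Same queue of 100 jobs on 16-vCPU instances, with and without oversubscription.
    print(instances_needed(100, vcpus=16, oversubscription=1.0))
    print(instances_needed(100, vcpus=16, oversubscription=1.5))
```

Under this model, a higher ratio shrinks the fleet needed for the same queue. Whether jobs still finish quickly at that ratio depends on how CPU-bound they are, which is what the observability data would need to confirm.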

Key Statistics & Figures

Cost reduction
10x reduction compared to baseline growth
This reduction was achieved over the last two years through iterative changes in Slack's CI infrastructure.
Error rate reduction
approximately 50%
This reduction was achieved through strategies implemented to increase overall throughput and reduce errors at peak usage.
Cost savings from instance type updates
approximately 10%
This was expected from updating to newer AWS compute-optimized instances.

Technologies & Tools

CI/CD Tool
Jenkins
Used for orchestrating build and test workflows in Slack's CI infrastructure.
CI Orchestration Tool
Checkpoint
An internally developed service that orchestrates CI and Continuous Deployment (CD) workflows.

Key Actionable Insights

1
Implement observability practices across your CI infrastructure to gather actionable metrics.
By focusing on observability, teams can identify bottlenecks and inefficiencies, leading to informed decisions that enhance performance and reduce costs.
2
Utilize circuit breakers to manage downstream service failures effectively.
Incorporating circuit breakers can prevent cascading failures in CI workflows, improving overall system stability and reducing developer frustration.
3
Adopt an adaptive capacity strategy to optimize resource utilization during peak workloads.
By understanding workload patterns and adjusting resource allocation accordingly, teams can significantly reduce costs while maintaining high performance.

Common Pitfalls

1
Failing to monitor and analyze CI performance metrics can lead to inefficiencies.
Without proper observability, teams may overlook critical bottlenecks and continue to incur unnecessary costs and delays.
2
Neglecting to implement circuit breakers can result in cascading failures during peak usage.
This can lead to significant downtime and frustration among developers, as they may need to manually re-run failed tests.
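
The circuit breaker behavior described above (stop calling a faulty system, defer the work, resume once it recovers) can be sketched as follows. This is a generic illustration of the pattern rather than the code used in Checkpoint or Jenkins. The class, the `run_or_defer` helper, the threshold, and the cooldown are all assumptions made for the example.

```python
import time
from typing import Any, Callable, List, Optional


class CircuitBreaker:
    """Opens after repeated failures and allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: permit a trial call only once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Open the breaker, or restart the cooldown after a failed trial call.
            self.opened_at = self.clock()


def run_or_defer(
    breaker: CircuitBreaker,
    job: Callable[[], Any],
    deferred: List[Callable[[], Any]],
) -> Any:
    """Run the job if the breaker allows it; otherwise queue it for later."""
    if not breaker.allow_request():
        deferred.append(job)
        return None
    try:
        result = job()
    except Exception:
        breaker.record_failure()
        deferred.append(job)
        return None
    breaker.record_success()
    return result
```

Deferred jobs stay in the list until a caller replays them. In a CI setting, that replay would take the place of a developer re-running failed tests by hand.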

Related Concepts

Continuous Integration
Observability
Infrastructure Optimization
CI/CD Best Practices