Infrastructure Observability for Changing the Spend Curve

Frank Chen

Slack is an integral part of where work happens for teams across the world, and our work in the Core Development Engineering department supports engineers throughout Slack who develop, build, test, and release high-quality services to Slack’s customers. In this article, we share how teams at Slack evolved our internal tooling and made infrastructure bets.…

Slack


AWS, Chef, Jenkins, PHP

Overview

The article discusses how Slack achieved a significant reduction in infrastructure spending through improved observability and changes in their Continuous Integration (CI) infrastructure. It highlights the strategies implemented to enhance efficiency and reduce costs by an order of magnitude over two years.

What You'll Learn

1. How to leverage observability to improve CI infrastructure efficiency
2. Why adaptive capacity can reduce CI costs significantly
3. How to implement circuit breakers in CI workflows to enhance stability

Prerequisites & Requirements

Understanding of Continuous Integration concepts
Familiarity with CI/CD tools like Jenkins (optional)

Key Questions Answered

How did Slack achieve a 10x reduction in infrastructure spending?

Slack achieved a 10x reduction in infrastructure spending by implementing observability practices and making data-driven decisions in their CI infrastructure. They focused on adaptive capacity, circuit breakers, and pipeline changes to optimize resource usage and improve efficiency.

What role do circuit breakers play in CI workflows?

Circuit breakers in CI workflows help prevent cascading failures by stopping calls to faulty systems and deferring work until the system recovers. This approach stabilizes throughput and reduces developer frustration by minimizing flaky test executions.
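The behaviour described in this answer can be sketched in a few lines of Python. This is a minimal illustration, not Slack's implementation: the class name `CircuitBreaker`, the `failure_threshold` and `recovery_seconds` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker guarding calls to one downstream dependency."""

    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        # Open means: tripped, and the recovery window has not elapsed yet.
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.recovery_seconds

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.is_open:
            # Stop calling the faulty system; the caller should defer the work.
            raise CircuitOpenError("downstream unhealthy; defer this job")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A CI scheduler using something like this would catch `CircuitOpenError` and requeue the job instead of running tests against a dependency that is known to be failing. Once the recovery window passes, the next call acts as a trial: a success closes the breaker, and a failure trips it again immediately.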

What are the challenges faced by Slack's CI infrastructure?

Slack's CI infrastructure faced challenges such as increased complexity, fuzzy service boundaries, and systems stretched to their limits due to rapid growth. These challenges necessitated a reevaluation of internal tools and infrastructure strategies to maintain efficiency.

What is the significance of adaptive capacity in CI?

Adaptive capacity is significant in CI as it allows for increased throughput and reduced costs by managing infrastructure runtime effectively. By oversubscribing executors and optimizing instance types, Slack was able to improve performance while decreasing error rates.
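To make the oversubscription idea concrete, here is a rough capacity calculation. It is a sketch under stated assumptions: the summary does not say how Slack sizes executors, so the `oversubscription` ratio (executors per vCPU), the `InstanceType` fields, and the function names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical description of one CI worker instance type."""
    name: str
    vcpus: int
    hourly_cost: float  # placeholder currency units, not real pricing


def executors_per_instance(instance: InstanceType, oversubscription: float) -> int:
    # Assumption: oversubscription is the number of executors scheduled per vCPU.
    # A ratio above 1.0 bets that jobs rarely saturate every core at once.
    if oversubscription <= 0:
        raise ValueError("oversubscription must be positive")
    return max(1, math.floor(instance.vcpus * oversubscription))


def instances_needed(
    peak_concurrent_jobs: int, instance: InstanceType, oversubscription: float
) -> int:
    # Round up so the fleet can always hold the stated peak.
    per_instance = executors_per_instance(instance, oversubscription)
    return math.ceil(peak_concurrent_jobs / per_instance)


def hourly_fleet_cost(
    peak_concurrent_jobs: int, instance: InstanceType, oversubscription: float
) -> float:
    count = instances_needed(peak_concurrent_jobs, instance, oversubscription)
    return count * instance.hourly_cost
```

Under these assumptions, raising the ratio from 1.0 to 1.5 on a 16-vCPU instance cuts the fleet needed for 240 concurrent jobs from 15 instances to 10. Whether that ratio is safe depends on observed CPU and memory headroom, which is why the article pairs this change with observability.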

Key Statistics & Figures

Cost reduction: 10x reduction compared to baseline growth
This reduction was achieved over the last two years through iterative changes in Slack's CI infrastructure.

Error rate reduction: approximately 50%
This reduction was achieved through strategies implemented to increase overall throughput and reduce errors at peak usage.

Cost savings from instance type updates: approximately 10%
This was expected from updating to newer AWS compute-optimized instances.

Technologies & Tools


CI/CD Tool

Jenkins

Used for orchestrating build and test workflows in Slack's CI infrastructure.

CI Orchestration Tool

Checkpoint

An internally developed service that orchestrates CI and Continuous Deployment (CD) workflows.

Key Actionable Insights

1. Implement observability practices across your CI infrastructure to gather actionable metrics.
   By focusing on observability, teams can identify bottlenecks and inefficiencies, leading to informed decisions that enhance performance and reduce costs.

2. Utilize circuit breakers to manage downstream service failures effectively.
   Incorporating circuit breakers can prevent cascading failures in CI workflows, improving overall system stability and reducing developer frustration.

3. Adopt an adaptive capacity strategy to optimize resource utilization during peak workloads.
   By understanding workload patterns and adjusting resource allocation accordingly, teams can significantly reduce costs while maintaining high performance.
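As one illustration of the first insight, the sketch below turns raw CI job records into a handful of actionable numbers. The `JobRecord` fields and the choice of metrics (queue-time percentiles, failure rate, executor hours) are assumptions for the example; the summary does not list the metrics Slack tracks.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, Sequence


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical record emitted once per finished CI job."""
    queue_seconds: float  # time spent waiting for an executor
    run_seconds: float    # time spent executing on an executor
    passed: bool


def summarize(jobs: Sequence[JobRecord]) -> Dict[str, float]:
    if len(jobs) < 2:
        raise ValueError("need at least two jobs to compute percentiles")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles([j.queue_seconds for j in jobs], n=100, method="inclusive")
    failures = sum(1 for j in jobs if not j.passed)
    return {
        "p50_queue_seconds": cuts[49],
        "p95_queue_seconds": cuts[94],
        "failure_rate": failures / len(jobs),
        # Executor hours are a rough proxy for compute spend.
        "executor_hours": sum(j.run_seconds for j in jobs) / 3600,
    }
```

Tracking a tail percentile of queue time next to executor hours shows both sides of a capacity change: whether developers wait longer, and whether spend actually went down.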

Common Pitfalls

1. Failing to monitor and analyze CI performance metrics can lead to inefficiencies.
   Without proper observability, teams may overlook critical bottlenecks and continue to incur unnecessary costs and delays.

2. Neglecting to implement circuit breakers can result in cascading failures during peak usage.
   This can lead to significant downtime and frustration among developers, as they may need to manually re-run failed tests.

Related Concepts

Continuous Integration

Observability

Infrastructure Optimization

CI/CD Best Practices
