Deploy Safety: Reducing customer impact from change

Sam Bailey

It’s mid-2023 and we’ve identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward. We’re a year and a half into the Deploy Safety Program at Slack, improving the way we deploy, uplifting our safety culture and continuing…

Slack


AWS, Chef, Jenkins, Kubernetes, Python, Solid, Terraform, TypeScript

Overview

Slack's Deploy Safety Program, launched in mid-2023, achieved a 90% reduction in customer impact hours by January 2025 through automated detection, remediation, and cultural changes across all deployment systems. The article details how Slack defined metrics, selected investment projects, implemented automatic rollbacks, and evolved their deployment safety practices across hundreds of internal services.

What You'll Learn

1. How to design a deployment safety program that reduces customer impact from code changes across hundreds of services

2. How to select and validate a top-line metric that serves as an analog for customer sentiment

3. Why automatic rollbacks dramatically outperform manual remediation in reducing deployment incident impact

4. How to structure investment strategy across multiple deployment safety projects with trailing metrics

5. Why cultural adoption of deployment tooling requires direct training and frequent use, not just building the tools

Prerequisites & Requirements

Understanding of deployment pipelines and CI/CD concepts
Familiarity with incident management processes and severity classification
Understanding of observability and metrics-based monitoring
Experience operating large-scale distributed systems with multiple deployment methods (optional)

Key Questions Answered

How did Slack reduce customer impact from deployments by 90%?

Slack launched the Deploy Safety Program in mid-2023, investing across multiple projects targeting automated detection and remediation. The biggest breakthrough came from introducing automatic rollbacks for webapp backend deployments, which kept customer impact below 10 minutes. Combined with metrics-based deployment monitoring, manual rollback improvements, and cultural changes encouraging teams to 'just roll back,' they achieved a 90% reduction in customer impact hours by January 2025.

What percentage of Slack's customer-facing incidents were caused by code deployments?

Analysis showed that 73% of customer-facing incidents at Slack were triggered by Slack-induced change, particularly code deploys. This finding was a key driver for creating the Deploy Safety Program, as it revealed that the majority of reliability issues came from the deployment process itself rather than external factors or system failures unrelated to changes.

How should you choose a metric for a deployment safety program?

Slack chose 'hours of customer impact from high severity and selected medium severity change-triggered incidents' as their Deploy Safety metric. Key criteria include: measuring actual results rather than effort, understanding whether you're measuring real impact or an analog, maintaining consistency in measurement (especially subjective portions), and continually validating the metric matches customer sentiment through direct customer conversations. Pick an imperfect metric and stay consistent rather than spending excessive time finding the perfect one.
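One way to picture the metric described above is as a filter plus a sum over incident records. The following is a minimal sketch, not Slack's implementation: the `Incident` record, its field names, and the `selected` flag for medium-severity incidents are all assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; all field names are assumptions."""
    severity: str            # "high", "medium", or "low"
    change_triggered: bool   # True if a deploy or other change caused it
    selected: bool           # medium severity only: chosen for inclusion by reviewers
    impact_start: datetime
    impact_end: datetime


def counts_toward_metric(incident: Incident) -> bool:
    """High severity always counts; medium severity counts only when selected."""
    if not incident.change_triggered:
        return False
    if incident.severity == "high":
        return True
    return incident.severity == "medium" and incident.selected


def customer_impact_hours(incidents: list[Incident]) -> float:
    """Sum the impact duration, in hours, of the incidents that count."""
    total = timedelta()
    for incident in incidents:
        if counts_toward_metric(incident):
            total += incident.impact_end - incident.impact_start
    return total.total_seconds() / 3600


# Invented sample data for illustration only.
start = datetime(2024, 1, 10, 9, 0)
sample = [
    Incident("high", True, False, start, start + timedelta(minutes=90)),
    Incident("medium", True, True, start, start + timedelta(minutes=30)),
    Incident("medium", True, False, start, start + timedelta(minutes=45)),  # not selected
    Incident("high", False, False, start, start + timedelta(hours=3)),      # not change-triggered
]
print(customer_impact_hours(sample))  # 1.5 h + 0.5 h = 2.0
```

The subjective part of the metric lives in the `selected` flag, which is where consistency of judgment matters most.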

What are the target timeframes for detecting and remediating deployment issues?

Slack's North Star goals specified automated detection and remediation within 10 minutes, and manual detection and remediation within 20 minutes. They also targeted detecting problematic deployments before they reached 10% of the fleet to reduce blast radius. Customer feedback indicated that shorter interruptions were treated as a 'blip,' and that interruptions became more disruptive after about 10 minutes.
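The targets above can be written down as simple thresholds and checked against each problematic deployment. This is an illustrative sketch only: the three numbers come from the goals described above, while the `DeployIssue` record and its field names are hypothetical.

```python
from dataclasses import dataclass

# Targets taken from the North Star goals described above.
AUTOMATED_TARGET_MIN = 10
MANUAL_TARGET_MIN = 20
MAX_FLEET_FRACTION = 0.10


@dataclass
class DeployIssue:
    """Hypothetical record of one problematic deployment."""
    automated: bool                      # True if detection and remediation were automatic
    minutes_to_remediate: float          # from start of impact to remediation
    fleet_fraction_at_detection: float   # 0.0 to 1.0


def meets_north_star(issue: DeployIssue) -> dict[str, bool]:
    """Check one issue against the time target and the blast-radius target."""
    target = AUTOMATED_TARGET_MIN if issue.automated else MANUAL_TARGET_MIN
    return {
        "time": issue.minutes_to_remediate <= target,
        # "Before reaching 10%" is read here as strictly below the threshold.
        "blast_radius": issue.fleet_fraction_at_detection < MAX_FLEET_FRACTION,
    }


print(meets_north_star(DeployIssue(True, 8, 0.05)))
print(meets_north_star(DeployIssue(False, 25, 0.25)))
```

Tracking the two checks separately makes it visible whether a miss came from slow remediation or from a deployment spreading too far before detection.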

How should you prioritize investment in deployment safety projects when you have many incident sources?

Slack's strategy was to invest widely at first and bias for action, focus on areas of known pain, then invest further in projects showing results while curtailing investment in less impactful areas. They maintained a flexible shorter-term roadmap that could change based on results. Projects that didn't have the desired impact were not treated as failures but as critical inputs that guided future investment decisions and showed which areas provide greater value.

Why do engineers resist adopting new deployment safety tools and rollback processes?

Engineers delay adoption until they're familiar and comfortable with new practices, worrying they'll make problems worse by following an unknown path even when reassured it's better. Slack found that without frequent use to build fluency, confidence, and comfort, processes and tools won't become routine during stressful incidents. The solution requires providing direct training multiple times to multiple groups, continually improving rollback tooling based on incident experience, and using tools frequently rather than only during worst-case scenarios.

How do you manage a reliability program with trailing metrics that take months to show results?

Slack experienced a 3-6 month lag time to observe each project's full impact and even saw their peak impact quarter occur after initial projects were delivered. The approach requires patience, gathering intermediate metrics to confirm improvements are functioning (like issue detection rates) while waiting for full results, having faith in decisions made with available information, and maintaining agility to change direction once results are confirmed. Executive reviews every 4-6 weeks help maintain confidence and alignment.
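An intermediate metric such as an issue detection rate can be computed long before the top-line number moves. The sketch below assumes a hypothetical list of deploy records with `caused_issue` and `alert_fired` keys; the text does not describe how Slack stores this data, and the sample values are invented.

```python
from typing import Optional


def detection_rate(deploys: list[dict]) -> Optional[float]:
    """Fraction of problematic deploys for which automated alerting fired.

    Returns None when there were no problematic deploys, since the rate
    is undefined in that case.
    """
    problematic = [d for d in deploys if d["caused_issue"]]
    if not problematic:
        return None
    detected = sum(1 for d in problematic if d["alert_fired"])
    return detected / len(problematic)


# Invented sample data for illustration only.
quarter = [
    {"id": "d1", "caused_issue": True, "alert_fired": True},
    {"id": "d2", "caused_issue": False, "alert_fired": False},
    {"id": "d3", "caused_issue": True, "alert_fired": False},
    {"id": "d4", "caused_issue": True, "alert_fired": True},
    {"id": "d5", "caused_issue": True, "alert_fired": True},
]
print(detection_rate(quarter))  # 3 of 4 problematic deploys detected: 0.75
```

A rising detection rate suggests the detection work is functioning even while the trailing impact metric has not yet caught up.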

What is the difference between automatic and manual rollback in reducing deployment incidents?

At Slack, automatic rollbacks proved dramatically more effective than manual remediation. The first quarter of metrics-based deploy alerts with manual remediation showed improvement, yet the peak quarter of customer impact still occurred during this manual remediation phase. After automatic rollbacks were introduced in April 2024, results improved dramatically, with customer impact consistently staying below 10 minutes. This success led Slack to reduce investment in manual remediation improvements in favor of greater automation.
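The general idea of a metrics-based deployment with automatic rollback can be sketched as a loop that widens exposure in stages and stops at the first metric breach. This is not Slack's system, whose internals are not described here: the stage fractions, the `error_rate_at` callback, and the threshold are all assumptions chosen for illustration.

```python
from typing import Callable


def staged_deploy(
    stages: list[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> tuple[str, float]:
    """Advance through fleet fractions and roll back on the first breach.

    Returns (outcome, fleet fraction reached). `error_rate_at` is a
    hypothetical stand-in for querying a monitoring system.
    """
    reached = 0.0
    for fraction in stages:
        reached = fraction
        if error_rate_at(fraction) > max_error_rate:
            # A real system would redeploy the previous build here,
            # with no human in the loop.
            return ("rolled_back", reached)
    return ("completed", reached)


stages = [0.01, 0.05, 0.10, 0.50, 1.0]

# A healthy build stays under the threshold at every stage.
print(staged_deploy(stages, lambda f: 0.001, max_error_rate=0.02))

# A build that only misbehaves at wider exposure is caught at 5% of the fleet.
print(staged_deploy(stages, lambda f: 0.002 if f < 0.05 else 0.06, max_error_rate=0.02))
```

Because the decision to roll back is made by the loop rather than by a person, time to remediation depends on how quickly the metric check runs, not on how quickly someone notices an alert.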

Key Statistics & Figures

Customer impact hours reduction: 90%
Reduction from peak to January 2025, measured over approximately 18 months of the Deploy Safety Program

Change-triggered incidents percentage: 73%
Percentage of customer-facing incidents triggered by Slack-induced change, particularly code deploys

Automated remediation target time: 10 minutes
North Star goal for automated detection and remediation of deployment issues

Manual remediation target time: 20 minutes
North Star goal for manual detection and remediation of deployment issues

Fleet exposure threshold: 10%
Target to detect problematic deployments before reaching 10% of the fleet

Customer disruption tolerance: 10 minutes
Customer feedback indicated interruptions under 10 minutes were treated as a 'blip'

Results observation lag time: 3-6 months
Typical delay from project delivery to observing full impact on the top-line metric

Executive review cadence: Every 4-6 weeks
Frequency of executive reviews to ensure continued alignment and support
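A "reduction from peak" figure like the 90% above is measured against the worst period, not the first one. The sketch below shows the arithmetic; the quarterly values are invented for illustration and are not Slack's data.

```python
def reduction_from_peak(quarterly_hours: list[float]) -> float:
    """Percentage reduction of the latest value relative to the peak value."""
    if not quarterly_hours:
        raise ValueError("need at least one quarter of data")
    peak = max(quarterly_hours)
    if peak == 0:
        raise ValueError("peak is zero, so the reduction is undefined")
    latest = quarterly_hours[-1]
    return 100 * (peak - latest) / peak


# Invented numbers: impact rises to a peak in the second quarter, then falls.
hours = [60.0, 100.0, 40.0, 10.0]
print(reduction_from_peak(hours))  # 100 * (100 - 10) / 100 = 90.0
```

Measured from the first quarter instead of the peak, the same invented series would show a smaller reduction, so stating the baseline matters when reporting the number.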

Technologies & Tools


Infrastructure

Kubernetes

Container orchestration platform used as part of Slack Bedrock for deployments

Cloud Platform

AWS

Cloud infrastructure; the AWS Pipelines deployment system served as inspiration for centralized deployment orchestration

Infrastructure

EC2

Target for centralized deployment orchestration tooling expansion

Infrastructure As Code

Terraform

Target for centralized deployment orchestration tooling expansion

Infrastructure

Karpenter

Kubernetes node autoscaler adopted by Slack for operational and cost efficiency

Deployment Tooling

Releasebot

Slack's deployment automation system that inspired centralized deployment orchestration

Internal Compute Platform

Slack Bedrock

Slack's internal compute platform supporting metrics-based deployments with automatic remediation

Key Actionable Insights

1
Implement automatic rollbacks as the highest-priority deployment safety investment. Slack found that automatic rollbacks were dramatically more effective than manual remediation, with customer impact consistently staying below 10 minutes after introduction. Manual remediation, while important as a fallback, showed significantly less impact on reducing customer-facing incidents.
This was validated across Slack's webapp backend, frontend, and infrastructure deployments, with results showing continued improvement quarter-over-quarter after automatic rollback adoption.

2
Adopt a 'wide investment, then narrow' strategy when you cannot predict which safety projects will have the most impact. Start by investing broadly across known pain points and bias for action, then double down on patterns that show results and cut investment in less impactful areas. Not all projects will succeed, and that is by design.
Slack attributed their success to iterating on patterns that worked and copying them to other systems. Incident data is trailing, so waiting for perfect information before investing means customers continue experiencing pain.

3
Set concrete North Star goals for deployment safety that target both detection time and blast radius. Slack targeted automated detection and remediation within 10 minutes, manual remediation within 20 minutes, and detecting issues before reaching 10% of the fleet. These evolved into a broader Deploy Safety Manifesto applied across all deployment systems.
Customer feedback showed interruptions under 10 minutes were treated as a 'blip,' making sub-10-minute automated remediation a critical threshold for customer satisfaction.

4
Train engineers repeatedly and create opportunities for frequent tool usage, not just during incidents. Slack discovered that building rollback tools wasn't enough: engineers delayed adoption because they were unfamiliar and uncomfortable, and without regular practice the tools wouldn't become routine during stressful incidents. Direct training to multiple groups, multiple times, was necessary.
The lesson was that infrequently used tools in high-stress situations are effectively unused tools. Frequent practice builds the fluency, confidence, and comfort needed for reliable incident response.

5
Choose an imperfect but consistent metric over spending excessive time finding the perfect one. Slack used 'hours of customer impact from change-triggered incidents' and continually validated it against actual customer sentiment through conversations with leaders who interact directly with customers. Consistency in measurement proved more valuable than precision.
The semi-loose connection between customer sentiment, program metric, and project metric was an ongoing challenge, especially for engineers who prefer concrete feedback loops. Continual validation against customer conversations helped bridge this gap.

6
Maintain executive engagement through regular reviews every 4-6 weeks and ensure deployment safety is a high-level priority in company or engineering goals (OKRs, V2MOMs). This helps sustain investment and alignment across many teams, especially when trailing metrics create uncertainty about whether projects are working.
Slack found that management alignment was strong but communication to general engineering staff needed improvement, revealing the importance of communicating safety program priorities at all levels of the organization.

Common Pitfalls

1

Relying solely on manual remediation processes for deployment incidents instead of investing in automatic rollbacks. Slack found that even with metrics-based deployment alerts and well-trained teams performing manual rollbacks, customer impact remained significant. The peak quarter of impact occurred during the manual remediation phase, before automation was introduced.

Automatic rollbacks proved dramatically more effective, keeping customer impact below 10 minutes, and the investment shift from manual to automated remediation was the single biggest factor in Slack's 90% reduction in impact hours.

2

Building safety tools without investing in adoption training and frequent usage. Engineers will avoid unfamiliar processes during stressful incidents, effectively negating the investment in building the capability. Slack discovered that without regular practice to build fluency, tools might as well not exist when they're needed most.

The solution requires multiple training sessions for multiple groups and designing processes so they're used frequently, not just for infrequent worst-case scenarios. 'Just roll back!' became a cultural mantra only after significant training investment.

3

Expecting immediate measurable results from deployment safety projects when using trailing incident metrics. Slack experienced a 3-6 month lag time to observe each project's full impact, and their peak impact quarter actually came after initial projects had been delivered. This can undermine confidence in the program and lead to premature pivots.

Gather intermediate metrics to confirm improvements are functioning (like issue detection rates) while waiting for full results. Maintain patience and faith that you've made the best decisions with available information, while staying agile enough to change direction once results are confirmed.

4

Applying a one-size-fits-all approach to deployment safety across all teams and systems. Slack learned that not all teams and systems are the same: some teams know their pain points and have ideas, while others want to improve but need additional resources and guidance.

Direct outreach to individual engineering teams was critical. The Deploy Safety program team engaged directly with teams to understand their specific systems and processes, provide tailored improvement guidance, and encourage innovation and prioritization.

5

Spending excessive time deliberating on the perfect deployment safety metric instead of picking one and maintaining consistency. Slack acknowledges there is no perfect metric, as there's always a semi-loose connection between customer sentiment, program metrics, and project metrics.

Pick a metric that reasonably approximates customer sentiment, be consistent with it, and validate through direct customer conversations. Continual refinement is better than endless deliberation. Engineers prefer concrete feedback loops, so provide interim metrics alongside the top-line measurement.

Related Concepts

Deployment Automation And CI/CD Pipelines

Incident Management And Severity Classification

Automatic Rollback Systems

Canary Deployments And Progressive Rollouts

Blast Radius Control And Isolation Boundaries

Observability And Metrics-based Monitoring

Site Reliability Engineering (SRE)

Change Management In Distributed Systems

Deployment Orchestration Platforms

OKRs And V2MOMs For Engineering Alignment

Customer Impact Measurement And Trailing Metrics

AI-based Anomaly Detection For Deployments