It’s mid-2023 and we’ve identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward. We’re a year and a half into the Deploy Safety Program at Slack, improving the way we deploy, uplifting our safety culture and continuing…
Overview
Slack's Deploy Safety Program, launched in mid-2023, achieved a 90% reduction in customer impact hours by January 2025 through automated detection, remediation, and cultural changes across all deployment systems. The article details how Slack defined metrics, selected investment projects, implemented automatic rollbacks, and evolved their deployment safety practices across hundreds of internal services.
What You'll Learn
How to design a deployment safety program that reduces customer impact from code changes across hundreds of services
How to select and validate a top-line metric that serves as an analog for customer sentiment
Why automatic rollbacks dramatically outperform manual remediation in reducing deployment incident impact
How to structure investment strategy across multiple deployment safety projects with trailing metrics
Why cultural adoption of deployment tooling requires direct training and frequent use, not just building the tools
Prerequisites & Requirements
- Understanding of deployment pipelines and CI/CD concepts
- Familiarity with incident management processes and severity classification
- Understanding of observability and metrics-based monitoring
- Experience operating large-scale distributed systems with multiple deployment methods (optional)
Key Questions Answered
How did Slack reduce customer impact from deployments by 90%?
What percentage of Slack's customer-facing incidents were caused by code deployments?
How should you choose a metric for a deployment safety program?
What are the target timeframes for detecting and remediating deployment issues?
How should you prioritize investment in deployment safety projects when you have many incident sources?
Why do engineers resist adopting new deployment safety tools and rollback processes?
How do you manage a reliability program with trailing metrics that take months to show results?
What is the difference between automatic and manual rollback in reducing deployment incidents?
Key Actionable Insights
1. Implement automatic rollbacks as the highest-priority deployment safety investment. Slack found that automatic rollbacks were dramatically more effective than manual remediation, with customer impact consistently staying below 10 minutes after introduction. Manual remediation, while important as a fallback, showed significantly less impact on reducing customer-facing incidents. This was validated across Slack's webapp backend, frontend, and infrastructure deployments, with results showing continued improvement quarter-over-quarter after automatic rollback adoption.
2. Adopt a 'wide investment, then narrow' strategy when you cannot predict which safety projects will have the most impact. Start by investing broadly across known pain points and bias for action, then double down on patterns that show results and cut investment in less impactful areas. Not all projects will succeed, and that is by design. Slack attributed their success to iterating on patterns that worked and copying them to other systems. Incident data is trailing, so waiting for perfect information before investing means customers continue experiencing pain.
3. Set concrete North Star goals for deployment safety that target both detection time and blast radius. Slack targeted automated detection and remediation within 10 minutes, manual remediation within 20 minutes, and detecting issues before reaching 10% of the fleet. These evolved into a broader Deploy Safety Manifesto applied across all deployment systems. Customer feedback showed interruptions under 10 minutes were treated as a 'blip,' making sub-10-minute automated remediation a critical threshold for customer satisfaction.
4. Train engineers repeatedly and create opportunities for frequent tool usage, not just during incidents. Slack discovered that building rollback tools wasn't enough: engineers delayed adoption because they were unfamiliar and uncomfortable, and without regular practice the tools wouldn't become routine during stressful incidents. Direct training to multiple groups, multiple times, was necessary. The lesson was that infrequently used tools in high-stress situations are effectively unused tools. Frequent practice builds the fluency, confidence, and comfort needed for reliable incident response.
5. Choose an imperfect but consistent metric over spending excessive time finding the perfect one. Slack used 'hours of customer impact from change-triggered incidents' and continually validated it against actual customer sentiment through conversations with leaders who interact directly with customers. Consistency in measurement proved more valuable than precision. The semi-loose connection between customer sentiment, program metric, and project metric was an ongoing challenge, especially for engineers who prefer concrete feedback loops. Continual validation against customer conversations helped bridge this gap.
6. Maintain executive engagement through regular reviews every 4-6 weeks and ensure deployment safety is a high-level priority in company or engineering goals (OKRs, V2MOMs). This helps sustain investment and alignment across many teams, especially when trailing metrics create uncertainty about whether projects are working. Slack found that management alignment was strong but communication to general engineering staff needed improvement, revealing the importance of communicating safety program priorities at all levels of the organization.
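To make the North Star targets from insight 3 and the top-line metric from insight 5 more concrete, here is a minimal Python sketch. It is not Slack's implementation. The `DeployIncident` record, its fields, and both function names are hypothetical and chosen only for illustration; the 10-minute, 20-minute, and 10% thresholds are the figures quoted above. The sketch assumes each change-triggered incident has a known impact start, impact end, remediation type, and the share of the fleet running the change at detection time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# Targets quoted in the summary above; the constant names are hypothetical.
AUTO_REMEDIATION_TARGET = timedelta(minutes=10)
MANUAL_REMEDIATION_TARGET = timedelta(minutes=20)
MAX_FLEET_FRACTION_AT_DETECTION = 0.10


@dataclass
class DeployIncident:
    """Hypothetical record of one change-triggered incident."""
    impact_start: datetime
    impact_end: datetime
    automated: bool                     # True if detection and remediation were automatic
    fleet_fraction_at_detection: float  # 0.0 to 1.0: share of fleet on the change when detected

    @property
    def impact(self) -> timedelta:
        """Duration of customer impact for this incident."""
        return self.impact_end - self.impact_start


def meets_north_star(incident: DeployIncident) -> bool:
    """Check one incident against the time and blast-radius targets."""
    limit = AUTO_REMEDIATION_TARGET if incident.automated else MANUAL_REMEDIATION_TARGET
    within_time = incident.impact <= limit
    # "Before reaching 10% of the fleet" is read here as strictly below 10%.
    within_blast_radius = incident.fleet_fraction_at_detection < MAX_FLEET_FRACTION_AT_DETECTION
    return within_time and within_blast_radius


def customer_impact_hours(incidents: Iterable[DeployIncident]) -> float:
    """Sum of customer impact, in hours, across change-triggered incidents."""
    total_seconds = sum(i.impact.total_seconds() for i in incidents)
    return total_seconds / 3600.0


if __name__ == "__main__":
    start = datetime(2025, 1, 6, 12, 0)
    example = DeployIncident(start, start + timedelta(minutes=8), True, 0.05)
    print(meets_north_star(example), customer_impact_hours([example]))
```

Tracking both values side by side reflects the split described above: the per-incident check gives engineers a concrete feedback loop, while the summed hours figure is the trailing program metric that still has to be validated against customer sentiment.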