The Scary Thing About Automating Deploys

Sean McIlroy

Most of Slack runs on a monolithic service simply called “The Webapp”. It’s big – hundreds of developers create hundreds of changes every week. Deploying at this scale is a unique challenge. When people talk about continuous deployment, they’re often thinking about deploying to systems as soon as changes are ready. They talk about microservices…

Slack

16 min read · Advanced

AWS, Chef, Envoy, Jenkins, PHP, Prometheus, Python, TypeScript

Overview

The article discusses the complexities and challenges of automating deployments at Slack, particularly in a monolithic service environment. It emphasizes the importance of anomaly detection in automated deployment systems and shares insights on how to effectively implement and monitor such systems.

What You'll Learn

1. How to implement anomaly detection using z scores in deployment monitoring

2. Why automated deployments can improve efficiency and reduce human error

3. When to use dynamic thresholds versus static thresholds in monitoring

Prerequisites & Requirements

Basic understanding of deployment processes and monitoring systems
Familiarity with Python and statistical analysis libraries (optional)

Key Questions Answered

How does Slack automate its deployment process?

Slack automates its deployment process using a tool called ReleaseBot, which runs 24/7 to continually deploy new builds. This system was developed to manage the complexity of deploying hundreds of changes every week, allowing for faster iteration and quicker responses to customer feedback.

What is the significance of z scores in monitoring deployments?

Z scores are used to detect anomalies in deployment metrics. A z score measures how far a data point deviates from the mean, expressed in standard deviations. A z score threshold breach indicates a significant change that requires attention, helping to identify potential issues quickly during deployments.
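
As a rough illustration of the idea, and not ReleaseBot's actual code: this summary says the z scores are computed with SciPy, but the sketch below uses only the Python standard library so it runs anywhere. The function name `z_score`, the sample error counts, and the choice of the population standard deviation are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def z_score(history, latest):
    """Return how many standard deviations `latest` lies from the mean of `history`."""
    mean = fmean(history)
    spread = pstdev(history)  # population standard deviation (an assumption)
    if spread == 0:
        # Flat history: the ratio is undefined, so treat any change as extreme.
        if latest == mean:
            return 0.0
        return float("inf") if latest > mean else float("-inf")
    return (latest - mean) / spread


# Hypothetical errors-per-minute samples collected before a deploy.
baseline = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
THRESHOLD = 3  # the z score threshold cited in this summary

during_deploy = 25  # hypothetical reading taken while the deploy rolls out
score = z_score(baseline, during_deploy)
is_anomaly = abs(score) > THRESHOLD
print(f"z = {score:.2f}, anomaly = {is_anomaly}")
```

Taking the absolute value flags sudden drops as well as spikes, which may or may not be what a given metric needs.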

What challenges do teams face when automating deployments?

Teams often fear automating deployments due to the risk of breaking production systems. This fear is compounded by the complexities of monitoring and the need for reliable alerting mechanisms to ensure that any issues are promptly addressed.

How does Slack handle monitoring during deployments?

Slack uses both static and dynamic thresholds for monitoring during deployments. Dynamic thresholds are calculated based on historical data, allowing the system to adapt to normal variations in metrics, while static thresholds provide a baseline for alerts.
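
One way to picture the two kinds of threshold side by side is sketched below. The summary says only that dynamic thresholds are calculated from historical data. The specific rule used here (mean plus three standard deviations of a recent window), the names `STATIC_LIMIT`, `dynamic_limit`, and `breaches`, and the sample values are assumptions for illustration.

```python
from statistics import fmean, pstdev

STATIC_LIMIT = 50.0  # hypothetical fixed ceiling, e.g. errors per minute


def dynamic_limit(history, k=3.0):
    """Derive a ceiling from recent data: mean + k standard deviations (assumed rule)."""
    return fmean(history) + k * pstdev(history)


def breaches(value, history):
    """Report which of the two thresholds `value` crosses."""
    return {
        "static": value > STATIC_LIMIT,
        "dynamic": value > dynamic_limit(history),
    }


# Hypothetical samples of the same metric at different times of day.
quiet_hours = [4, 5, 6, 5, 4, 6, 5, 5]
busy_hours = [38, 42, 40, 45, 39, 41, 43, 40]

# 20 is well under the fixed ceiling, but unusual for a quiet period.
print(breaches(20, quiet_hours))
# 44 is ordinary for a busy period, so neither threshold fires.
print(breaches(44, busy_hours))
```

The static limit catches values that are bad at any time, while the dynamic limit catches values that are only unusual relative to recent behavior.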

Key Statistics & Figures

Median deploy size: 3 PRs
Slack deploys from its Webapp repository 30-40 times a day, managing a reasonable PR-to-deploy ratio despite the scale of changes.

Deploys per day: 30-40
Slack's deployment frequency allows for rapid iteration and response to customer feedback.

Z score threshold for anomaly detection: 3
A z score of 3 generally indicates a significant outlier, prompting a review of the deployment metrics.
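
For reference, the standard definition of a z score, followed by a worked example. The numbers (a historical mean of 100 and a standard deviation of 10) are invented for illustration and do not come from the article.

```latex
z = \frac{x - \mu}{\sigma}
\qquad \text{for example} \qquad
z = \frac{135 - 100}{10} = 3.5 > 3
```

If the metric is roughly normally distributed, about 99.7% of observations fall within three standard deviations of the mean, which is why a value beyond 3 is commonly treated as an outlier.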

Technologies & Tools

Python (programming language)
Used to implement ReleaseBot and to calculate z scores for anomaly detection.

SciPy (library)
Used for statistical calculations, specifically computing z scores in ReleaseBot.

Key Actionable Insights

1. Implement anomaly detection in your deployment monitoring to catch issues early.
By using statistical methods like z scores, you can identify unusual patterns in your metrics that may indicate problems, allowing for quicker remediation.

2. Consider automating your deployment process to reduce human error and increase efficiency.
Automation can streamline your deployment workflow, enabling your team to focus on development rather than manual deployment tasks.

3. Regularly review and adjust your monitoring thresholds to ensure they remain relevant.
As your application evolves, so should your monitoring strategies. This helps maintain effective alerting without overwhelming your team with false positives.

Common Pitfalls

1. Teams often struggle with the fear of automating deployments due to the potential for production failures.
This fear can prevent teams from leveraging automation, but understanding the differences in monitoring can alleviate concerns and lead to more effective deployment strategies.

Related Concepts

Anomaly Detection

Continuous Deployment

Deployment Monitoring
