The Scary Thing About Automating Deploys

Most of Slack runs on a monolithic service simply called “The Webapp”. It’s big – hundreds of developers create hundreds of changes every week. Deploying at this scale is a unique challenge. When people talk about continuous deployment, they’re often thinking about deploying to systems as soon as changes are ready. They talk about microservices…

Sean McIlroy
16 min read · advanced

Overview

The article discusses the complexities and challenges of automating deployments at Slack, particularly in a monolithic service environment. It emphasizes the importance of anomaly detection in automated deployment systems and shares insights on how to effectively implement and monitor such systems.

What You'll Learn

1. How to implement anomaly detection using z scores in deployment monitoring
2. Why automated deployments can improve efficiency and reduce human error
3. When to use dynamic thresholds versus static thresholds in monitoring

Prerequisites & Requirements

  • Basic understanding of deployment processes and monitoring systems
  • Familiarity with Python and statistical analysis libraries (optional)

Key Questions Answered

How does Slack automate its deployment process?
Slack automates its deployment process with a tool called ReleaseBot, which runs 24/7 and continually deploys new builds. It was developed to manage the complexity of deploying the hundreds of changes that developers create every week, allowing faster iteration and quicker responses to customer feedback.
What is the significance of z scores in monitoring deployments?
Z scores detect anomalies in deployment metrics by measuring how far a data point lies from the mean of recent data, expressed in standard deviations. A value that breaches the z score threshold signals a significant change that needs attention, which helps surface potential issues quickly during a deployment.
What challenges do teams face when automating deployments?
Teams often fear automating deployments due to the risk of breaking production systems. This fear is compounded by the complexities of monitoring and the need for reliable alerting mechanisms to ensure that any issues are promptly addressed.
How does Slack handle monitoring during deployments?
Slack uses both static and dynamic thresholds for monitoring during deployments. Dynamic thresholds are calculated based on historical data, allowing the system to adapt to normal variations in metrics, while static thresholds provide a baseline for alerts.
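
The interplay between the two kinds of threshold can be sketched in a few lines of Python. This is a minimal illustration, not ReleaseBot's actual logic: the function names, the "average of the highest historical values" heuristic, and the rule of taking the larger of the two limits are all assumptions made for the example.

```python
from statistics import mean


def dynamic_threshold(history, top_n=3):
    """Derive a limit from past data: the average of the top_n highest values.

    Assumed heuristic for illustration; the source does not specify the formula.
    """
    if not history:
        raise ValueError("history must not be empty")
    peaks = sorted(history, reverse=True)[:top_n]
    return mean(peaks)


def breaches(value, history, static_threshold, top_n=3):
    """Return True when value exceeds both the static and the dynamic limit.

    Taking the larger limit means a metric that is normally noisy does not
    alert on its usual peaks, while the static limit still acts as a floor.
    """
    limit = max(static_threshold, dynamic_threshold(history, top_n))
    return value > limit


if __name__ == "__main__":
    # Hypothetical error counts per minute from earlier, healthy deploys.
    past_errors = [40, 55, 48, 60, 52, 58]
    print(breaches(59, past_errors, static_threshold=50))
    print(breaches(56, past_errors, static_threshold=50))
```

With this design, raising the static threshold makes alerts stricter everywhere, while the dynamic part only relaxes alerts for metrics whose own history shows regular peaks.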

Key Statistics & Figures

Median deploy size
3 PRs
Slack deploys from its Webapp repository 30-40 times a day, managing a reasonable PR-to-deploy ratio despite the scale of changes.
Deploys per day
30-40
Slack's deployment frequency allows for rapid iteration and response to customer feedback.
Z score threshold for anomaly detection
3
A z score of 3 generally indicates a significant outlier, prompting a review of the deployment metrics.
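
As a rough sketch of how a threshold of 3 might be applied, the snippet below computes the z score of the newest data point against a window of earlier values. It uses only Python's standard library so it runs without dependencies; the source mentions SciPy, which offers `scipy.stats.zscore` for the same calculation over arrays. The function names, the sample data, and the handling of a perfectly flat history are assumptions for illustration.

```python
import math
from statistics import mean, pstdev


def z_score(history, latest):
    """Number of standard deviations that latest lies from the mean of history."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of the window
    if sigma == 0:
        # Flat history: any change at all is infinitely unusual (assumed convention).
        if latest == mu:
            return 0.0
        return math.copysign(math.inf, latest - mu)
    return (latest - mu) / sigma


def is_anomalous(history, latest, threshold=3.0):
    """Flag the point when its z score is at least threshold in either direction."""
    return abs(z_score(history, latest)) >= threshold


if __name__ == "__main__":
    # Hypothetical metric samples collected before the deploy.
    window = [10, 12, 11, 9, 10, 12, 11, 10]
    print(is_anomalous(window, 25))  # far above the usual range
    print(is_anomalous(window, 11))  # within normal variation
```

Checking the absolute value catches sudden drops as well as spikes, which matters for metrics where a fall (for example, in request volume) is as worrying as a rise.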

Technologies & Tools

Programming Language
Python
Used to implement ReleaseBot and to calculate z scores for anomaly detection.
Library
SciPy
Used for statistical calculations, specifically computing z scores in ReleaseBot.

Key Actionable Insights

1. Implement anomaly detection in your deployment monitoring to catch issues early.
By using statistical methods like z scores, you can identify unusual patterns in your metrics that may indicate problems, allowing for quicker remediation.
2. Consider automating your deployment process to reduce human error and increase efficiency.
Automation can streamline your deployment workflow, enabling your team to focus on development rather than manual deployment tasks.
3. Regularly review and adjust your monitoring thresholds to ensure they remain relevant.
As your application evolves, so should your monitoring strategies. This helps maintain effective alerting without overwhelming your team with false positives.

Common Pitfalls

1. Teams often avoid automating deployments for fear of causing production failures.
This fear can keep teams from benefiting from automation, but understanding how monitoring differs between manual and automated deploys can ease those concerns and lead to more effective deployment strategies.

Related Concepts

Anomaly Detection
Continuous Deployment
Deployment Monitoring