Deploys at Slack

Deploys require a careful balance of speed and reliability. At Slack, we value quick iteration, fast feedback loops, and responsiveness to customer feedback. We also have hundreds of engineers who are trying to be as productive as possible. Keeping to these values while growing as a company means continual refinement of our deployment system. We…

Michael Deng
8 min read · beginner

Overview

The article discusses the deployment process at Slack, emphasizing the balance between speed and reliability. It outlines the current deployment workflow, the evolution of their deployment system, and the importance of maintaining stability as the company scales.

What You'll Learn

1. How to implement a percentage-based rollout strategy for deployments
2. Why having deploy commanders is crucial for managing deployment risks
3. How to utilize hot and cold directories for atomic deploys

Prerequisites & Requirements

  • Understanding of deployment processes and CI/CD practices

Key Questions Answered

How does Slack ensure reliability during deployments?
Slack designates a deploy commander for each release, who monitors performance and coordinates communication. It also uses a percentage-based rollout that gradually exposes a new build to production traffic, so issues can be detected and rolled back quickly.
What steps are involved in Slack's deployment process?
The deployment process at Slack involves several steps: creating a release branch, deploying to staging for automated tests, rolling out to a dogfood tier for internal testing, and finally executing a percentage-based rollout to production. This structured approach helps in identifying and mitigating potential issues early.
What are atomic deploys and why are they important?
In an atomic deploy, the new code is copied to a cold directory while the server keeps serving from the hot one. Once the copy is complete, the server switches to the new directory in a single step. This prevents the errors that occur when requests hit a mix of old and new files during copying, so users do not encounter broken functionality mid-deploy.
What challenges did Slack face with their initial deployment model?
Initially, Slack's deployment model struggled with scalability as the number of servers increased. The push-based model could not keep up with the growing infrastructure, leading to longer deployment times and inefficiencies. They eventually transitioned to a pull-based system to maintain deployment velocity.
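
The percentage-based rollout described in the answers above can be sketched roughly as follows. This is a minimal illustration, not Slack's implementation: the stage list, the function name `percentage_rollout`, and the `set_traffic` and `is_healthy` callbacks are all hypothetical. Only the roughly 2% canary step comes from the article. In practice the health check is partly a human judgment made by the deploy commander watching dashboards.

```python
from typing import Callable, Sequence

# Hypothetical rollout stages, as percentages of production traffic.
# Only the ~2% canary step comes from the article; the rest are illustrative.
STAGES: Sequence[int] = (2, 10, 25, 50, 75, 100)


def percentage_rollout(
    build: str,
    previous_build: str,
    set_traffic: Callable[[str, int], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[int] = STAGES,
) -> bool:
    """Expose `build` to a growing share of traffic, rolling back on failure.

    Returns True if the build reached the final stage, False if it was
    rolled back.
    """
    for percent in stages:
        set_traffic(build, percent)
        if not is_healthy(build):
            # Issue detected: send all traffic back to the last good build.
            set_traffic(previous_build, 100)
            return False
    return True
```

Because each stage exposes only a slice of traffic before the next health check, a bad build is caught while it affects few users, and the rollback is a single step back to the previous build.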

Key Statistics & Figures

Daily scheduled deploys: 12
Slack performs about 12 scheduled deploys each day.
Percentage of production traffic for canary deployments: 2%
The initial rollout to the canary tier covers about 2% of production traffic, allowing for careful monitoring before the rollout continues.

Technologies & Tools


Amazon EC2 (Infrastructure)
Used for hosting Slack's application and managing deployment processes.
Consul (Infrastructure)
Utilized for signaling servers to pull new builds concurrently during deployments.

Key Actionable Insights

1. Implement a structured deployment process that includes a designated deploy commander to oversee releases.
   This approach helps manage deployment risk by making one person responsible for monitoring performance and coordinating the response to any issues that arise.
2. Adopt a percentage-based rollout strategy to gradually expose new features to users.
   This method allows teams to detect and address issues before they affect a larger user base, which improves the reliability of the deployment process.
3. Utilize atomic deploys to minimize downtime and errors during code updates.
   By preparing new code in a cold directory and switching to it instantly, teams avoid partial updates that leave a server running a mix of old and new files.
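
As a rough sketch of the hot and cold directory idea, the example below copies a build into a cold directory and then switches a `current` symlink to it in one rename. The symlink swap is an assumption chosen for illustration; the summary does not say how Slack's servers are pointed at the new directory. The function name `atomic_deploy` and the directory layout are hypothetical, and the atomicity of the rename assumes a POSIX filesystem.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_deploy(build_src: Path, releases_dir: Path, current_link: Path) -> Path:
    """Copy a build into a cold directory, then switch `current_link` to it."""
    cold_dir = releases_dir / build_src.name
    # Step 1: copy into the cold directory. Live traffic still reads the
    # directory that `current_link` points at, so a slow copy is harmless.
    shutil.copytree(build_src, cold_dir)

    # Step 2: create the new symlink under a temporary name, then rename it
    # over the live link. The rename is atomic on POSIX, so readers see either
    # the old directory or the new one, never a half-copied mix.
    tmp_link = current_link.with_name(current_link.name + ".tmp")
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(cold_dir, tmp_link, target_is_directory=True)
    os.replace(tmp_link, current_link)
    return cold_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "releases").mkdir()
        src = root / "incoming" / "build-1"
        src.mkdir(parents=True)
        (src / "version.txt").write_text("build-1")
        atomic_deploy(src, root / "releases", root / "current")
        print((root / "current" / "version.txt").read_text())
```

The previous build's directory is left in place, so rolling back is the same symlink switch pointed at the older directory.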

Common Pitfalls

1. Relying solely on a push-based deployment model can lead to increased deployment times and potential errors.
   As Slack scaled, the initial push-based model became inefficient, prompting the need for a pull-based system that allowed for faster and more reliable deployments.
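
To make the push versus pull contrast concrete, here is a small simulation of the pull model. It is a sketch under stated assumptions: `SignalStore` is an in-memory stand-in for a key/value store such as Consul and does not use the real Consul API, and `Server`, `pull_deploy`, and the `fetch` callback are hypothetical names. A thread pool stands in for servers that would each poll on their own in a real fleet.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SignalStore:
    """In-memory stand-in for a key/value store (hypothetical, not Consul's API)."""
    target_build: str = ""


@dataclass
class Server:
    name: str
    current_build: str = ""

    def poll(self, store: SignalStore, fetch: Callable[[str], None]) -> bool:
        """Pull the target build if this server is behind. Returns True if updated."""
        target = store.target_build
        if target and target != self.current_build:
            fetch(target)  # each server downloads the build itself
            self.current_build = target
            return True
        return False


def pull_deploy(
    store: SignalStore,
    servers: List[Server],
    build: str,
    fetch: Callable[[str], None],
    max_workers: int = 32,
) -> int:
    """Publish `build` once, then let servers pull it concurrently.

    Returns the number of servers that updated.
    """
    # The deployer does one cheap write instead of copying to every server.
    store.target_build = build
    # Simulate the fleet reacting in parallel; real servers would poll or
    # watch the key independently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        updated = list(pool.map(lambda s: s.poll(store, fetch), servers))
    return sum(updated)
```

The deployer's work stays constant as the fleet grows, since the copying is spread across the servers themselves rather than driven from one machine.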