Deploys at Slack

Michael Deng

Deploys require a careful balance of speed and reliability. At Slack, we value quick iteration, fast feedback loops, and responsiveness to customer feedback. We also have hundreds of engineers who are trying to be as productive as possible. Keeping to these values while growing as a company means continual refinement of our deployment system. We…

Slack

8 min read, beginner

Chef, Consul, Git, Jenkins

Overview

The article discusses the deployment process at Slack, emphasizing the balance between speed and reliability. It outlines the current deployment workflow, the evolution of their deployment system, and the importance of maintaining stability as the company scales.

What You'll Learn

1. How to implement a percentage-based rollout strategy for deployments

2. Why having deploy commanders is crucial for managing deployment risks

3. How to utilize hot and cold directories for atomic deploys

Prerequisites & Requirements

Understanding of deployment processes and CI/CD practices

Key Questions Answered

How does Slack ensure reliability during deployments?

Slack designates a deploy commander for each release, who monitors performance and coordinates communication. It also uses a percentage-based rollout that exposes each new build to production traffic gradually, so issues can be detected and rolled back quickly.

What steps are involved in Slack's deployment process?

The deployment process at Slack involves several steps: creating a release branch, deploying to staging for automated tests, rolling out to a dogfood tier for internal testing, and finally executing a percentage-based rollout to production. This structured approach helps in identifying and mitigating potential issues early.
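The stages above can be sketched as a small pipeline runner. This is a minimal illustration, not Slack's tooling: the `Stage` type, the `deploy`, `checks_pass`, and `rollback` callbacks, and every production percentage other than the 2% canary are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    production_percent: int  # share of production traffic on the new build


# Stage order follows the summary; only the 2% canary figure comes from it.
# The 25/50/100 steps are illustrative assumptions.
PIPELINE: Sequence[Stage] = (
    Stage("staging", 0),
    Stage("dogfood", 0),
    Stage("canary", 2),
    Stage("production-25", 25),
    Stage("production-50", 50),
    Stage("production-100", 100),
)


def run_pipeline(
    build: str,
    deploy: Callable[[str, Stage], None],
    checks_pass: Callable[[str, Stage], bool],
    rollback: Callable[[str, Stage], None],
    pipeline: Sequence[Stage] = PIPELINE,
) -> bool:
    """Advance a build stage by stage; roll back at the first failed check."""
    for stage in pipeline:
        deploy(build, stage)
        if not checks_pass(build, stage):
            rollback(build, stage)
            return False
    return True
```

In practice `checks_pass` would wrap whatever signals the deploy commander watches (automated tests on staging, error rates in production); here it is just a callback so the control flow stays visible.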

What are atomic deploys and why are they important?

In an atomic deploy, new code is copied to a cold directory and, once the copy is complete, the server switches to that directory in a single step. This prevents the errors that arise when requests are served from a mix of old and new files mid-copy, so users do not hit broken functionality during a deploy.
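One way to sketch the hot and cold directory switch is with a symlink that is replaced by an atomic rename. This is an assumption about mechanism, not a description of Slack's implementation: the slot names `a` and `b`, the `hot` symlink, and the `atomic_deploy` function are all hypothetical, and the sketch assumes a POSIX filesystem where renaming over an existing symlink is atomic.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def current_hot(root: Path) -> Optional[str]:
    """Return the slot name the 'hot' symlink points to, or None if unset."""
    link = root / "hot"
    return os.readlink(link) if link.is_symlink() else None


def atomic_deploy(build_src: Path, root: Path) -> Path:
    """Copy a build into the cold slot, then flip the 'hot' symlink to it."""
    root.mkdir(parents=True, exist_ok=True)
    cold = "b" if current_hot(root) == "a" else "a"
    cold_path = root / cold

    # Prepare the cold slot; nothing is served from it yet, so a slow
    # or partial copy is invisible to users.
    if cold_path.exists():
        shutil.rmtree(cold_path)
    shutil.copytree(build_src, cold_path)

    # Build the new link under a temporary name, then rename it over
    # 'hot'. The rename is the single step that switches traffic.
    tmp_link = root / "hot.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(cold, tmp_link)  # relative target, resolved inside root
    os.replace(tmp_link, root / "hot")
    return cold_path
```

The previous build stays intact in the other slot, so switching back is the same symlink flip in reverse.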

What challenges did Slack face with their initial deployment model?

Initially, Slack's deployment model struggled with scalability as the number of servers increased. The push-based model could not keep up with the growing infrastructure, leading to longer deployment times and inefficiencies. They eventually transitioned to a pull-based system to maintain deployment velocity.
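The difference between the two models can be illustrated with an in-process simulation. It makes no Consul calls; `BuildStore`, `Server`, and `signal_all` are hypothetical stand-ins. The point it shows is that in a pull-based design the deploy host sends only a small signal, and each server fetches the build itself, so the fetches run concurrently instead of queuing behind one pusher.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional


class BuildStore:
    """Stand-in for wherever build artifacts are published (hypothetical)."""

    def __init__(self) -> None:
        self._builds: Dict[str, bytes] = {}

    def publish(self, build_id: str, payload: bytes) -> None:
        self._builds[build_id] = payload

    def fetch(self, build_id: str) -> bytes:
        return self._builds[build_id]


class Server:
    """A server that pulls a build when it receives a signal."""

    def __init__(self, name: str, store: BuildStore) -> None:
        self.name = name
        self.store = store
        self.build_id: Optional[str] = None
        self.payload: Optional[bytes] = None

    def on_signal(self, build_id: str) -> None:
        # The signal carries only the build id; the server does the copy.
        self.payload = self.store.fetch(build_id)
        self.build_id = build_id


def signal_all(servers: List[Server], build_id: str, max_workers: int = 32) -> None:
    """Deliver the signal to every server; pulls proceed concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any fetch error.
        list(pool.map(lambda s: s.on_signal(build_id), servers))
```

In the system the summary describes, Consul plays the signaling role that `signal_all` stands in for here.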

Key Statistics & Figures

Daily scheduled deploys: 12

Slack performs about 12 scheduled deploys each day, ensuring continuous integration and delivery.

Percentage of production traffic for canary deployments: 2%

The initial rollout to canary involves about 2% of production traffic, allowing for careful monitoring before full deployment.
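A percentage such as 2% has to be mapped onto concrete servers somehow. One common approach, shown here as an assumption rather than Slack's method, is to hash each host name into a stable bucket from 0 to 99 and include the hosts whose bucket is below the target percentage. The host names and the `bucket` and `hosts_for_percent` helpers are hypothetical.

```python
import hashlib
from typing import Iterable, List


def bucket(host: str) -> int:
    """Map a host name to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def hosts_for_percent(hosts: Iterable[str], percent: int) -> List[str]:
    """Return the hosts whose bucket falls below the given percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return [h for h in hosts if bucket(h) < percent]
```

Because a host's bucket never changes, the set selected at 2% is always a subset of the set selected at any higher percentage, so widening the rollout only adds hosts. The selected share is approximate and converges on the target as the fleet grows.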

Technologies & Tools


Amazon EC2 (Infrastructure): Used for hosting Slack's application and managing deployment processes.

Consul (Infrastructure): Utilized for signaling servers to pull new builds concurrently during deployments.

Key Actionable Insights

1. Implement a structured deployment process that includes a designated deploy commander to oversee releases.
This manages deployment risk by making one person responsible for monitoring performance and coordinating the response to any issues that arise.

2. Adopt a percentage-based rollout strategy to gradually expose new features to users.
This lets teams detect and address issues before they affect a larger user base, which makes the deployment process more reliable overall.

3. Utilize atomic deploys to minimize downtime and errors during code updates.
By preparing new code in a cold directory and switching to it instantly, teams avoid the partial updates that lead to broken functionality.

Common Pitfalls

1. Relying solely on a push-based deployment model can lead to increased deployment times and potential errors.
As Slack scaled, the initial push-based model became inefficient, prompting the need for a pull-based system that allowed for faster and more reliable deployments.
