Building the Next Evolution of Cloud Networks at Slack

Archie Gunasekara

At Slack, we’ve gone through an evolution of our AWS infrastructure from the early days of running a few hand-built EC2 instances, all the way to provisioning thousands of EC2 instances across multiple AWS regions, using the latest AWS services to build reliable and scalable infrastructure. One of the pain points inherited from the early…

Slack

12 min read • Intermediate

AWS, Chef, DynamoDB, Grafana, Python, Terraform, TypeScript

Overview

The article discusses Slack's evolution of cloud networking, detailing the redesign of their AWS infrastructure through a project named Whitecastle. It highlights the challenges faced with scaling and managing multiple AWS accounts, and how the implementation of shared VPCs and Transit Gateways improved their network architecture.

What You'll Learn

1. How to implement AWS shared VPCs for better resource management
2. Why using Transit Gateway Inter-Region Peering enhances connectivity
3. How to automate network testing with a custom application

Prerequisites & Requirements

Understanding of AWS networking concepts
Familiarity with Terraform for infrastructure management

Key Questions Answered

What challenges did Slack face with their AWS infrastructure?

Slack faced issues with AWS rate-limiting, cost-separation, and confusion due to having all infrastructure in a single AWS account. This led to the introduction of child accounts to manage resources more effectively, but it also created complexities with CIDR ranges and IP management.

How did Slack simplify their network management?

Slack simplified their network management by implementing AWS shared VPCs, allowing multiple AWS accounts to share VPCs and subnets. This reduced the administrative overhead of managing multiple accounts and helped avoid AWS rate limits.

What is the purpose of the Whitecastle Network Tester?

The Whitecastle Network Tester is a Go application that performs real-time network testing across Slack's VPCs. It checks connectivity between services and reports results to CloudWatch, helping to ensure network reliability.
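The kind of check described above can be sketched in a few lines of Go. This is only an illustration, not Slack's implementation: the `probe` and `localTarget` functions and the `checkResult` type are hypothetical names, the check is a plain TCP dial, and results are printed rather than sent to CloudWatch.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkResult records the outcome of one connectivity probe.
// (Hypothetical type; not taken from the Whitecastle Network Tester.)
type checkResult struct {
	Target    string
	Reachable bool
	Latency   time.Duration
	Err       error
}

// probe attempts a TCP connection to target ("host:port") within timeout
// and reports whether the connection could be established.
func probe(target string, timeout time.Duration) checkResult {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", target, timeout)
	if err != nil {
		return checkResult{Target: target, Err: err, Latency: time.Since(start)}
	}
	conn.Close()
	return checkResult{Target: target, Reachable: true, Latency: time.Since(start)}
}

// localTarget starts a listener on a free loopback port so the example has
// something reachable to probe. It returns the address and a stop function.
func localTarget() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := localTarget()
	if err != nil {
		panic(err)
	}
	defer stop()

	// In a real deployment the targets would be peers in other VPCs or
	// regions; here the only target is the local listener.
	for _, target := range []string{addr} {
		res := probe(target, 2*time.Second)
		fmt.Printf("target=%s reachable=%t latency=%s err=%v\n",
			res.Target, res.Reachable, res.Latency, res.Err)
	}
}
```

A production tester would run these probes on a schedule and publish each result as a metric instead of printing it.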

How does Slack handle inter-region connectivity?

Slack utilizes AWS Transit Gateway Inter-Region Peering to establish connectivity between regions. This involves creating Transit Gateways in each region and peering them to facilitate communication for customer-facing workloads.

Key Statistics & Figures

CIDR range capacity

Over 130,000 IP addresses per region

This capacity is provided by using two /16 CIDR ranges in each VPC, ensuring ample space for current and future needs.
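The figure follows from simple arithmetic: a /16 block covers 2^(32-16) = 65,536 addresses, so two of them cover 131,072. A minimal Go sketch of that calculation is below. The CIDR ranges are placeholders chosen for illustration (the summary does not list the real ones), and the count ignores the handful of addresses AWS reserves in every subnet.

```go
package main

import (
	"fmt"
	"net/netip"
)

// addressCount returns the total number of IPv4 addresses covered by a CIDR
// block, for example 65,536 for a /16. It does not subtract the addresses
// that AWS reserves in each subnet.
func addressCount(cidr string) (uint64, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return 0, err
	}
	if !prefix.Addr().Is4() {
		return 0, fmt.Errorf("%s is not an IPv4 prefix", cidr)
	}
	return uint64(1) << (32 - prefix.Bits()), nil
}

// totalAddresses sums the address counts of several CIDR blocks.
func totalAddresses(cidrs []string) (uint64, error) {
	var total uint64
	for _, c := range cidrs {
		n, err := addressCount(c)
		if err != nil {
			return 0, err
		}
		total += n
	}
	return total, nil
}

func main() {
	// Placeholder ranges for one region's VPC; assumed, not from the article.
	vpcRanges := []string{"10.0.0.0/16", "10.1.0.0/16"}
	total, err := totalAddresses(vpcRanges)
	if err != nil {
		panic(err)
	}
	fmt.Println(total) // prints 131072
}
```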

Technologies & Tools


AWS (cloud infrastructure): used for building and managing Slack's scalable cloud network.

Terraform (infrastructure as code): used to automate the management of AWS resources and network configurations.

Go (programming language): used to develop the Whitecastle Network Tester application for network monitoring.

Key Actionable Insights

1. Implement AWS shared VPCs to streamline network management across multiple accounts.
This approach allows for better resource allocation and reduces the complexity of managing separate VPCs for each account, which is crucial for scaling operations.

2. Utilize Transit Gateway Inter-Region Peering to enhance service communication across AWS regions.
This method ensures that services in different regions can communicate effectively, which is essential for maintaining performance and reliability in a distributed architecture.

3. Adopt real-time network testing to proactively identify connectivity issues.
By implementing a network testing application, teams can monitor the health of their network and address potential problems before they impact users.

Common Pitfalls

1. Failing to manage CIDR ranges can lead to overlapping IP spaces and connectivity issues.
Overlapping ranges complicate VPC peering and create significant administrative overhead. Planning and tracking IP space up front is essential to avoid this.
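One way to catch this early is to check planned ranges for overlap before any VPC is created. The Go sketch below uses the standard library's `net/netip` package. The `findOverlaps` function and the sample ranges are hypothetical and are not taken from Slack's tooling.

```go
package main

import (
	"fmt"
	"net/netip"
)

// findOverlaps returns every pair of CIDR blocks that share at least one
// address. Pairs are reported in input order.
func findOverlaps(cidrs []string) ([][2]string, error) {
	prefixes := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", c, err)
		}
		// Masked normalizes the prefix by zeroing the host bits.
		prefixes = append(prefixes, p.Masked())
	}

	var overlaps [][2]string
	for i := 0; i < len(prefixes); i++ {
		for j := i + 1; j < len(prefixes); j++ {
			if prefixes[i].Overlaps(prefixes[j]) {
				overlaps = append(overlaps, [2]string{cidrs[i], cidrs[j]})
			}
		}
	}
	return overlaps, nil
}

func main() {
	// Placeholder ranges, as if three accounts had each picked their own.
	planned := []string{"10.0.0.0/16", "10.1.0.0/16", "10.1.128.0/20"}

	overlaps, err := findOverlaps(planned)
	if err != nil {
		panic(err)
	}
	for _, pair := range overlaps {
		fmt.Printf("%s overlaps %s\n", pair[0], pair[1])
	}
	// prints: 10.1.0.0/16 overlaps 10.1.128.0/20
}
```

A check like this could run in CI against the list of planned ranges, so a conflicting allocation is rejected before it reaches a peering request.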

Related Concepts

AWS Networking Concepts

Infrastructure As Code With Terraform

Network Reliability And Monitoring
