Building the Next Evolution of Cloud Networks at Slack

At Slack, our AWS infrastructure has evolved from the early days of running a few hand-built EC2 instances to provisioning thousands of EC2 instances across multiple AWS regions, using the latest AWS services to build reliable and scalable infrastructure. One of the pain points inherited from the early…

Archie Gunasekara
12 min read, intermediate

Overview

The article discusses Slack's evolution of cloud networking, detailing the redesign of their AWS infrastructure through a project named Whitecastle. It highlights the challenges faced with scaling and managing multiple AWS accounts, and how the implementation of shared VPCs and Transit Gateways improved their network architecture.

What You'll Learn

1. How to implement AWS shared VPCs for better resource management
2. Why using Transit Gateway Inter-Region Peering enhances connectivity
3. How to automate network testing with a custom application

Prerequisites & Requirements

  • Understanding of AWS networking concepts
  • Familiarity with Terraform for infrastructure management

Key Questions Answered

What challenges did Slack face with their AWS infrastructure?
With all of its infrastructure in a single AWS account, Slack ran into AWS rate limiting, difficulty separating costs, and general confusion. Introducing child accounts made resources easier to manage, but it also complicated CIDR range and IP address management.
How did Slack simplify their network management?
Slack simplified their network management by implementing AWS shared VPCs, allowing multiple AWS accounts to share VPCs and subnets. This reduced the administrative overhead of managing multiple accounts and helped avoid AWS rate limits.
What is the purpose of the Whitecastle Network Tester?
The Whitecastle Network Tester is a Go application that performs real-time network testing across Slack's VPCs. It checks connectivity between services and reports results to CloudWatch, helping to ensure network reliability.
How does Slack handle inter-region connectivity?
Slack utilizes AWS Transit Gateway Inter-Region Peering to establish connectivity between regions. This involves creating Transit Gateways in each region and peering them to facilitate communication for customer-facing workloads.
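
The kind of connectivity check the Whitecastle Network Tester performs might look roughly like the sketch below. This is not Slack's code: the `result` type, the `checkTCP` function, the `defaultTimeout` constant, and the use of a plain TCP dial are all assumptions made for illustration. Reporting to CloudWatch is left out, and the result is simply printed.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// defaultTimeout is a hypothetical per-check timeout.
const defaultTimeout = 2 * time.Second

// result records the outcome of one connectivity check (hypothetical type).
type result struct {
	Target  string
	OK      bool
	Latency time.Duration
}

// checkTCP tries to open a TCP connection to target ("host:port") and
// reports whether it succeeded and how long the attempt took.
func checkTCP(target string, timeout time.Duration) result {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", target, timeout)
	if err != nil {
		return result{Target: target, OK: false, Latency: time.Since(start)}
	}
	conn.Close()
	return result{Target: target, OK: true, Latency: time.Since(start)}
}

func main() {
	// A local listener stands in for a service in another VPC.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	r := checkTCP(ln.Addr().String(), defaultTimeout)
	// A real tester would publish this as a metric instead of printing it.
	fmt.Printf("target=%s ok=%t latency=%s\n", r.Target, r.OK, r.Latency)
}
```

In a real deployment the targets would be service endpoints in other VPCs or regions, and each result would be emitted as a metric so that failures can trigger alerts.
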

Key Statistics & Figures

CIDR range capacity
Over 130,000 IP addresses per region
This capacity is provided by using two /16 CIDR ranges in each VPC, ensuring ample space for current and future needs.
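
The arithmetic behind this figure can be checked with a short sketch. The two CIDR blocks below are placeholders, not Slack's actual ranges, and `totalAddresses` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/netip"
)

// totalAddresses sums the number of IPv4 addresses covered by the given
// CIDR blocks. It counts raw addresses, not usable host addresses.
func totalAddresses(cidrs []string) (int, error) {
	total := 0
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", c, err)
		}
		if !p.Addr().Is4() {
			return 0, fmt.Errorf("%q is not an IPv4 prefix", c)
		}
		// A /16 leaves 32-16 = 16 host bits, so 2^16 = 65,536 addresses.
		total += 1 << (32 - p.Bits())
	}
	return total, nil
}

func main() {
	// Placeholder ranges; the real ranges are not given in the article summary.
	total, err := totalAddresses([]string{"10.0.0.0/16", "10.1.0.0/16"})
	if err != nil {
		panic(err)
	}
	fmt.Println(total) // prints 131072
}
```

Two /16 blocks give 2 × 65,536 = 131,072 addresses, which matches the "over 130,000" figure. The usable count is somewhat lower, because AWS reserves five addresses in every subnet.
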

Key Actionable Insights

1. Implement AWS shared VPCs to streamline network management across multiple accounts.
   This approach allows for better resource allocation and reduces the complexity of managing separate VPCs for each account, which is crucial for scaling operations.
2. Utilize Transit Gateway Inter-Region Peering to enhance service communication across AWS regions.
   This method ensures that services in different regions can communicate effectively, which is essential for maintaining performance and reliability in a distributed architecture.
3. Adopt real-time network testing to proactively identify connectivity issues.
   By implementing a network testing application, teams can monitor the health of their network and address potential problems before they impact users.

Common Pitfalls

1. Failing to manage CIDR ranges can lead to overlapping IP spaces and connectivity issues.
   This can complicate VPC peering and result in significant administrative overhead. Proper planning and management of IP spaces are essential to avoid these pitfalls.
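
One way to catch overlapping ranges before they cause trouble is to check every pair of VPC CIDRs in a planning script. The sketch below assumes a hypothetical `vpc` type and `findOverlaps` helper, and the account names and ranges are made up. It relies on `Prefix.Overlaps` from Go's standard `net/netip` package.

```go
package main

import (
	"fmt"
	"net/netip"
)

// vpc pairs a name with its CIDR block (hypothetical type for illustration).
type vpc struct {
	Name string
	CIDR string
}

// findOverlaps returns every pair of VPC names whose CIDR blocks share
// at least one IP address.
func findOverlaps(vpcs []vpc) ([][2]string, error) {
	prefixes := make([]netip.Prefix, len(vpcs))
	for i, v := range vpcs {
		p, err := netip.ParsePrefix(v.CIDR)
		if err != nil {
			return nil, fmt.Errorf("vpc %s: %w", v.Name, err)
		}
		prefixes[i] = p
	}
	var pairs [][2]string
	for i := 0; i < len(vpcs); i++ {
		for j := i + 1; j < len(vpcs); j++ {
			if prefixes[i].Overlaps(prefixes[j]) {
				pairs = append(pairs, [2]string{vpcs[i].Name, vpcs[j].Name})
			}
		}
	}
	return pairs, nil
}

func main() {
	// Made-up accounts and ranges; "billing" sits inside "core".
	vpcs := []vpc{
		{Name: "core", CIDR: "10.0.0.0/16"},
		{Name: "billing", CIDR: "10.0.128.0/20"},
		{Name: "search", CIDR: "10.1.0.0/16"},
	}
	pairs, err := findOverlaps(vpcs)
	if err != nil {
		panic(err)
	}
	fmt.Println(pairs) // prints [[core billing]]
}
```

The pairwise loop is quadratic in the number of VPCs, which is fine for the tens or hundreds of ranges a typical organization manages.
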

Related Concepts

AWS Networking Concepts
Infrastructure as Code with Terraform
Network Reliability and Monitoring