Building the Next Evolution of Cloud Networks at Slack – A Retrospective

About a year ago, I wrote a blog post called Building the Next Evolution of Cloud Networks at Slack. In it, we discussed how Slack’s AWS infrastructure has evolved over the years and the pain points that drove us to spin up a brand-new network architecture redesign project called Whitecastle. If you have not had…

Archie Gunasekara
14 min read · advanced

Overview

This article provides a retrospective on the evolution of cloud networks at Slack, focusing on the lessons learned and improvements made since the implementation of a new network architecture called Whitecastle. It discusses the challenges faced during the migration process and outlines future plans for enhancing the network infrastructure.

What You'll Learn

1. How to effectively manage proxy environment variables in AWS
2. Why monitoring AWS Transit Gateway metrics is crucial for cloud operations
3. How to implement a gradual migration strategy for cloud infrastructure
4. When to utilize multiple workload VPCs for better resource management

Prerequisites & Requirements

  • Understanding of AWS networking concepts
  • Experience with cloud infrastructure management (optional)

Key Questions Answered

What challenges did Slack face while migrating to the Whitecastle network?
Slack encountered several challenges during the migration to the Whitecastle network, including the complexity of configuring proxy environment variables and ensuring that applications honored these settings. This added complexity slowed down the migration process as teams had to adapt their applications to the new network architecture.
How does Slack monitor traffic through the AWS Transit Gateway?
Slack uses CloudWatch metrics such as PacketsIn, PacketsOut, PacketDropCountBlackhole, and PacketDropCountNoRoute to monitor traffic patterns at the attachment level. This visibility is critical for managing unpredictable traffic spikes during the migration process.
What improvements were made to the network architecture at Slack?
Improvements included the introduction of multiple workload VPCs to better manage resources, enhanced management of private Route53 zones, and the development of a Network Tester tool to validate connectivity between VPCs. These changes aimed to streamline operations and improve network security.
What is the purpose of the Whitecastle Network Tester?
The Whitecastle Network Tester is designed to validate routes between each VPC by starting a web server that responds to health checks. It ensures that necessary services can communicate while maintaining strict separation between different environments, enhancing network security.
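
The behaviour described for the Network Tester (a web server that answers health checks, plus a check that only the intended paths are reachable) can be sketched in a few lines. This is a minimal illustration and not Slack's implementation: the `/health` path, the helper names `start_health_server`, `check_reachable` and `validate`, and the idea of an "expected reachability" table are all assumptions made for the example.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 so a peer can confirm the route works."""

    def do_GET(self):
        if self.path == "/health":  # hypothetical health-check path
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_health_server(host="127.0.0.1", port=0):
    """Start the health server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Bypass any proxy settings so the check exercises the direct network path.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_reachable(url, timeout=2.0):
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with _direct.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def validate(expectations, timeout=2.0):
    """expectations maps URL -> whether it SHOULD be reachable.

    Returns the URLs whose actual reachability differs from the expectation,
    which covers both broken routes and routes that should not exist.
    """
    return [
        url
        for url, expected in expectations.items()
        if check_reachable(url, timeout) != expected
    ]
```

Run from an instance in each VPC, a table such as `{"http://<tester-in-same-env>/health": True, "http://<tester-in-other-env>/health": False}` would flag both missing routes and unintended ones; the hostnames here are placeholders.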

Technologies & Tools


  • AWS Transit Gateway (Networking): Used to connect multiple VPCs and manage traffic between them.
  • Squid Proxy (Networking): Implemented for managing internet access for services in private subnets.
  • Amazon CloudWatch (Monitoring): Utilized for monitoring traffic metrics through the Transit Gateway.
  • Terraform (Infrastructure as Code): Used for managing VPCs and routing configurations.

Key Actionable Insights

1. Implement a gradual migration strategy to minimize disruptions during network transitions.
   By allowing teams to migrate services incrementally rather than all at once, Slack reduced the risk of outages and made the process more manageable.
2. Utilize AWS Transit Gateway metrics to gain insights into traffic patterns and optimize performance.
   Monitoring these metrics helps identify potential bottlenecks and ensures that the network can handle varying loads effectively.
3. Develop tools like the Whitecastle Network Tester to automate network validation processes.
   Automating the validation of network paths reduces manual errors and enhances security by ensuring that only authorized communications occur between environments.
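
As a rough illustration of the Transit Gateway monitoring insight, the sketch below builds the query payload for CloudWatch's GetMetricData API for the four attachment-level metrics named earlier. The IDs `tgw-0123` and `tgw-attach-0456`, the helper names, the one-minute period and the `Sum` statistic are assumptions for the example; the source does not say how Slack queries or aggregates these metrics.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "AWS/TransitGateway"
METRICS = (
    "PacketsIn",
    "PacketsOut",
    "PacketDropCountBlackhole",
    "PacketDropCountNoRoute",
)


def build_attachment_queries(tgw_id, attachment_id, period_seconds=60):
    """Build one GetMetricData query per metric for a single TGW attachment."""
    dimensions = [
        {"Name": "TransitGateway", "Value": tgw_id},
        {"Name": "TransitGatewayAttachment", "Value": attachment_id},
    ]
    return [
        {
            # Query ids must start with a lowercase letter.
            "Id": name[0].lower() + name[1:],
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": name,
                    "Dimensions": dimensions,
                },
                "Period": period_seconds,
                "Stat": "Sum",  # assumed aggregation for packet counters
            },
            "ReturnData": True,
        }
        for name in METRICS
    ]


def last_hour():
    """Return (start, end) datetimes covering the most recent hour in UTC."""
    end = datetime.now(timezone.utc)
    return end - timedelta(hours=1), end


# With boto3 installed and AWS credentials configured, the payload could be
# sent like this (not executed here):
#   import boto3
#   start, end = last_hour()
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=build_attachment_queries("tgw-0123", "tgw-attach-0456"),
#       StartTime=start,
#       EndTime=end,
#   )
```

A non-zero sum for either `PacketDropCount` metric on an attachment is a natural thing to alert on, since it suggests traffic hitting a blackhole route or no route at all.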

Common Pitfalls

1. Failing to properly configure proxy environment variables can lead to connectivity issues for applications.
   This happens because different programming languages handle proxy settings differently, which can cause confusion and slow down the migration process.
2. Overloading the Transit Gateway with unpredictable traffic can lead to performance degradation.
   This occurs when services are migrated without adequate planning, resulting in spikes that the Transit Gateway may struggle to handle.
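
One common way to reduce the first pitfall is to set every proxy variable in both lowercase and uppercase, since tools and language runtimes differ in which spelling they read. The sketch below illustrates that convention only; the proxy URL `http://proxy.example.internal:3128`, the bypass hosts and both helper names are hypothetical, and the source does not describe how Slack distributes these variables.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_env(proxy_url, no_proxy_hosts):
    """Return proxy variables in both lowercase and uppercase spellings."""
    values = {
        "http_proxy": proxy_url,
        "https_proxy": proxy_url,
        "no_proxy": ",".join(no_proxy_hosts),
    }
    env = {}
    for name, value in values.items():
        env[name] = value
        env[name.upper()] = value
    return env


def inconsistent_vars(environ=None):
    """List variables whose lowercase and uppercase spellings disagree.

    A variable set in only one spelling is also reported, because a client
    that reads the other spelling would silently skip the proxy.
    """
    environ = os.environ if environ is None else environ
    return [
        name
        for name in PROXY_VARS
        if environ.get(name) != environ.get(name.upper())
    ]
```

A service wrapper could call `os.environ.update(proxy_env(...))` before launching the application, and a pre-deploy check could fail when `inconsistent_vars()` returns anything. Whether a given application honours these variables at all still has to be verified per language and HTTP client.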

Related Concepts

AWS Networking
Cloud Infrastructure Management
Network Security Best Practices