Traffic 101: Packets Mostly Flow

Pramila Singh

Slack handles billions of inbound network requests per day, all of which traverse our edge network and ingress load balancing tiers. In this blog post, we’ll talk about how a request flows, from a Slack user’s perspective, across the vast ether of the network to reach AWS and then Slack’s internal…

Slack

10 min read • Intermediate

AWS, CDN, Chef, Envoy, HTTPS, WebSocket

Overview

The article 'Traffic 101: Packets Mostly Flow' provides an in-depth look at how Slack processes billions of network requests daily through its edge network and AWS infrastructure. It explains the flow of packets from user requests to Slack's internal services, detailing the mechanisms behind DNS resolution, WebSocket traffic, and API traffic management.

What You'll Learn

1. How to manage WebSocket connections effectively in a distributed system
2. Why DNS resolution is critical for reducing latency in network requests
3. When to implement regional failovers to maintain service availability
4. How to utilize a CDN for improved performance in web applications

Prerequisites & Requirements

Basic understanding of networking concepts and DNS resolution
Familiarity with AWS services, particularly Route53 and CloudFront (optional)

Key Questions Answered

How does Slack handle billions of network requests daily?

Slack manages billions of network requests through a globally distributed edge network and AWS infrastructure. Requests enter through edge Points of Presence (PoPs), which reduce latency by terminating connections close to the user, and are then routed to Slack's core services in the AWS us-east-1 region.

What is the role of DNS in Slack's network architecture?

DNS plays a crucial role in Slack's network architecture by resolving user requests to the nearest edge PoP. Slack uses Amazon Route53 as its authoritative name server, which helps route traffic based on the user's location, thereby optimizing latency and performance.
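As a rough illustration of location-based routing, the sketch below picks the closest edge PoP for a client by great-circle distance. The PoP names and coordinates are hypothetical, and a DNS service such as Route53 makes this kind of decision through its own routing policies rather than through application code like this.

```python
import math

# Hypothetical edge PoPs with (latitude, longitude); not Slack's real locations.
EDGE_POPS = {
    "pop-virginia": (38.9, -77.4),
    "pop-frankfurt": (50.1, 8.7),
    "pop-tokyo": (35.7, 139.7),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_pop(client_location, pops=EDGE_POPS):
    """Return the name of the PoP closest to the client's (lat, lon)."""
    return min(pops, key=lambda name: haversine_km(client_location, pops[name]))
```

Geographic distance is only a proxy for network latency, so a real deployment would more likely rely on measured latency or the DNS provider's built-in policies.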

What happens if the primary WebSocket connection fails?

If the primary WebSocket connection at wss-primary.slack.com fails, Slack clients automatically attempt to connect to wss-backup.slack.com. This failover mechanism is designed to maintain service continuity and minimize user disruption during outages.
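The client-side failover described above can be sketched roughly as follows. The two hostnames come from the article; the wss:// URL form and the caller-supplied `connect` function are assumptions made for illustration, not Slack's client code.

```python
from typing import Callable, Sequence, TypeVar

Conn = TypeVar("Conn")

# Hostnames are from the article; the wss:// scheme prefix is an assumption.
WSS_ENDPOINTS = ("wss://wss-primary.slack.com", "wss://wss-backup.slack.com")


def connect_with_failover(
    connect: Callable[[str], Conn],
    endpoints: Sequence[str] = WSS_ENDPOINTS,
) -> Conn:
    """Try each endpoint in order and return the first successful connection.

    `connect` is a hypothetical caller-supplied function that opens a
    WebSocket to the given URL and raises OSError on failure.
    """
    last_error = None
    for url in endpoints:
        try:
            return connect(url)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionError(f"all endpoints failed: {list(endpoints)}") from last_error
```

Passing the connection function in as a parameter keeps the sketch independent of any particular WebSocket library and makes the failover order easy to test.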

How does Slack ensure API traffic is efficiently managed?

Slack manages API traffic through its envoy-edge service, which routes requests based on the originating DNS domain. This service ensures that API requests are directed to the nearest Network Load Balancer (NLB), optimizing response times and resource utilization.
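Routing by the requested domain can be pictured as a lookup from hostname to upstream cluster. This is a minimal sketch, not Envoy configuration: only the two wss hostnames appear in the article, and every other hostname and all cluster names below are made up for illustration.

```python
# Hypothetical mapping from request hostname to an upstream cluster name.
# Only wss-primary.slack.com and wss-backup.slack.com appear in the article;
# the other hostname and all cluster names are invented for this sketch.
ROUTES = {
    "wss-primary.slack.com": "websocket-cluster",
    "wss-backup.slack.com": "websocket-cluster",
    "api.example.com": "api-cluster",
}
DEFAULT_CLUSTER = "web-cluster"


def pick_cluster(host_header: str) -> str:
    """Choose an upstream cluster from the request's Host header."""
    # Drop any port and normalise case, since hostnames are case-insensitive.
    # IPv6 literals are not handled in this sketch.
    hostname = host_header.split(":", 1)[0].strip().lower()
    return ROUTES.get(hostname, DEFAULT_CLUSTER)
```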

Technologies & Tools

Cloud Infrastructure

AWS

Used for hosting Slack's core services and managing network traffic through various services like Route53 and CloudFront.

Protocol

WebSocket

Used for real-time communication between Slack clients and servers.

DNS Service

Amazon Route53

Acts as the authoritative name server for Slack's domains, facilitating efficient DNS resolution.

CDN

CloudFront

Serves static assets and improves performance by caching content closer to users.

Key Actionable Insights

1. Implement a robust DNS resolution strategy to optimize network request handling.
By leveraging services like Amazon Route53, you can ensure that user requests are routed to the nearest edge location, reducing latency and improving overall application performance.

2. Utilize WebSocket connections for real-time communication in applications.
WebSocket connections allow for persistent communication channels, which are essential for applications that require instant data updates, such as chat applications or live notifications.

3. Plan for regional failovers to enhance service reliability.
By implementing automated regional failover strategies, you can minimize downtime and maintain service availability during infrastructure issues, ensuring a seamless user experience.
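One way to automate regional failover is to drive it from health checks: a region is treated as unhealthy after several consecutive failed checks, and traffic goes to the first healthy region in preference order. The sketch below assumes that model; the region names and the threshold of three failures are placeholders, not values from the article.

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    """Tracks consecutive failed health checks for one region."""
    name: str
    failure_threshold: int = 3  # assumed value; tune for your environment
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> None:
        """Reset the counter on success, increment it on failure."""
        self.consecutive_failures = 0 if check_passed else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def active_region(regions):
    """Return the first healthy region in preference order, or None."""
    return next((r.name for r in regions if r.healthy), None)
```

Requiring several consecutive failures before failing over avoids shifting traffic on a single dropped health check, at the cost of a slightly slower reaction to a real outage.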

Common Pitfalls

1. Failing to implement a proper failover strategy can lead to significant downtime during outages.

Without a failover mechanism, users may experience service interruptions, leading to frustration and potential loss of engagement. It's crucial to have backup systems in place to redirect traffic during failures.

Related Concepts

Networking and Infrastructure Management

DNS Resolution Techniques

WebSocket Protocol Usage

Content Delivery Networks (CDNs)
