Migrating Millions of Concurrent Websockets to Envoy

Slack has a global customer base, with millions of simultaneously connected users at peak times. Most of the communication between users involves sending lots of tiny messages to each other. For much of Slack’s history, we’ve used HAProxy as a load balancer for all incoming traffic. Today, we’ll talk about problems we faced with HAProxy,…

Ariane van der Steldt
14 min read, intermediate

Overview

This article discusses Slack's migration of millions of concurrent WebSocket connections from HAProxy to Envoy Proxy. It outlines the challenges faced with HAProxy, the motivations for switching to Envoy, the migration process, and the outcomes achieved.

What You'll Learn

1. How to use hot restarts in Envoy Proxy to avoid dropping connections

2. Why you might choose Envoy Proxy over HAProxy for WebSocket connections

3. How to manage Envoy configurations using Chef

4. When to apply weighted DNS routing during a migration

Prerequisites & Requirements

  • Understanding of WebSocket connections and load balancing concepts
  • Familiarity with Envoy Proxy and HAProxy (optional)
  • Experience with configuration management tools like Chef

Key Questions Answered

What were the main challenges with HAProxy that motivated the migration?
The main challenges included operational overhead due to the need for frequent HAProxy reloads, which could disrupt long-lived WebSocket connections. Additionally, managing multiple HAProxy processes with different configurations added complexity to the system.
How did Slack ensure a smooth migration to Envoy Proxy?
Slack built a new Envoy WebSocket stack with equivalent configurations to HAProxy, allowing for a gradual traffic shift. They used weighted routing for DNS to control the migration process, ensuring they could quickly revert to HAProxy if necessary.
What benefits does Envoy Proxy provide over HAProxy?
Envoy Proxy supports dynamic configuration, so many changes can be applied without reloading the proxy, and hot restarts that avoid dropping existing connections. It also offers advanced load balancing features and lower operational overhead than HAProxy, which requires more manual management.
What testing strategies were used for the new Envoy configuration?
Testing involved prototyping with hand-coded Envoy configurations, validating routing with curl, and using Envoy's debug logging for troubleshooting. They also ran configurations in validation mode to prevent invalid setups from being deployed.
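
The validation step described above can be sketched as a small deployment gate. This is a minimal illustration, not Slack's actual tooling: it assumes an `envoy` binary on the PATH and uses Envoy's `--mode validate` option together with `-c` to check a configuration file without starting the proxy. The function names and the injectable `run` parameter are hypothetical and exist only to make the sketch testable.

```python
import subprocess
from typing import Callable, List


def validate_command(config_path: str, envoy_binary: str = "envoy") -> List[str]:
    """Build the command that asks Envoy to validate a config and exit."""
    return [envoy_binary, "--mode", "validate", "-c", config_path]


def config_is_valid(config_path: str, run: Callable = subprocess.run) -> bool:
    """Return True only if Envoy accepts the config (exit code 0).

    `run` is injectable (an assumption made for this sketch) so the gate
    can be exercised without a real Envoy binary.
    """
    result = run(validate_command(config_path), capture_output=True, text=True)
    return result.returncode == 0
```

A deployment script could call `config_is_valid` on a freshly rendered file and keep the previous configuration in place whenever it returns False.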

Key Statistics & Figures

Number of concurrent WebSocket connections migrated
Millions
This migration was necessary to handle peak loads effectively without impacting user experience.

Technologies & Tools


Backend
Envoy Proxy
Used for load balancing and managing WebSocket connections.
Backend
HAProxy
The load balancer used before the migration to Envoy.
Tools
Chef
Used for configuration management of Envoy settings.

Key Actionable Insights

1. Implement hot restarts in Envoy to maintain connection stability during configuration changes.
   This is crucial for applications with long-lived connections, such as WebSockets, where dropping connections can lead to poor user experiences.
2. Utilize weighted routing for DNS to manage gradual traffic shifts during migrations.
   This approach minimizes risk by allowing for controlled rollouts and quick rollbacks if issues arise, ensuring a smooth transition for users.
3. Leverage Chef for managing Envoy configurations to streamline deployment processes.
   Building custom Chef resources can simplify the management of complex Envoy configurations, reducing errors and improving deployment speed.
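
To make the weighted-routing insight concrete, here is a small simulation of how shifting DNS weights moves traffic between the two stacks. It is a sketch under stated assumptions: the record names `haproxy` and `envoy`, the percentage steps, and both function names are hypothetical, and real weighted routing is configured at the DNS provider rather than in application code.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List


def rollout_steps(envoy_percentages: Iterable[int]) -> List[Dict[str, int]]:
    """Turn a list of Envoy percentages into per-step DNS weight maps."""
    steps = []
    for pct in envoy_percentages:
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
        steps.append({"haproxy": 100 - pct, "envoy": pct})
    return steps


def simulate_lookups(weights: Dict[str, int], lookups: int,
                     rng: random.Random) -> Counter:
    """Simulate DNS answers chosen in proportion to the record weights."""
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=lookups)
    return Counter(picks)
```

Rolling back is the same operation in reverse: setting the Envoy weight back to 0 sends all new lookups to HAProxy again, which is what makes this approach attractive for a cautious migration.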

Common Pitfalls

1. Failing to account for differences in timeout values and headers during migration can lead to service disruptions.
   Such discrepancies can cause unexpected behavior in applications relying on specific configurations, highlighting the importance of thorough testing and validation.
2. Neglecting to establish a testing framework for load balancer configurations can complicate migrations.
   Without automated tests, teams may miss critical behaviors that applications depend on, leading to potential outages or degraded performance.
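
One lightweight way to guard against both pitfalls is an automated check that compares the behavior expected from the old and new load balancers. The sketch below is illustrative only: the setting names (`idle_timeout_s`, `adds_x_forwarded_for`) and their values are hypothetical, and in practice they would be extracted from the real HAProxy and Envoy configurations or from observed responses.

```python
from typing import Any, Dict, Tuple


def behavior_mismatches(old: Dict[str, Any],
                        new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every setting whose value differs between the two stacks.

    A setting missing on one side is reported with None for that side.
    """
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical example values, not taken from any real configuration.
haproxy_behavior = {"idle_timeout_s": 3600, "adds_x_forwarded_for": True}
envoy_behavior = {"idle_timeout_s": 300, "adds_x_forwarded_for": True}

mismatches = behavior_mismatches(haproxy_behavior, envoy_behavior)
```

Running such a comparison in CI turns a silent difference, such as a shorter idle timeout, into a visible failure before any traffic is shifted.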

Related Concepts

Load Balancing Techniques
WebSocket Communication
Configuration Management with Chef
DNS Management During Migrations