Overview
The article discusses the evolution of Pushy, Netflix's WebSocket server, which has scaled to handle hundreds of millions of concurrent connections while maintaining a 99.999% message delivery reliability. It details the motivations behind its development, the technical enhancements made to support growth, and future directions for the platform.
What You'll Learn
1. How to scale a WebSocket server to handle millions of concurrent connections
2. Why using direct push improves message delivery feedback
3. How to implement device to device messaging using WebSockets
4. When to use caching to reduce latency in message delivery
Prerequisites & Requirements
- Understanding of WebSocket protocols and message delivery systems
- Familiarity with Spring Boot and Netty (optional)
Key Questions Answered
How does Pushy handle hundreds of millions of concurrent WebSocket connections?
Pushy has evolved to handle hundreds of millions of concurrent connections by revisiting its design decisions and implementing features like automatic horizontal scaling, improved message processing, and enhanced connection management. This allows it to maintain high availability and a consistent message delivery rate.
What improvements have been made to Pushy's message processor?
The message processor was rewritten as a standalone Spring Boot service, allowing for automatic horizontal scaling, canary deployments, and better observability. This change has made the message processing more flexible and reliable, reducing the need for manual intervention during updates.
What is the significance of direct push in Pushy's messaging system?
Direct push allows backend services to send messages directly to Pushy without going through an asynchronous queue, providing immediate feedback on message delivery success. This has become increasingly important as the needs of backend services have evolved.
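The difference between the queued path and direct push can be illustrated with a minimal sketch. This is not Pushy's implementation: `PushServer`, `PushResult`, `enqueue`, and `direct_push` are hypothetical names, and the in-memory dictionary stands in for real WebSocket connections. The point is only that the direct call returns a delivery outcome to the caller, while the queued call does not.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Tuple


@dataclass
class PushResult:
    """Outcome reported back to the calling backend service (hypothetical type)."""
    delivered: bool
    reason: str = ""


@dataclass
class PushServer:
    """Toy stand-in for a push server; all names here are assumptions."""
    # device_id -> function that writes a payload to that device's open connection
    connections: Dict[str, Callable[[str], None]] = field(default_factory=dict)
    # messages waiting for asynchronous processing
    queue: Deque[Tuple[str, str]] = field(default_factory=deque)

    def enqueue(self, device_id: str, payload: str) -> None:
        """Asynchronous path: the caller learns nothing about delivery."""
        self.queue.append((device_id, payload))

    def direct_push(self, device_id: str, payload: str) -> PushResult:
        """Synchronous path: the caller gets immediate success or failure."""
        send = self.connections.get(device_id)
        if send is None:
            return PushResult(False, "device not connected")
        try:
            send(payload)
        except ConnectionError as exc:
            return PushResult(False, f"send failed: {exc}")
        return PushResult(True)
```

A backend service calling `direct_push` can react to a failed result right away, for example by retrying or falling back to another channel, which is not possible when the message is only placed on a queue.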
How does Pushy ensure message delivery reliability?
Pushy maintains a 99.999% message delivery reliability rate by implementing robust connection management, additional heartbeats, and idle connection cleanup. These measures help reduce stale connections and improve overall message delivery performance.
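Heartbeats and idle connection cleanup can be sketched as a registry that records when each device was last heard from and periodically drops entries that have been silent too long. The class name, method names, and the 60-second default below are assumptions chosen for illustration, not values from the article.

```python
import time
from typing import Dict, List, Optional


class ConnectionRegistry:
    """Tracks last-heartbeat times per device (hypothetical helper)."""

    def __init__(self, idle_timeout_s: float = 60.0) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._last_seen: Dict[str, float] = {}

    def on_heartbeat(self, device_id: str, now: Optional[float] = None) -> None:
        """Record that the device is alive; `now` is injectable for testing."""
        self._last_seen[device_id] = time.monotonic() if now is None else now

    def is_connected(self, device_id: str) -> bool:
        return device_id in self._last_seen

    def reap_idle(self, now: Optional[float] = None) -> List[str]:
        """Remove and return devices silent for longer than the idle timeout."""
        now = time.monotonic() if now is None else now
        stale = [
            device_id
            for device_id, seen in self._last_seen.items()
            if now - seen > self.idle_timeout_s
        ]
        for device_id in stale:
            # A real server would also close the underlying socket here.
            del self._last_seen[device_id]
        return stale
```

Running `reap_idle` on a timer keeps the registry from accumulating connections whose devices have silently gone away, so messages are not routed to sockets that can no longer deliver them.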
Key Statistics & Figures
Concurrent WebSocket connections
hundreds of millions
Pushy has scaled to handle hundreds of millions of concurrent WebSocket connections.
Message delivery reliability
99.999%
Pushy has maintained a message delivery reliability rate of 99.999% over recent months.
Messages sent per second
300,000
Pushy regularly reaches 300,000 messages sent per second.
Direct messages sent per second
160,000
In a recent 24-hour period, direct messages averaged around 160,000 messages per second.
Device-to-device messages per second
1,000
Pushy currently averages 1,000 device-to-device messages per second.
Technologies & Tools
Protocol
WebSocket
Used for maintaining persistent connections between devices and the server.
Backend Framework
Spring Boot
Used to rewrite the message processor for better scalability and reliability.
Networking Framework
Netty
Used for handling incoming WebSocket messages in Pushy.
Messaging System
Kafka
Used for sending device connection events to track devices for messaging.
Database
Dynomite
Previously used for managing device connection metadata before migrating to KeyValue.
Database
KeyValue
Current storage solution for device connection metadata, providing low latency and scalability.
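Lookups of device connection metadata are a natural candidate for caching. The sketch below shows one simple approach, a read-through cache with a time-to-live, under the assumption that the metadata maps a device ID to the node holding its connection. `TtlCache`, the `loader` callback, and the 5-second default are hypothetical and not taken from the article.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Read-through cache with per-entry expiry (illustrative only)."""

    def __init__(self, loader: Callable[[str], Optional[str]], ttl_s: float = 5.0) -> None:
        self._loader = loader          # fetches from the backing store on a miss
        self._ttl_s = ttl_s
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}
        self.misses = 0                # counts trips to the backing store

    def get(self, device_id: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(device_id)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]            # fresh entry: no store lookup needed
        self.misses += 1
        value = self._loader(device_id)
        self._entries[device_id] = (now, value)
        return value
```

The trade-off is staleness: a device that reconnects to a different node within the TTL window will be looked up at its old location until the entry expires, so the TTL should be short relative to how often devices move.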
Key Actionable Insights
1. Implementing direct push can significantly enhance the responsiveness of your messaging system.
   By allowing backend services to send messages directly to connected devices, you can reduce latency and improve user experience, especially for interactive applications.
2. Utilizing caching strategies can drastically lower message delivery latency.
   By caching frequently accessed data, such as device connection information, you can minimize lookups and speed up message processing times, leading to a more efficient system.
3. Investing in observability tools is crucial for maintaining high reliability in messaging systems.
   Metrics around message delivery rates and connection management can help identify issues early and optimize performance, ensuring a stable and reliable service.
4. Consider the trade-offs of increasing connection limits per node in your WebSocket architecture.
   While increasing connection limits can reduce the number of instances needed, it also raises the risk of a 'thundering herd' problem during node failures, which can overwhelm your system.
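One common client-side mitigation for the thundering herd risk in the last insight is exponential backoff with random jitter on reconnect. The article summary does not say whether Pushy's clients do this, so the sketch below is a general technique rather than a description of Pushy. The function names and the default base and cap values are assumptions.

```python
import random
from typing import Dict, Optional


def reconnect_delay(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized delay in seconds before reconnect attempt `attempt`.

    The upper bound doubles with each attempt up to `cap_s`; the actual delay
    is drawn uniformly between 0 and that bound so clients do not retry in sync.
    """
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def spread(num_clients: int, attempt: int, bucket_s: float = 1.0, seed: int = 0) -> Dict[int, int]:
    """Count how many simulated clients would reconnect in each time bucket."""
    rng = random.Random(seed)
    buckets: Dict[int, int] = {}
    for _ in range(num_clients):
        bucket = int(reconnect_delay(attempt, rng=rng) // bucket_s)
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return buckets
```

Calling `spread(10000, attempt=4)` simulates 10,000 clients dropped by a failed node and shows their reconnects distributed over roughly 16 one-second buckets instead of arriving at the same instant. The more connections a single node holds, the more this kind of spreading matters when that node goes down.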
Common Pitfalls
1. Neglecting the importance of connection management can lead to high latency and message delivery failures.
   Without proper connection handling and cleanup, stale connections can accumulate, causing delays and failures in message delivery. Implementing heartbeats and idle connection cleanup can mitigate these issues.
2. Overlooking the need for observability can hinder system performance optimization.
   Without metrics and monitoring, it becomes challenging to identify performance bottlenecks and reliability issues. Investing in observability tools is essential for maintaining a high-performance messaging system.
Related Concepts
WebSocket Protocols
Message Delivery Systems
Caching Strategies
Scalability in Distributed Systems