Instant Messaging at LinkedIn: Scaling to Hundreds of Thousands of Persistent Connections on One Machine

Akhilesh Gupta
12 min readadvanced
--
View Original

Overview

The article discusses how LinkedIn scaled its Instant Messaging service to handle hundreds of thousands of persistent connections on a single machine. It highlights the use of Server-sent events, the Play Framework, and the Akka Actor Model to manage these connections efficiently, along with insights from load testing and optimization techniques.

What You'll Learn

1

How to implement Server-sent events for real-time messaging

2

Why using the Akka Actor Model can optimize connection management

3

When to perform load testing with production traffic patterns

Prerequisites & Requirements

  • Understanding of persistent connections and real-time messaging
  • Familiarity with Play Framework and Akka(optional)

Key Questions Answered

How does LinkedIn manage hundreds of thousands of persistent connections?
LinkedIn uses Server-sent events (SSE) along with the Play Framework and Akka Actor Model to manage persistent connections. This approach allows the server to push data to clients in real-time without requiring repeated requests, enabling efficient handling of concurrent connections.
What limits did LinkedIn encounter during load testing?
During load testing, LinkedIn faced several limits including the maximum number of pending connections on a socket, JVM thread count, ephemeral port exhaustion, file descriptor limits, and JVM heap space. Each of these limits required specific optimizations to maintain performance and scalability.
Why did LinkedIn choose Server-sent events over Websockets initially?
LinkedIn opted for Server-sent events due to their compatibility with traditional HTTP, which provided maximum accessibility for users across various networks. Although Websockets offer more robust bi-directional communication, SSE was chosen for its initial simplicity and compatibility.

Key Statistics & Figures

Maximum connections per machine
100,000
This was achieved after optimizing various system limits during load testing.
JVM heap space
16GB
The heap space was increased to accommodate more persistent connections without running out of memory.
File descriptor limit
200,000
This limit was adjusted to allow more simultaneous connections on each node.

Technologies & Tools

Protocol
Server-sent Events
Used for pushing real-time updates to clients without requiring repeated requests.
Backend Framework
Play Framework
Facilitates the development of server applications and supports SSE and Websockets.
Actor Model Framework
Akka
Manages the lifecycle of connections and enables asynchronous processing.

Key Actionable Insights

1
Increase the net.core.somaxconn kernel parameter to handle more simultaneous connections.
This adjustment can prevent connection refusals when the server is under heavy load, ensuring that more clients can connect simultaneously.
2
Utilize the Akka Actor Model to manage each connection efficiently.
By assigning an Actor to each connection, you can leverage asynchronous processing, which enhances performance and scalability in high-load scenarios.
3
Conduct load testing with real production traffic to identify bottlenecks.
Simulated traffic may not reveal all issues; testing with actual user patterns provides insights into system limits and necessary optimizations.

Common Pitfalls

1
Failing to adjust the net.core.somaxconn parameter can lead to connection refusals.
If this parameter is not increased, the server may reject new connections when the backlog queue is full, limiting the number of simultaneous users.
2
Running out of JVM threads due to thread leaks.
If the application creates too many threads without releasing them, it can lead to OutOfMemoryError. Regular monitoring and profiling are essential to identify and fix such issues.

Related Concepts

Real-time Messaging
Load Testing Techniques
Concurrency Management With Akka