Open Sourcing iris-message-processor

Diego Cepeda
9 min read · intermediate

Overview

This article covers the open-sourcing of iris-message-processor, a service developed at LinkedIn for the Iris incident management system that replaces the earlier iris-sender. It describes the new distributed architecture, in which escalations and messages are processed concurrently across multiple nodes instead of by a single leader, and the performance gains that resulted.

What You'll Learn

1. How to implement the iris-message-processor for scalable message handling
2. Why re-architecting services can improve performance and reliability
3. How to manage on-call escalations effectively using open-source tools

Prerequisites & Requirements

  • Understanding of incident management systems
  • Familiarity with GitHub for accessing open-source repositories (optional)

Key Questions Answered

What improvements were made to the iris-message-processor compared to the previous design?
The iris-message-processor introduced a fully distributed architecture that allows for horizontal scaling, reducing reliance on a single leader node. This change resulted in processing speeds that were up to 86.6 times faster under high load conditions, addressing previous bottlenecks and improving reliability.
How does the iris-message-processor handle message processing differently?
The iris-message-processor splits escalations into buckets assigned to multiple nodes, allowing concurrent processing. This design eliminates delays caused by serial processing and reduces the load on the database, which is no longer used as a message queue.
What are the performance metrics achieved with the new iris-message-processor?
Under average load, the iris-message-processor was approximately 4.6 times faster than the previous iris-sender. Under high load, it was about 86.6 times faster.
What challenges did LinkedIn face with the original iris-sender design?
The original design faced issues such as delays in processing escalations due to serial handling, reliance on a single leader node, and database deadlocks caused by high message volumes. These challenges prompted the need for a redesign to improve scalability and reliability.
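
The bucket-based design described in the answers above can be illustrated with a minimal Go sketch. It is not the actual iris-message-processor code: the `Escalation` type, the modulo bucketing rule, and the function names are all assumptions made for illustration. The sketch groups escalations into buckets by ID and processes each bucket in its own goroutine, which is the general idea behind replacing serial processing with concurrent processing.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Escalation is a hypothetical stand-in for an Iris escalation.
type Escalation struct {
	ID   uint64
	Plan string
}

// bucketFor maps an escalation ID to one of numBuckets buckets.
// Modulo is an assumed rule; the real service may bucket differently.
func bucketFor(id uint64, numBuckets int) int {
	return int(id % uint64(numBuckets))
}

// groupByBucket partitions escalations so each bucket can be owned by one node.
func groupByBucket(escs []Escalation, numBuckets int) map[int][]Escalation {
	buckets := make(map[int][]Escalation)
	for _, e := range escs {
		b := bucketFor(e.ID, numBuckets)
		buckets[b] = append(buckets[b], e)
	}
	return buckets
}

// processConcurrently handles every bucket in its own goroutine and
// returns how many escalations were handled per bucket.
// The handle function must be safe to call from multiple goroutines.
func processConcurrently(buckets map[int][]Escalation, handle func(Escalation)) map[int]int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	counts := make(map[int]int)
	for b, escs := range buckets {
		wg.Add(1)
		go func(b int, escs []Escalation) {
			defer wg.Done()
			for _, e := range escs {
				handle(e)
			}
			mu.Lock()
			counts[b] = len(escs)
			mu.Unlock()
		}(b, escs)
	}
	wg.Wait()
	return counts
}

func main() {
	escs := []Escalation{
		{ID: 1, Plan: "page-oncall"},
		{ID: 2, Plan: "email-team"},
		{ID: 5, Plan: "page-oncall"},
		{ID: 8, Plan: "slack-channel"},
	}
	counts := processConcurrently(groupByBucket(escs, 4), func(Escalation) {})

	// Print buckets in a stable order.
	keys := make([]int, 0, len(counts))
	for b := range counts {
		keys = append(keys, b)
	}
	sort.Ints(keys)
	for _, b := range keys {
		fmt.Printf("bucket %d handled %d escalation(s)\n", b, counts[b])
	}
}
```

In a multi-node deployment each bucket would be owned by a different node rather than a goroutine, but the partitioning step is the same: no single worker has to walk every escalation in sequence.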

Key Statistics & Figures

Growth in escalations processed monthly
2,300%
This growth reflects the increased integration of Iris within LinkedIn's services over six years.
Average messages sent daily by Iris
700,000
This statistic highlights the scale at which Iris operates, with bursts exceeding 3,000 messages per second.
Performance improvement under high load
86.6x faster
This metric compares the iris-message-processor's performance to that of the previous iris-sender under high load conditions.

Technologies & Tools


Backend
Go
Used to develop the iris-message-processor for improved performance and scalability.
Database
Galera
Used as the database cluster for its strong consistency, but it suffered deadlocks under high message volumes when the database also served as a message queue.

Key Actionable Insights

1. Transitioning to a distributed architecture can significantly enhance system performance and reliability.
   By adopting the iris-message-processor, organizations can avoid the bottlenecks of single-node processing, leading to faster response times and improved service reliability.
2. Utilizing open-source tools like Iris and Oncall can provide cost-effective solutions for incident management.
   These tools offer flexibility and customization, making them suitable alternatives to commercial incident response platforms.
3. Regularly load testing your systems and balancing work across nodes can prevent performance degradation during peak usage.
   The iris-message-processor's design allows for automatic rebalancing, ensuring consistent performance even under high load conditions.
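
To make the rebalancing idea concrete, here is a small Go sketch under stated assumptions. The summary does not describe how iris-message-processor actually rebalances, so the round-robin rule, the `assignBuckets` function, and the node names below are hypothetical. The sketch recomputes bucket ownership from the current list of healthy nodes, so that when a node joins or leaves, its buckets are redistributed across the remaining nodes.

```go
package main

import (
	"fmt"
	"sort"
)

// assignBuckets distributes bucket indexes 0..numBuckets-1 across nodes
// round-robin. Nodes are sorted first so every caller that sees the same
// node list computes the same assignment. This rule is an assumption,
// not the documented behavior of iris-message-processor.
func assignBuckets(numBuckets int, nodes []string) map[string][]int {
	assignment := make(map[string][]int, len(nodes))
	if len(nodes) == 0 {
		return assignment
	}
	sorted := append([]string(nil), nodes...) // copy so the input is not mutated
	sort.Strings(sorted)
	for b := 0; b < numBuckets; b++ {
		n := sorted[b%len(sorted)]
		assignment[n] = append(assignment[n], b)
	}
	return assignment
}

// printAssignment prints node ownership in a stable order.
func printAssignment(label string, a map[string][]int) {
	names := make([]string, 0, len(a))
	for n := range a {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println(label)
	for _, n := range names {
		fmt.Printf("  %s owns buckets %v\n", n, a[n])
	}
}

func main() {
	// Three healthy nodes share six buckets.
	printAssignment("three nodes:", assignBuckets(6, []string{"node-a", "node-b", "node-c"}))

	// node-b drops out; recomputing spreads its buckets over the survivors.
	printAssignment("after node-b leaves:", assignBuckets(6, []string{"node-a", "node-c"}))
}
```

A plain round-robin reassignment moves more buckets than strictly necessary when membership changes; schemes such as consistent hashing reduce that movement at the cost of extra complexity.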

Common Pitfalls

1. Relying on a single leader node can create significant bottlenecks in processing.
   This design flaw can lead to delays and outages, especially under high load, necessitating a more distributed approach.
2. Using a database as a message queue can lead to performance issues and deadlocks.
   As message volumes increase, this practice can overwhelm the database, causing intermittent failures and slowdowns.

Related Concepts

Incident Management Systems
Distributed Systems Architecture
Open-source Software Development