Overview
The article discusses the open-sourcing of Iris and Oncall, two systems developed by LinkedIn to enhance incident management and escalation processes. It highlights the challenges faced in manual escalation and how Iris automates this process, improving response times and reliability.
What You'll Learn
1. How to implement an automated incident escalation system using Iris
2. Why modular architecture is crucial for scalable incident management tools
3. When to use Oncall for managing on-call schedules effectively
Key Questions Answered
How does Iris automate incident escalation at LinkedIn?
Iris automates incident escalation by allowing users to define specific escalation plans that it follows automatically during incidents. This reduces ambiguity and ensures timely responses by sending notifications based on configured priorities and contact preferences.
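To make the idea of an escalation plan concrete, here is a minimal Python sketch of walking a plan step by step until someone acknowledges. It is not Iris's actual data model or API: the `Step` fields, the role names, the users, and the `notifications_for` helper are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str          # hypothetical role name, e.g. "oncall-primary"
    priority: str      # hypothetical priority label, e.g. "urgent"
    wait_minutes: int  # time allowed for an acknowledgement before escalating


# Hypothetical data: who fills each role, and how each user wants to be
# contacted at each priority.
ROLE_MEMBERS = {"oncall-primary": ["alice"], "oncall-secondary": ["bob"]}
CONTACT_PREFS = {
    "alice": {"urgent": "call", "low": "email"},
    "bob": {"urgent": "sms", "low": "email"},
}

PLAN = [
    Step(role="oncall-primary", priority="urgent", wait_minutes=5),
    Step(role="oncall-secondary", priority="urgent", wait_minutes=10),
]


def notifications_for(plan, acked_after_minutes=None):
    """Return (minute, user, mode) tuples sent before the incident is acknowledged.

    acked_after_minutes=None means nobody ever acknowledges, so every step runs.
    """
    sent = []
    elapsed = 0
    for step in plan:
        # Stop escalating once the acknowledgement has already arrived.
        if acked_after_minutes is not None and acked_after_minutes <= elapsed:
            break
        for user in ROLE_MEMBERS[step.role]:
            mode = CONTACT_PREFS[user][step.priority]
            sent.append((elapsed, user, mode))
        elapsed += step.wait_minutes
    return sent
```

With this sketch, an incident acknowledged three minutes in only ever reaches the primary on-call, while an unacknowledged one moves on to the secondary after the first step's wait expires.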
What challenges did LinkedIn face before implementing Iris?
Before Iris, LinkedIn's incident escalation relied on manual processes, which were ambiguous and slow. Engineers in the network operations center (NOC) struggled to determine the right contacts for escalation, and the problem grew worse as the volume of alerts increased significantly, making incident response inefficient.
What role does Oncall play in the incident management system?
Oncall serves as the source of truth for determining who is on-call for specific teams, allowing managers to define rotating schedules and manage shifts efficiently. It provides a clean UI for scheduling and is beneficial even for teams that do not own critical applications.
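The rotating-schedule idea can be illustrated with a small Python sketch that answers "who is on call at this moment?" for a fixed-length rotation. This is only an assumed model, not how Oncall stores or computes schedules; the `on_call_at` function, the roster, and the weekly shift length are hypothetical.

```python
from datetime import datetime, timedelta


def on_call_at(roster, rotation_start, shift_length, when):
    """Return the roster member on call at `when` for a simple round-robin rotation.

    roster         -- ordered list of names (hypothetical)
    rotation_start -- datetime at which the first person's shift begins
    shift_length   -- timedelta covering one shift
    """
    if not roster:
        raise ValueError("roster must not be empty")
    if when < rotation_start:
        raise ValueError("rotation has not started yet")
    # Number of whole shifts completed since the rotation began.
    shifts_elapsed = (when - rotation_start) // shift_length
    return roster[shifts_elapsed % len(roster)]


ROSTER = ["alice", "bob", "carol"]
START = datetime(2024, 1, 1)
WEEK = timedelta(weeks=1)
```

A real scheduler also has to handle shift swaps, overrides, and time zones, which this sketch deliberately leaves out.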
What are the key features of Iris's architecture?
Iris's architecture is modular, allowing for pluggable external messaging services like Twilio. It tracks incidents through a REST API and ensures reliable message delivery while maintaining flexibility in how users are contacted based on their preferences.
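One way to picture a pluggable messaging layer is a small interface that every delivery backend implements, so that a vendor such as Twilio sits behind the same call as any other. The Python sketch below is an assumption-laden illustration, not Iris's plugin interface: the `Messenger` protocol, the `InMemoryMessenger` stand-in, and the fallback behavior in `deliver` are all hypothetical.

```python
from typing import Optional, Protocol


class Messenger(Protocol):
    """Hypothetical interface each messaging backend would implement."""

    def send(self, recipient: str, body: str) -> bool:
        """Return True if the message was handed off successfully."""
        ...


class InMemoryMessenger:
    """Stand-in backend for illustration; a real one would call a vendor API."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.outbox: list[tuple[str, str]] = []

    def send(self, recipient: str, body: str) -> bool:
        if self.fail:
            return False
        self.outbox.append((recipient, body))
        return True


def deliver(messengers: dict[str, Messenger], mode_order: list[str],
            recipient: str, body: str) -> Optional[str]:
    """Try each contact mode in order; return the mode that succeeded, or None."""
    for mode in mode_order:
        backend = messengers.get(mode)
        if backend is not None and backend.send(recipient, body):
            return mode
    return None


# Example wiring: the SMS backend is simulated as failing, email as healthy.
sms = InMemoryMessenger(fail=True)
email = InMemoryMessenger()
used_mode = deliver({"sms": sms, "email": email}, ["sms", "email"],
                    "alice", "database latency alert")
```

Keeping backends behind one interface means a new vendor can be added without touching the escalation logic. Falling back to the next mode on failure is one possible route to reliable delivery, shown here as an assumption rather than as Iris's documented behavior.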
Key Statistics & Figures
Incidents handled by Iris daily
hundreds
Iris has grown to handle hundreds of incidents a day since its implementation.
Major outages experienced by Iris
1
Iris has experienced only one major outage in its lifetime at LinkedIn, highlighting its reliability.
Technologies & Tools
Incident Management
Iris
An automated system for incident escalation and messaging.
Scheduling Tool
Oncall
A tool for managing on-call schedules and shifts.
Messaging Service
Twilio
Used for delivering notifications in the Iris system.
Key Actionable Insights
1. Implementing an automated escalation system like Iris can significantly reduce response times during incidents.
By automating the escalation process, teams can ensure that incidents are acknowledged quickly, minimizing downtime and improving overall service reliability.
2. Utilizing Oncall for scheduling can streamline on-call management and reduce the burden on teams.
Oncall allows for easy management of shifts and provides a clear overview of who is responsible at any given time, making it easier to handle incidents without confusion.
3. Regularly tuning your alerting system is essential to prevent alert fatigue and ensure effective incident management.
As seen with Iris, addressing the underlying issues of noisy alerts can improve the reliability of incident response and reduce unnecessary escalations.
Common Pitfalls
1. Relying on manual escalation processes can lead to delays and confusion during incidents.
Manual processes are often ambiguous and slow, which can exacerbate the impact of incidents. Transitioning to automated systems like Iris can mitigate these issues.
Related Concepts
Incident Management Systems
Automated Escalation Processes
On-call Scheduling Tools
Alerting System Optimization