Implementing ChatOps into our Incident Management Procedure

Production engineers (PE) are expected to be incident management experts. Still, incident handling is difficult, often messy, and exhausting. We encounter new incidents, search high and low for possible explanations, sometimes tunnel in on symptoms, and, under pressure, forget some best practices. At Shopify, we care not only about handling incidents quickly and efficiently, but also about PE well-being. We have a special IMOC (incident manager on call) rotation and an incident chatbot to assist IMOCs. This post provides an overview of incident management at Shopify, the responsibilities of the different roles during an incident, and how our chatbot works to support our team.

Daniella Niyonkuru

Overview

The article discusses the implementation of ChatOps at Shopify to enhance incident management procedures, focusing on the role of the Incident Manager on Call (IMOC) and the integration of a chatbot named Spy. It outlines the incident response process, the responsibilities of various roles, and how Spy assists in streamlining communication and actions during incidents.

What You'll Learn

1. How to integrate a chatbot into incident management processes
2. Why effective communication is crucial during incident response
3. When to utilize specific commands in ChatOps for incident management

Prerequisites & Requirements

  • Understanding of incident management principles
  • Familiarity with Slack and third-party tools like PagerDuty and GitHub (optional)

Key Questions Answered

What is the role of the Incident Manager on Call (IMOC) during an incident?
The IMOC leads the incident response, focusing on communication and ensuring that the response progresses effectively. They coordinate with other teams, confirm fixes, and document service disruptions, while not directly fixing production issues themselves.
How does the Spy chatbot assist during incident management?
Spy assists the IMOC by providing commands that streamline incident response, such as notifying team members, binding incidents to discussion channels, and facilitating communication with third-party services. It reduces manual effort and context switching during incidents.
What are the steps involved in the incident response process at Shopify?
The incident response process includes detecting failures, starting an incident, ensuring communication, fixing and mitigating issues, stopping the incident, and documenting the service disruption. Each step is crucial for effective incident management.
What commands does the Spy chatbot provide for incident management?
Spy provides commands such as 'spy page' to alert the IMOC, 'spy start incident' to initiate an incident, and 'spy incident tldr' to summarize ongoing incidents. These commands help streamline the incident response process.
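
The command handling described above can be sketched as a small dispatcher. This is a minimal illustration, not Spy's actual implementation: only the command names ('spy page', 'spy start incident', 'spy incident tldr') come from the article, while the handler functions, their free-text arguments, and the reply strings are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical handlers: each one only returns the text a bot might post.
# A real bot would call a paging service or a chat API at this point.
def page_imoc(args: List[str]) -> str:
    reason = " ".join(args) or "no reason given"
    return f"Paging IMOC: {reason}"

def start_incident(args: List[str]) -> str:
    title = " ".join(args) or "untitled incident"
    return f"Incident started: {title}"

def incident_tldr(args: List[str]) -> str:
    return "Summary of ongoing incidents requested"

# Command names follow the article; the mapping itself is an assumption.
COMMANDS: Dict[Tuple[str, ...], Callable[[List[str]], str]] = {
    ("page",): page_imoc,
    ("start", "incident"): start_incident,
    ("incident", "tldr"): incident_tldr,
}

def dispatch(message: str) -> str:
    """Route a chat message to a handler, or return '' if it is not for the bot."""
    words = message.strip().split()
    if not words or words[0].lower() != "spy":
        return ""
    rest = words[1:]
    # Try the longest command name first so that two-word commands
    # are matched before any one-word command.
    for length in sorted({len(key) for key in COMMANDS}, reverse=True):
        key = tuple(word.lower() for word in rest[:length])
        if key in COMMANDS:
            return COMMANDS[key](rest[length:])
    return "Unknown command"
```

Keeping the commands in one table means a new command is a single new entry, and anyone in the channel can see which message triggered which action.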

Key Actionable Insights

1. Integrate a chatbot like Spy into your incident management workflow to enhance communication and streamline processes.
   Using a chatbot can reduce manual tasks and improve the efficiency of incident responses, allowing teams to focus on resolving issues rather than managing communication.
2. Establish clear roles and responsibilities for incident response to avoid confusion during critical situations.
   Defining roles such as IMOC and Support Response Manager ensures that all team members know their responsibilities, which is vital for effective incident management.
3. Utilize ChatOps to keep all incident-related discussions in one channel to maintain focus and clarity.
   Centralizing communication in a dedicated channel prevents parallel discussions and confusion, ensuring that everyone involved has a shared understanding of the incident.

Common Pitfalls

1. Relying on memory for incident response steps can lead to mistakes and inefficiencies.
   Without a structured approach or tools like Spy, team members may forget critical steps or best practices, resulting in prolonged incident resolution times.
2. Failing to communicate effectively during an incident can create confusion and delays.
   When communication is scattered across multiple channels, it can lead to misunderstandings about who is responsible for what, hindering the incident response process.
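
One way to avoid relying on memory, as the first pitfall warns, is to encode the response steps as an explicit checklist. The sketch below is an illustration only: the six steps are the ones listed in this summary, while the `Step` and `IncidentChecklist` names and the strict ordering are assumptions (in practice, communication runs throughout an incident).

```python
from enum import IntEnum
from typing import Optional, Set

class Step(IntEnum):
    """Incident response steps, in the order this summary lists them."""
    DETECT = 1
    START = 2
    COMMUNICATE = 3
    FIX_AND_MITIGATE = 4
    STOP = 5
    DOCUMENT = 6

class IncidentChecklist:
    """Hypothetical helper that tracks which steps have been completed."""

    def __init__(self) -> None:
        self.done: Set[Step] = set()

    def complete(self, step: Step) -> None:
        self.done.add(step)

    def next_step(self) -> Optional[Step]:
        # Return the earliest step not yet completed, or None when all are done.
        for step in Step:
            if step not in self.done:
                return step
        return None

checklist = IncidentChecklist()
checklist.complete(Step.DETECT)
checklist.complete(Step.START)
print(checklist.next_step().name)  # COMMUNICATE
```

A chatbot could post the result of `next_step()` in the incident channel so the prompt for the next action does not depend on anyone's recall.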