Tracing Notifications

Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to keep on top of important information. Poor notification completeness erodes the trust of all Slack users. Notifications flow through almost all the systems in our infrastructure. As illustrated in Figure 1 below, a notification request…

Suman Karumuri
13 min read · beginner

Overview

This article describes how Slack traces notifications across its infrastructure. It explains why notifications matter to the user experience, why notification issues were hard to debug, and how a standardized tracing system improved customer support and analytics.

What You'll Learn

1. How to trace notifications across multiple systems in a standardized format
2. Why modeling notification flows as traces improves debugging efficiency
3. How to utilize trace data for customer support and analytics

Prerequisites & Requirements

  • Understanding of tracing and logging mechanisms
  • Familiarity with backend systems and notification workflows (optional)

Key Questions Answered

How does Slack trace notifications across its systems?
Slack traces notifications by modeling each notification as its own trace, using span links to connect them. This approach allows for 100% sampling of notification flows, ensuring that all notifications are accounted for, even when sent to multiple users across different devices.
What challenges did Slack face in debugging notification issues?
Slack encountered difficulties due to different logging pipelines and data formats across systems, which complicated the debugging process. This often required deep technical expertise and could take several days to resolve, leading to low customer satisfaction scores related to notifications.
What advantages does modeling notification flows as traces provide?
Modeling notification flows as traces offers consistent data formats, simplifies instrumentation, and allows for better performance analytics. This method also helps in preserving causality across different notification events, making it easier to understand and debug issues.
How is notification trace data used for customer support at Slack?
Notification trace data is utilized by the customer experience team to quickly triage issues related to notifications. With a clearer view of the notification flow, support engineers can identify where a notification was dropped, reducing the time to triage notification tickets by 30%.
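
The trace-per-notification model in the answers above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Slack's implementation: the `Span` and `SpanLink` classes, the stage names in `STAGES`, and the `last_stage_reached` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import uuid


@dataclass
class SpanLink:
    """Pointer from one span to a span in another trace (hypothetical shape)."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    trace_id: str
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    links: List[SpanLink] = field(default_factory=list)


def new_trace_id() -> str:
    return uuid.uuid4().hex


def start_notification_trace(sender_span: Span, user: str) -> Span:
    """Give one notification its own trace, linked back to the sender's span."""
    return Span(
        trace_id=new_trace_id(),
        name=f"notification:{user}",
        links=[SpanLink(sender_span.trace_id, sender_span.span_id)],
    )


def child(parent: Span, name: str) -> Span:
    """Create a span in the same trace as its parent."""
    return Span(trace_id=parent.trace_id, name=name, parent_id=parent.span_id)


# Assumed stage order, for illustration only.
STAGES = ["queued", "pushed", "received", "displayed"]


def last_stage_reached(spans: List[Span]) -> Optional[str]:
    """Return the furthest known stage recorded in a notification trace."""
    seen = {s.name for s in spans}
    reached = [stage for stage in STAGES if stage in seen]
    return reached[-1] if reached else None


# One message sent to two users yields two independent notification traces.
send = Span(trace_id=new_trace_id(), name="send_message")
alice_root = start_notification_trace(send, "alice")
bob_root = start_notification_trace(send, "bob")

alice_spans = [alice_root] + [child(alice_root, s) for s in STAGES]
# Bob's notification stops after the push stage in this made-up example.
bob_spans = [bob_root] + [child(bob_root, s) for s in STAGES[:2]]
```

In this sketch, `last_stage_reached(bob_spans)` evaluates to `"pushed"`, which is the kind of answer a support engineer would want when asking where a notification was dropped. Because each notification has its own trace id, one message fanned out to many users does not produce a single oversized trace.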

Key Statistics & Figures

Reduction in time to triage notification tickets
30%
This improvement was achieved through the implementation of notification tracing, allowing customer experience engineers to quickly identify issues.
Sampling rate for notification flows
100%
Unlike backend requests, which were sampled at 1%, notification flows require full fidelity to ensure all notifications are captured.
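
One way to express the two sampling rates is a per-kind sampler. The sketch below is an assumption about how such a policy could look, not a description of Slack's sampler: the `should_sample` function, the `kind` labels, and the hash-based bucketing are hypothetical.

```python
import hashlib

BACKEND_SAMPLE_RATE = 0.01       # 1% of backend request traces
NOTIFICATION_SAMPLE_RATE = 1.0   # every notification flow is kept


def should_sample(trace_id: str, kind: str) -> bool:
    """Decide whether to keep a trace, based on its kind and its trace id."""
    rate = NOTIFICATION_SAMPLE_RATE if kind == "notification" else BACKEND_SAMPLE_RATE
    if rate >= 1.0:
        return True
    # Hash the trace id so every service reaches the same decision for a trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate
```

Deriving the decision from the trace id, rather than from a random draw in each service, keeps a sampled trace complete across systems.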

Key Actionable Insights

1. Implementing a standardized tracing system for notifications can significantly reduce debugging time.
   By using a unified data format for tracing notifications, teams can quickly identify issues and improve response times, enhancing overall customer satisfaction.
2. Utilizing trace data for analytics can provide deeper insights into user engagement with notifications.
   Data scientists at Slack have leveraged trace data to analyze notification open rates and performance regressions, leading to better product decisions.
3. Decoupling trace context from request context simplifies the tracing process.
   This approach allows for more flexible data modeling and easier integration across different systems, which is crucial for maintaining a robust notification infrastructure.
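
The third insight can be illustrated with a small sketch in which the trace context travels inside the notification job itself, instead of being read from whatever request happens to be in scope. This is one possible reading of the idea, not the article's implementation; `TraceContext`, `enqueue_notification`, `handle_job`, and the `push.send` span name are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional
import json
import uuid


@dataclass(frozen=True)
class TraceContext:
    """Just enough state to attach new spans to an existing trace."""
    trace_id: str
    parent_span_id: Optional[str] = None


def enqueue_notification(payload: Dict[str, str], ctx: TraceContext) -> str:
    """Serialize the job with its trace context embedded in the message body."""
    return json.dumps({"payload": payload, "trace": asdict(ctx)})


def handle_job(raw: str) -> Dict[str, Optional[str]]:
    """A worker rebuilds the context from the message, with no request in scope."""
    msg = json.loads(raw)
    ctx = TraceContext(**msg["trace"])
    return {
        "trace_id": ctx.trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": ctx.parent_span_id,
        "name": "push.send",
    }
```

Because the context is plain data, a job queue, a push gateway, or a mobile client could each continue the same trace without sharing a request object.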

Common Pitfalls

1. Failing to standardize logging formats across different systems can lead to prolonged debugging sessions.
   When systems use different logging pipelines, it complicates the process of tracing notifications, making it difficult to pinpoint where issues occur.
2. Not implementing 100% sampling for critical notification flows can result in lost data.
   Sampling at lower rates may miss important notifications, especially in scenarios where many users are notified simultaneously, leading to incomplete data for analysis.
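
To make the first pitfall concrete, the sketch below normalizes records from two differently shaped logs into one span-like format so they can be ordered on a single timeline. Both input formats and every field name here are invented for illustration; the source does not describe the actual log schemas.

```python
from typing import Any, Dict, List


def from_backend_log(rec: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical backend record: timestamp in fractional seconds."""
    return {
        "trace_id": rec["notif_id"],
        "name": rec["event"],
        "start_us": round(rec["ts"] * 1_000_000),
        "service": "backend",
    }


def from_client_log(rec: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical client record: timestamp in integer milliseconds."""
    return {
        "trace_id": rec["notificationId"],
        "name": rec["type"],
        "start_us": rec["timestampMs"] * 1000,
        "service": "client",
    }


def timeline(spans: List[Dict[str, Any]], trace_id: str) -> List[str]:
    """Order the events of one notification across all systems."""
    mine = [s for s in spans if s["trace_id"] == trace_id]
    return [f'{s["service"]}:{s["name"]}' for s in sorted(mine, key=lambda s: s["start_us"])]


spans = [
    from_client_log({"notificationId": "n1", "type": "received", "timestampMs": 1700000000750}),
    from_backend_log({"notif_id": "n1", "event": "pushed", "ts": 1700000000.5}),
]
```

Once every system emits the same shape with the same time unit, a single query such as `timeline(spans, "n1")` replaces a manual walk through several logging pipelines.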

Related Concepts

Tracing And Logging Mechanisms
Notification Workflows
Data Analytics And Performance Monitoring