Tracing at Slack: Thinking in Causal Graphs

“Why is it slow?” is the hardest problem to debug in a complex distributed system like Slack. To diagnose a slow-loading channel with over a hundred thousand users, we’d need to look at client-side metrics, server-side metrics, and logs. It could be a client-side issue: a slow network connection or hardware. On the other hand,…

Overview

The article discusses Slack's approach to distributed tracing using causal graphs, focusing on the limitations of traditional tracing systems and the development of a new data structure called SpanEvent. It highlights the architecture, implementation, and benefits of their tracing system, which enables better performance analysis and debugging.

What You'll Learn

1. How to implement a tracing system using causal graphs
2. Why traditional tracing APIs may not fit all use cases
3. How to generate and report SpanEvents in a tracing system

Prerequisites & Requirements

  • Understanding of distributed systems and tracing concepts
  • Familiarity with SQL for querying trace data (optional)

Key Questions Answered

What are the limitations of traditional distributed tracing systems?
Traditional distributed tracing systems often struggle with flexibility, especially in contexts where there is no clear start or end for operations, such as in mobile applications or complex workflows. They can also be too heavy for simpler use cases, making them less suitable for certain applications.
How does Slack's tracing system improve performance analysis?
Slack's tracing system uses causal graphs and SpanEvents to provide a more granular view of request processing. This allows for easier querying of trace data and better insights into performance issues, enabling engineers to quickly triage and resolve incidents.
What is a SpanEvent and how is it structured?
A SpanEvent is a core component of Slack's tracing system, representing an event in a causal graph. It includes fields such as Id, Timestamp, Duration, Parent Id, Trace Id, Name, Type, Tags, and Span type, allowing for detailed tracking of operations across services.
What goals did Slack aim to achieve with their tracing system?
Slack aimed to create a tracing system that is useful across various platforms, provides a simple API for non-backend use cases, allows real-time incident triaging, enables querying of raw span data, and offers a visual query language for trace analytics.
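
To make the SpanEvent structure described above more concrete, here is a minimal Python sketch. The field names follow the list in the answer above. The Python types, the time units, and the helper names `children_by_parent` and `render` are assumptions made for illustration and are not taken from Slack's implementation. The sketch also shows how parent ids alone are enough to rebuild the causal graph of a trace.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanEvent:
    # Field names mirror the summary; types and units are assumptions.
    id: str
    timestamp: int             # start time, assumed microseconds
    duration: int              # assumed microseconds
    parent_id: Optional[str]   # None marks a root span in this sketch
    trace_id: str
    name: str
    type: str
    tags: dict[str, str] = field(default_factory=dict)
    span_type: str = ""


def children_by_parent(spans):
    """Index spans by parent id; these edges form the causal graph."""
    graph = defaultdict(list)
    for span in spans:
        graph[span.parent_id].append(span)
    return graph


def render(graph, parent_id=None, depth=0):
    """Return an indented outline of the graph, ordered by start time."""
    lines = []
    for span in sorted(graph.get(parent_id, []), key=lambda s: s.timestamp):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration}us)")
        lines.extend(render(graph, span.id, depth + 1))
    return lines


# Hypothetical spans for one trace; names and values are made up.
EXAMPLE_SPANS = [
    SpanEvent("a", 0, 900, None, "t1", "http.request", "webapp"),
    SpanEvent("b", 100, 300, "a", "t1", "mysql.query", "db"),
    SpanEvent("c", 450, 200, "a", "t1", "memcache.get", "cache"),
]

if __name__ == "__main__":
    print("\n".join(render(children_by_parent(EXAMPLE_SPANS))))
```

Because each event carries its own parent id and trace id, events can be reported independently and in any order, and the graph is assembled later from the stored records.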

Key Statistics & Figures

Daily traces processed
310M traces/day
This statistic reflects the scale at which Slack's tracing system operates, highlighting its capability to handle a large volume of data.
Daily spans generated
8.5B spans/day
This demonstrates the granularity of tracking that Slack's tracing system achieves, allowing for detailed performance analysis.
Trace data produced daily
2Tb of trace data
This volume of data underscores the comprehensive nature of Slack's tracing efforts and their commitment to performance monitoring.

Key Actionable Insights

1. Implementing a tracing system using causal graphs can significantly enhance your ability to diagnose performance issues in distributed applications.
   By adopting a causal graph model, you can simplify the process of tracking requests across services, making it easier to identify bottlenecks and optimize performance.
2. Utilizing SQL for querying trace data allows for flexible and powerful analysis of performance metrics.
   This approach not only makes it easier for engineers to extract insights but also leverages existing SQL knowledge, reducing the learning curve for new team members.
3. Adopting a lightweight API for generating SpanEvents can facilitate gradual integration of tracing into existing applications.
   This allows teams to start tracing without overhauling their entire architecture, making it easier to adopt best practices incrementally.
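
As a rough illustration of what a lightweight API for generating span events could look like, the sketch below wraps a block of code in a context manager that records timing and reports one event when the block exits. Everything here is hypothetical: the `span` and `report` functions, the `REPORTED` list standing in for a real transport, and the use of plain dictionaries for events are assumptions for this example, not Slack's API.

```python
import time
import uuid
from contextlib import contextmanager

# Stand-in for a real transport (for example an HTTP endpoint or a queue).
REPORTED = []


def report(event):
    """Hypothetical reporter: collect finished events in memory."""
    REPORTED.append(event)


@contextmanager
def span(name, trace_id=None, parent_id=None, tags=None):
    """Time the wrapped block and report one event when it finishes."""
    event = {
        "id": uuid.uuid4().hex,
        "trace_id": trace_id or uuid.uuid4().hex,  # new trace if none given
        "parent_id": parent_id,
        "name": name,
        "tags": dict(tags or {}),
        "timestamp": time.time_ns() // 1_000,      # microseconds
    }
    start = time.perf_counter_ns()
    try:
        yield event
    finally:
        # Report even if the block raised, so failures are still traced.
        event["duration"] = (time.perf_counter_ns() - start) // 1_000
        report(event)


if __name__ == "__main__":
    with span("channel.load", tags={"platform": "ios"}) as root:
        # Passing the parent's ids is what links the two events causally.
        with span("db.query", trace_id=root["trace_id"], parent_id=root["id"]):
            time.sleep(0.01)
    for event in REPORTED:
        print(event["name"], event["duration"], "us")
```

An existing code path can be traced by wrapping it in a single `with` statement, which is the kind of incremental adoption the third insight describes. Child events are reported before their parents because inner blocks finish first, so a consumer cannot assume parents arrive first.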

Common Pitfalls

1. Over-reliance on traditional tracing APIs can lead to confusion and inefficiencies in tracing non-standard applications.
   Many existing tracing frameworks are designed for backend services, making them unsuitable for client applications or scripts, which can complicate the tracing process.
2. Failing to query raw trace data can limit the insights gained from tracing efforts.
   Without the ability to run complex queries on trace data, teams may miss critical performance issues that could be identified through deeper analysis.
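
To show the kind of analysis that raw span data makes possible, here is a small sketch that loads a few spans into an in-memory SQLite database and queries them with SQL. SQLite, the `spans` table layout, the sample rows, and the function names are all assumptions chosen to keep the example self-contained; they do not describe Slack's storage or schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE spans (
        id TEXT PRIMARY KEY,
        trace_id TEXT,
        parent_id TEXT,
        name TEXT,
        duration_us INTEGER
    )
    """
)
# Made-up sample rows: two traces of the same request type.
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?)",
    [
        ("a", "t1", None, "http.request", 900),
        ("b", "t1", "a", "mysql.query", 300),
        ("c", "t1", "a", "memcache.get", 200),
        ("d", "t2", None, "http.request", 1500),
        ("e", "t2", "d", "mysql.query", 1200),
    ],
)


def slowest_operations(conn, limit=3):
    """Aggregate across traces: which operations are slowest on average?"""
    return conn.execute(
        """
        SELECT name, COUNT(*) AS n, AVG(duration_us) AS avg_us,
               MAX(duration_us) AS max_us
        FROM spans
        GROUP BY name
        ORDER BY avg_us DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def child_share_of_parent(conn):
    """Self-join on parent_id: how much of a parent's time does a child use?"""
    return conn.execute(
        """
        SELECT p.name, c.name, c.duration_us * 1.0 / p.duration_us AS share
        FROM spans AS c
        JOIN spans AS p ON c.parent_id = p.id
        ORDER BY share DESC
        """
    ).fetchall()


if __name__ == "__main__":
    print(slowest_operations(conn))
    print(child_share_of_parent(conn))
```

The second query relies on the causal edge stored in `parent_id`. A prebuilt dashboard might not offer that view, but an ad hoc SQL query over raw spans can answer the question directly.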

Related Concepts

Distributed Tracing
Causal Graphs
SpanEvents
Performance Monitoring