Tracing at Slack: Thinking in Causal Graphs

Suman Karumuri

“Why is it slow?” is the hardest problem to debug in a complex distributed system like Slack. To diagnose a slow-loading channel with over a hundred thousand users, we’d need to look at client-side metrics, server-side metrics, and logs. It could be a client-side issue: a slow network connection or hardware. On the other hand,…

Slack


Chef, Elasticsearch, Java, JavaScript, Jenkins, JSON, Prometheus, Python, SQL, TypeScript

Overview

The article discusses Slack's approach to distributed tracing using causal graphs, focusing on the limitations of traditional tracing systems and the development of a new data structure called SpanEvent. It highlights the architecture, implementation, and benefits of their tracing system, which enables better performance analysis and debugging.

What You'll Learn

1. How to implement a tracing system using causal graphs
2. Why traditional tracing APIs may not fit all use cases
3. How to generate and report SpanEvents in a tracing system

Prerequisites & Requirements

Understanding of distributed systems and tracing concepts
Familiarity with SQL for querying trace data (optional)

Key Questions Answered

What are the limitations of traditional distributed tracing systems?

Traditional distributed tracing systems often struggle with flexibility, especially in contexts where there is no clear start or end for operations, such as in mobile applications or complex workflows. They can also be too heavy for simpler use cases, making them less suitable for certain applications.

How does Slack's tracing system improve performance analysis?

Slack's tracing system uses causal graphs and SpanEvents to provide a more granular view of request processing. This allows for easier querying of trace data and better insights into performance issues, enabling engineers to quickly triage and resolve incidents.

What is a SpanEvent and how is it structured?

A SpanEvent is a core component of Slack's tracing system, representing an event in a causal graph. It includes fields such as Id, Timestamp, Duration, Parent Id, Trace Id, Name, Type, Tags, and Span type, allowing for detailed tracking of operations across services.
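
To make the field list above concrete, here is a minimal sketch of a SpanEvent record and of how parent ids link events into a causal graph. The field names come from the list above; the Python types, the units, the default values, and the `build_causal_graph` helper are assumptions made for illustration, not Slack's actual definitions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanEvent:
    # Field names follow the list above; types and units are assumptions.
    id: str
    timestamp: int             # assumed: start time in microseconds
    duration: int              # assumed: microseconds
    parent_id: Optional[str]   # None marks a root event (assumption)
    trace_id: str
    name: str
    type: str
    tags: Dict[str, str] = field(default_factory=dict)
    span_type: str = "unknown"  # hypothetical default value


def build_causal_graph(spans: List[SpanEvent]) -> Dict[Optional[str], List[str]]:
    """Map each parent id to the ids of the events it caused (hypothetical helper)."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span.id)
    return dict(children)


# Made-up example data: one request that triggers two downstream operations.
spans = [
    SpanEvent("a", 0, 100, None, "t1", "http.request", "webapp"),
    SpanEvent("b", 10, 40, "a", "t1", "db.query", "mysql"),
    SpanEvent("c", 55, 30, "a", "t1", "cache.get", "memcache"),
]
graph = build_causal_graph(spans)
```

Here `graph` maps `None` to the root event `a`, and `a` to its two children `b` and `c`. Only the Parent Id and Trace Id fields are needed to reconstruct the graph; the remaining fields describe each node.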

What goals did Slack aim to achieve with their tracing system?

Slack aimed to create a tracing system that is useful across various platforms, provides a simple API for non-backend use cases, allows real-time incident triaging, enables querying of raw span data, and offers a visual query language for trace analytics.

Key Statistics & Figures

Daily traces processed

310M traces/day

This statistic reflects the scale at which Slack's tracing system operates, highlighting its capability to handle a large volume of data.

Daily spans generated

8.5B spans/day

This demonstrates the granularity of tracking that Slack's tracing system achieves, allowing for detailed performance analysis.

Trace data produced daily

2Tb of trace data

This volume of data underscores the comprehensive nature of Slack's tracing efforts and their commitment to performance monitoring.

Technologies & Tools


Monitoring

Prometheus

Used for metrics collection and monitoring within Slack's infrastructure.

Search

Elasticsearch

Utilized for log querying to provide visibility into performance issues.

Query Engine

Presto

Used for complex analytical queries over trace data in the data warehouse.

Messaging

Kafka

Serves as the event bus for routing logs, events, and metrics.

Analytics

Honeycomb

Used for real-time visualization and analysis of trace data.

Key Actionable Insights

1
Implementing a tracing system using causal graphs can significantly enhance your ability to diagnose performance issues in distributed applications.
By adopting a causal graph model, you can simplify the process of tracking requests across services, making it easier to identify bottlenecks and optimize performance.

2
Utilizing SQL for querying trace data allows for flexible and powerful analysis of performance metrics.
This approach not only makes it easier for engineers to extract insights but also leverages existing SQL knowledge, reducing the learning curve for new team members.

3
Adopting a lightweight API for generating SpanEvents can facilitate gradual integration of tracing into existing applications.
This allows teams to start tracing without overhauling their entire architecture, making it easier to adopt best practices incrementally.
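
As a rough illustration of the third insight, a lightweight span-generating API could be as small as a context manager that times a block of code and reports one event when the block exits. Everything named here (`span`, `REPORTED`, the dictionary layout, microsecond units) is a hypothetical sketch, not Slack's API; a real system would send events to a collector rather than append them to a list.

```python
import time
import uuid
from contextlib import contextmanager

# Stand-in for a real reporting endpoint (hypothetical).
REPORTED = []


@contextmanager
def span(name, trace_id, parent_id=None, tags=None):
    """Time the wrapped block and report one span event when it finishes."""
    event = {
        "id": uuid.uuid4().hex,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "name": name,
        "tags": dict(tags or {}),
        "timestamp": time.time_ns() // 1000,  # assumed unit: microseconds
    }
    start = time.perf_counter_ns()
    try:
        yield event
    finally:
        # Report even if the block raised, so failed operations stay visible.
        event["duration"] = (time.perf_counter_ns() - start) // 1000
        REPORTED.append(event)


# Existing code can be wrapped one call site at a time.
with span("load_channel", trace_id="t1") as root:
    with span("fetch_messages", trace_id="t1", parent_id=root["id"]):
        pass  # existing application code would run here
```

Because each event is reported when its block exits, the child event lands in `REPORTED` before its parent. The parent link is carried by `parent_id`, so arrival order does not matter when the graph is rebuilt later.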

Common Pitfalls

1

Over-reliance on traditional tracing APIs can lead to confusion and inefficiencies in tracing non-standard applications.

Many existing tracing frameworks are designed for backend services, making them unsuitable for client applications or scripts, which can complicate the tracing process.

2

Failing to query raw trace data can limit the insights gained from tracing efforts.

Without the ability to run complex queries on trace data, teams may miss critical performance issues that could be identified through deeper analysis.
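
The kind of raw-span query this pitfall refers to can be sketched with plain SQL. The example below uses Python's built-in `sqlite3` module as a stand-in for a data-warehouse query engine; the `spans` table, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for a trace data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE spans (
        id TEXT, trace_id TEXT, parent_id TEXT, name TEXT, duration INTEGER
    )
    """
)

# Made-up sample spans from two traces.
rows = [
    ("a", "t1", None, "http.request", 100),
    ("b", "t1", "a", "db.query", 40),
    ("c", "t2", None, "http.request", 300),
    ("d", "t2", "c", "db.query", 250),
]
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?, ?)", rows)

# Which operations are slowest on average across all traces?
slowest = conn.execute(
    """
    SELECT name,
           COUNT(*)      AS span_count,
           AVG(duration) AS avg_duration,
           MAX(duration) AS max_duration
    FROM spans
    GROUP BY name
    ORDER BY avg_duration DESC
    """
).fetchall()

for name, span_count, avg_duration, max_duration in slowest:
    print(name, span_count, avg_duration, max_duration)
```

An aggregate like this cuts across traces, which a per-trace waterfall view cannot show. The same pattern extends to joins on `parent_id` when the question involves a parent and its children.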

Related Concepts

Distributed Tracing

Causal Graphs

SpanEvents

Performance Monitoring
