Monitoring at Spotify: The Story So Far

John-John Tedro
6 min read · intermediate

Overview

This article traces the evolution of operational monitoring at Spotify, covering the challenges encountered and the solutions adopted along the way. It describes the move from Zabbix and a homegrown system to alerting built on Riemann, with Lyceum developed on top of it to manage alerting rules.

What You'll Learn

1. How to implement alerting systems using Riemann
2. Why a push-based approach for metrics collection is beneficial
3. When to use tags instead of hierarchical naming for time series data

Prerequisites & Requirements

  • Understanding of monitoring systems and distributed architecture
  • Familiarity with Riemann and metric collection tools (optional)

Key Questions Answered

What were the limitations of Zabbix in Spotify's monitoring?
Zabbix focuses strongly on individual hosts, which made it difficult to manage as Spotify's infrastructure grew. Only SRE personnel could operate it, so it did not scale as the number of hosts increased significantly.
How did Spotify transition from a pull-based to a push-based metrics collection approach?
Spotify transitioned to a push-based approach to reduce the complexity of defining metrics, allowing engineers to send metrics directly to an API without needing extensive configuration. This method improved the speed of adoption and ensured metrics were captured even from short-lived tasks.
What challenges did Spotify face with Graphite for graphing metrics?
Spotify encountered issues with Graphite's vertical scalability and the complexities involved in sharding and rebalancing data. The hierarchical naming structure of Graphite also posed challenges for filtering and querying metrics effectively.
Why did Spotify choose to develop their own time series database, Heroic?
Spotify developed Heroic after finding performance and stability issues with existing solutions like KairosDB. The decision was driven by the need for a scalable time series database that could support their growing operational monitoring requirements.

Technologies & Tools

  • Zabbix (monitoring tool): Initially used for operational monitoring but found unsuitable for scaling.
  • Riemann (monitoring tool): Chosen for its ability to monitor distributed systems and facilitate alerting.
  • Lyceum (library): Built on top of Riemann to manage alerting rules in a git repository.
  • Munin (monitoring tool): Used for metrics collection in the initial monitoring stack.
  • Collectd (metrics collection tool): Experimented with for gathering metrics in a push-based approach.
  • Graphite (graphing tool): Initially used for graphing metrics but faced scalability issues.
  • Heroic (time series database): Developed as a scalable solution for time series data.

Key Actionable Insights

1. Implement a push-based metrics collection system to enhance data capture efficiency.
   By allowing engineers to send metrics directly to an API, you can simplify the process and ensure metrics are captured even from transient tasks, improving overall monitoring effectiveness.
2. Consider using tags for time series data instead of hierarchical naming.
   Tags provide greater flexibility and interoperability, allowing for easier filtering and querying of metrics without the constraints of a fixed hierarchy.
3. Invest in developing a custom monitoring solution if existing tools do not meet your scalability needs.
   Spotify's experience shows that building a tailored solution can address specific operational challenges and provide better performance than off-the-shelf products.
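
The push-based insight above can be illustrated with a minimal sketch. It is not Spotify's implementation: `InMemoryCollector`, `MetricPoint`, `run_short_lived_task`, and the metric names are all hypothetical, and the collector is an in-process stand-in for whatever ingestion API a real system would expose over the network. The point it tries to show is that the task itself reports its metrics before exiting, so nothing has to discover or poll it while it is alive.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    """One pushed sample. The field layout is an assumption for illustration."""
    key: str
    value: float
    tags: tuple        # sorted (name, value) pairs, kept hashable
    timestamp: float


class InMemoryCollector:
    """Hypothetical stand-in for a metrics ingestion API."""

    def __init__(self):
        self.points = []

    def push(self, key, value, tags=None, timestamp=None):
        # The sender decides what to report; no per-metric configuration
        # is needed on the collector side.
        point = MetricPoint(
            key=key,
            value=float(value),
            tags=tuple(sorted((tags or {}).items())),
            timestamp=time.time() if timestamp is None else timestamp,
        )
        self.points.append(point)
        return point


def run_short_lived_task(collector):
    """A transient job that pushes its own metrics before it exits."""
    started = time.monotonic()
    processed = sum(1 for _ in range(1000))  # placeholder for real work
    elapsed = time.monotonic() - started

    tags = {"role": "batch-import"}
    collector.push("task.records_processed", processed, tags)
    collector.push("task.duration_seconds", elapsed, tags)
    return processed


if __name__ == "__main__":
    collector = InMemoryCollector()
    run_short_lived_task(collector)
    for point in collector.points:
        print(point.key, point.value, dict(point.tags))
```

In a pull-based setup, a job like this could start and finish between two polling intervals and never be observed; here the two samples exist as soon as the task has run.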

Common Pitfalls

1. Relying too heavily on hierarchical naming for time series data can lead to inflexibility.
   This approach may restrict the ability to filter and query data effectively, as it requires strict adherence to naming conventions that may not suit all use cases.
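
To make the contrast between hierarchical names and tags concrete, here is a small sketch. The series names, tag names (`env`, `site`, `host`, `role`), and helper functions are invented for illustration and do not reflect Graphite's or Heroic's actual schemas. The hierarchical matcher compares dot-separated segments one by one, which is a simplified approximation of glob-style path matching.

```python
import fnmatch

# Hierarchical naming: meaning depends on the position of each segment.
HIERARCHICAL = [
    "prod.lon.db1.cpu.user",
    "prod.sto.db2.cpu.user",
    "prod.lon.web1.cpu.user",
]

# Tagged naming: a short key plus explicit, order-independent attributes.
TAGGED = [
    {"key": "cpu.user", "tags": {"env": "prod", "site": "lon", "host": "db1", "role": "db"}},
    {"key": "cpu.user", "tags": {"env": "prod", "site": "sto", "host": "db2", "role": "db"}},
    {"key": "cpu.user", "tags": {"env": "prod", "site": "lon", "host": "web1", "role": "web"}},
]


def match_hierarchical(names, pattern):
    """Match dotted names segment by segment; '*' never spans a dot."""
    wanted = pattern.split(".")
    matches = []
    for name in names:
        parts = name.split(".")
        if len(parts) == len(wanted) and all(
            fnmatch.fnmatchcase(part, want) for part, want in zip(parts, wanted)
        ):
            matches.append(name)
    return matches


def match_tags(series, **wanted):
    """Select series whose tags contain every requested name/value pair."""
    return [
        s for s in series
        if all(s["tags"].get(name) == value for name, value in wanted.items())
    ]


if __name__ == "__main__":
    # Finding database hosts relies on a host-naming convention ("db*")
    # and on knowing that the host is the third segment.
    print(match_hierarchical(HIERARCHICAL, "prod.*.db*.cpu.user"))

    # With tags, the role is queried directly, in any combination.
    print([s["tags"]["host"] for s in match_tags(TAGGED, role="db")])
    print([s["tags"]["host"] for s in match_tags(TAGGED, site="lon", role="web")])
```

With the hierarchical form, any series that adds, drops, or reorders a segment silently falls out of existing patterns, while the tag filter only depends on the attributes it names.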