Spotify’s Event Delivery – The Road to the Cloud (Part I)

Igor Maravić

Spotify

Google Cloud

Overview

The article discusses Spotify's event delivery system, detailing its architecture, challenges, and the transition to a cloud-based solution using Google Cloud managed services. It highlights the importance of event delivery in product design and data infrastructure, as well as the lessons learned from the current system's operation.

What You'll Learn

1. How to ensure reliable event delivery in distributed systems
2. Why centralized data storage such as Hadoop can create bottlenecks
3. When to consider a complete rewrite of a tightly coupled system

Prerequisites & Requirements

Understanding of event-driven architecture and data pipelines
Familiarity with Hadoop and Kafka (optional)

Key Questions Answered

How does Spotify's event delivery system ensure data completeness?

Spotify's event delivery system tracks data completeness using the syslog timestamps of events and persists all data in a centralized Hadoop cluster. This design makes Hadoop a single point of failure: if Hadoop is down, delivery stalls and undelivered events have to be held on the service hosts, so disk space on those hosts must be managed carefully.
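The disk-space concern can be made concrete with a back-of-the-envelope estimate of how long a service host could keep buffering events locally during a Hadoop outage. This is only a sketch: the function name and all the numbers in the usage example are hypothetical and do not come from the article.

```python
def buffer_headroom_hours(free_disk_bytes: int,
                          events_per_second: float,
                          avg_event_bytes: int) -> float:
    """Return how many hours a host can buffer events before its disk fills.

    All inputs are assumptions supplied by the caller; the article does not
    publish per-host rates or event sizes.
    """
    if events_per_second <= 0 or avg_event_bytes <= 0:
        raise ValueError("rate and event size must be positive")
    bytes_per_hour = events_per_second * avg_event_bytes * 3600
    return free_disk_bytes / bytes_per_hour


if __name__ == "__main__":
    # Hypothetical host: 50 GiB free, 2,000 events/s, 500 bytes per event.
    hours = buffer_headroom_hours(50 * 1024**3, 2_000, 500)
    print(f"{hours:.1f} hours of headroom")
```

An estimate like this is mainly useful for setting alert thresholds: if the headroom is shorter than a realistic Hadoop recovery time, the host needs more disk or the outage needs a faster response.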

What challenges does Spotify face with its current event delivery system?

Spotify's current event delivery system faces three main challenges: latency caused by its reliance on centralized storage, the risk of outages from tightly coupled components, and the need for manual intervention when end-of-file markers are missing. These issues have prompted Spotify to consider a complete rewrite of the system.
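One way to picture the end-of-file marker problem is as a set comparison: a batch of data is treated as complete only when every expected producer host has delivered its marker. The sketch below assumes that model; the function names and host names are hypothetical, and the article's actual completeness mechanism may differ.

```python
from typing import Iterable, Set


def missing_eof_markers(expected_hosts: Iterable[str],
                        hosts_with_marker: Iterable[str]) -> Set[str]:
    """Return the hosts that have not yet delivered an end-of-file marker."""
    return set(expected_hosts) - set(hosts_with_marker)


def batch_is_complete(expected_hosts: Iterable[str],
                      hosts_with_marker: Iterable[str]) -> bool:
    """A batch is complete when no expected host is missing its marker."""
    return not missing_eof_markers(expected_hosts, hosts_with_marker)


if __name__ == "__main__":
    expected = ["host-a", "host-b", "host-c"]   # hypothetical producer hosts
    received = ["host-a", "host-c"]             # markers seen so far
    print("missing:", sorted(missing_eof_markers(expected, received)))
    print("complete:", batch_is_complete(expected, received))
```

Reporting the missing hosts, rather than only a yes/no answer, is what makes the manual intervention step tractable: an operator can see which producers are holding up the batch.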

What is the role of Kafka in Spotify's event delivery system?

Kafka serves as the backbone of Spotify's event delivery system, transferring log lines from producers to consumers. The system uses Kafka topics to manage event streams, although the current design has limited ability to provide different quality-of-service levels for different streams.

How does Spotify handle event transformation in its data pipeline?

Spotify uses an Extract, Transform, Load (ETL) job to convert data from tab-separated text to Avro. This transformation gives the data a structure that can be processed efficiently in Hadoop, but it currently adds latency to event delivery.
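The "transform" step can be illustrated with a minimal sketch that parses one tab-separated log line into a typed record. The field names and types below are hypothetical, since the article does not publish its schemas, and the final serialization to Avro (which would need an Avro library) is deliberately left out.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical schema: ordered (field name, converter) pairs.
SCHEMA: List[Tuple[str, Callable[[str], Any]]] = [
    ("timestamp", int),
    ("user_id", str),
    ("event_type", str),
    ("duration_ms", int),
]


def parse_line(line: str,
               schema: List[Tuple[str, Callable[[str], Any]]] = SCHEMA
               ) -> Dict[str, Any]:
    """Convert one tab-separated line into a dict of typed fields.

    Raises ValueError if the field count does not match the schema or a
    value cannot be converted to its declared type.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(schema):
        raise ValueError(
            f"expected {len(schema)} fields, got {len(values)}")
    return {name: convert(value)
            for (name, convert), value in zip(schema, values)}


if __name__ == "__main__":
    record = parse_line("1456790400\tu42\tplay\t30500\n")
    print(record)
```

Rejecting malformed lines at this stage, instead of passing them through, keeps schema violations from surfacing later in downstream jobs.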

Key Statistics & Figures

Events delivered per second

700,000

Spotify's system can reliably push over 700,000 events per second, showcasing its capacity to handle large volumes of data.

Technologies & Tools

Backend

Kafka

Used for transferring log lines from producers to consumers in the event delivery system.

Backend

Hadoop

Centralized storage for event data, where events are persisted and processed.

Data Format

Avro

Format used for storing delivered data in Hadoop, allowing for efficient data processing.

Data Processing

Crunch

Framework used for implementing the ETL job that transforms data into Avro format.

Key Actionable Insights

1. Implement a robust monitoring system to track event delivery and data completeness.
By actively monitoring the event delivery process, teams can quickly identify and address bottlenecks or failures, ensuring that data flows smoothly and reliably through the system.

2. Consider transitioning to a more decentralized architecture to reduce latency and improve resilience.
Decentralizing components can help alleviate the single point of failure associated with centralized systems like Hadoop, allowing for more flexible and responsive data handling.

3. Utilize structured data formats like Avro to minimize transformation latency.
Sending data in a structured format can streamline the processing pipeline, reducing the time it takes for data to become available for analysis and improving overall system performance.

Common Pitfalls

1. Relying on a centralized system like Hadoop can create significant bottlenecks and single points of failure.
When the centralized system is down, the entire event delivery process stalls, leading to potential data loss and delays. To mitigate this, consider implementing a more distributed architecture.

2. Using unstructured data formats can introduce unnecessary latency in data processing.
Transforming unstructured data adds overhead to the event delivery process, which can delay the availability of data for analysis. Opting for structured formats can streamline this process.

Related Concepts

Event-driven Architecture

Data Pipelines

Distributed Systems

Data Completeness Challenges