Spotify’s Event Delivery – The Road to the Cloud (Part III)

Igor Maravić
11 min readadvanced
--
View Original

Overview

This article discusses Spotify's transition to a cloud-based event delivery system, focusing on the architecture and implementation using Google Cloud services. It highlights the use of Cloud Pub/Sub for event transport and Dataflow for processing, along with the challenges and performance metrics observed during the implementation.

What You'll Learn

1

How to use Dataflow for real-time event processing

2

Why partitioning data into hourly buckets improves data management

3

When to apply windowing concepts in streaming data pipelines

Prerequisites & Requirements

  • Basic understanding of data processing concepts and cloud services(optional)
  • Familiarity with Google Cloud services, particularly Cloud Pub/Sub and Dataflow
  • Experience with ETL processes and data pipelines(optional)

Key Questions Answered

How does Spotify ensure reliable event delivery from clients to servers?
Spotify uses a new event delivery system based on Google Cloud managed services, specifically utilizing Cloud Pub/Sub for event transport and Dataflow for processing. This architecture ensures that events are reliably sent from clients worldwide to a central processing system.
What challenges did Spotify face while implementing the Dataflow ETL job?
Spotify encountered challenges such as determining the optimal number of workers for the ETL job, managing in-flight data during job restarts, and understanding watermark behavior for event timeliness. These issues required ongoing collaboration with Dataflow engineers to find effective solutions.
What performance improvements were observed with the new event delivery system?
The new event delivery system demonstrated a worst-case end-to-end latency that is four times lower than the previous system. This improvement highlights the effectiveness of transitioning to a cloud-based architecture using Google Cloud services.
How does Dataflow handle late-arriving data in event processing?
Dataflow manages late-arriving data by writing it to the currently open hourly bucket while shifting the event timestamp forward. This approach ensures that late data is still processed without affecting the integrity of already completed buckets.

Key Statistics & Figures

Number of ETL jobs running
4
Spotify currently has four ETL jobs handling different event types.
Peak event processing rate
100k events per second
The largest ETL job peaks at around 100k events per second.
Watermark lag
mostly below 200s
~3.5 mins

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Cloud Pub/Sub
Used for transporting events from clients to the central processing system.
Backend
Dataflow
Utilized for processing events and managing ETL jobs.
Storage
Cloud Storage
Replacing HDFS for persistent storage of exported events.
Database
Bigquery
Replacing Hive for querying and analyzing exported event data.

Key Actionable Insights

1
Implementing a streaming ETL job using Dataflow can significantly reduce data processing latency.
By continuously running the ETL job, Spotify can incrementally fill hourly buckets as data arrives, leading to better performance compared to traditional batch jobs.
2
Utilizing Cloud Pub/Sub for event transport can simplify the architecture of distributed systems.
Cloud Pub/Sub provides a reliable messaging service that decouples event producers from consumers, making it easier to scale and manage event-driven architectures.
3
Understanding windowing concepts in Dataflow is crucial for effective event processing.
Dataflow's ability to create windows based on event timestamps allows for more accurate data partitioning and processing, which is essential for real-time analytics.

Common Pitfalls

1
Failing to manage in-flight data during job restarts can lead to data loss.
It's crucial to implement mechanisms that ensure data integrity when jobs are updated or restarted, as Spotify is currently facing challenges in this area.
2
Not understanding watermark behavior can result in inaccurate event timeliness detection.
Watermark propagation is not synchronized between transforms, which can lead to falsely detected late events if not properly monitored.

Related Concepts

Event-driven Architecture
Streaming Data Processing
Cloud-based Etl Solutions