Spotify’s Event Delivery – The Road to the Cloud (Part II)

Igor Maravić

Spotify

Tags: Apache, Apache Kafka, Embedding, Google Cloud, Java, REST API

Overview

This article discusses Spotify's transition to a new event delivery system built on Google Cloud managed services, focusing on the architecture and design choices made to improve reliability and efficiency. It highlights the challenges faced with the previous system and the decision-making process that led to the adoption of Cloud Pub/Sub over Kafka.

What You'll Learn

1. How to design a reliable event delivery system using Cloud Pub/Sub

2. Why managed services can be a better fit than self-hosted solutions for event delivery

3. How to performance-test cloud services before adopting them

Prerequisites & Requirements

Understanding of event-driven architecture and cloud services
Experience with event delivery systems and cloud computing (optional)

Key Questions Answered

What are the main components of Spotify's new event delivery system?

The new event delivery system consists of four main components: the File Tailer, the Event Delivery Service, the Reliable Persistent Queue, and the ETL job. Each component plays a specific role in ensuring reliable transport and processing of events from clients to the central system.
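To make the division of labor between the four components concrete, here is a minimal in-memory sketch in Python. It is not Spotify's implementation: the tab-separated log format, the function and class names, and the hourly grouping done by the ETL step are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timezone


def tail_events(log_lines):
    """File Tailer stand-in: parse 'type<TAB>unix_ts<TAB>payload' lines (assumed format)."""
    for line in log_lines:
        event_type, ts, payload = line.rstrip("\n").split("\t", 2)
        yield {"type": event_type, "ts": int(ts), "payload": payload}


class InMemoryQueue:
    """Reliable Persistent Queue stand-in: a plain FIFO held in memory."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def drain(self):
        # Hand out messages in publish order until the queue is empty.
        while self._messages:
            yield self._messages.popleft()


def deliver(events, queue):
    """Event Delivery Service stand-in: forward every event to the queue."""
    count = 0
    for event in events:
        queue.publish(event)
        count += 1
    return count


def run_etl(queue):
    """ETL stand-in: group payloads by (event type, UTC hour). Grouping is assumed."""
    buckets = defaultdict(list)
    for event in queue.drain():
        hour = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y-%m-%dT%H")
        buckets[(event["type"], hour)].append(event["payload"])
    return dict(buckets)
```

Chaining the pieces as `deliver(tail_events(lines), queue)` followed by `run_etl(queue)` mirrors the flow from clients to the central system; in a real deployment each stage would run as a separate process and the queue would be a durable service.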

How does Cloud Pub/Sub compare to Kafka for event delivery?

Cloud Pub/Sub offers global availability, a simple REST API, and operational management by Google, making it a more attractive option compared to Kafka 0.8, which faced stability issues and required significant operational overhead. Pub/Sub's ability to retain undelivered data for 7 days and provide reliability through application-level acknowledgments further enhances its appeal.
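The combination of retention and application-level acknowledgments can be illustrated with a small toy model. This sketch does not use the Cloud Pub/Sub client library; the `AckQueue` class, its method names, and the 10-second ack deadline are hypothetical, and only the 7-day retention figure comes from the text above. Time is passed in explicitly so the behavior is easy to follow.

```python
class AckQueue:
    """Toy queue: a message is redelivered until the consumer acknowledges it."""

    RETENTION_SECONDS = 7 * 24 * 3600  # undelivered data is kept for 7 days

    def __init__(self, ack_deadline=10.0):
        self.ack_deadline = ack_deadline  # assumed value, for illustration only
        self._next_id = 0
        self._pending = {}  # message id -> (publish time, payload)
        self._leases = {}   # message id -> time at which redelivery is allowed

    def publish(self, payload, now):
        msg_id = self._next_id
        self._next_id += 1
        self._pending[msg_id] = (now, payload)
        return msg_id

    def pull(self, now, max_messages=10):
        """Return unacknowledged messages whose lease is absent or has expired."""
        self._drop_expired(now)
        delivered = []
        for msg_id, (_, payload) in self._pending.items():
            if self._leases.get(msg_id, float("-inf")) <= now:
                self._leases[msg_id] = now + self.ack_deadline
                delivered.append((msg_id, payload))
                if len(delivered) == max_messages:
                    break
        return delivered

    def ack(self, msg_id):
        """Application-level acknowledgment: the message will not be redelivered."""
        self._pending.pop(msg_id, None)
        self._leases.pop(msg_id, None)

    def _drop_expired(self, now):
        expired = [m for m, (t, _) in self._pending.items()
                   if now - t > self.RETENTION_SECONDS]
        for msg_id in expired:
            self.ack(msg_id)
```

A consumer that crashes after pulling a message but before calling `ack` simply sees the message again once the lease expires, which is why consumers in such a design generally need to tolerate duplicates.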

What performance metrics were achieved during the testing of Cloud Pub/Sub?

During testing, Spotify was able to publish 2 million messages per second without service degradation and observed almost no server errors from the Pub/Sub backend. This performance was achieved while maintaining low and consistent latency, confirming the system's capability to handle high loads.

What challenges did Spotify face with Kafka 0.8?

Spotify encountered several challenges with Kafka 0.8, including instability with the Kafka Producer and Mirror Maker, which failed to reliably mirror data between data centers. These issues necessitated a reevaluation of their event delivery approach, ultimately leading to the decision to explore Cloud Pub/Sub.

Key Statistics & Figures

Peak event publishing rate

2 million events per second

This was the test load used to evaluate Cloud Pub/Sub's performance under heavy load.

Average event publishing rate during consumer stability test

800K messages per second

This rate was maintained during the test to mimic real-world load variations.

End-to-end latency during testing

20 seconds

This was the median latency measured, including backlog recovery.
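A figure like the median end-to-end latency can be derived from publish and consume timestamps recorded per message. The following is a minimal sketch of that calculation, not the tooling Spotify used; the function names and the dictionary-based input format are assumptions, and the p99 uses a simple nearest-rank rule.

```python
import math
import statistics


def end_to_end_latencies(published_at, consumed_at):
    """Latency per message id, for ids present in both timestamp maps (seconds)."""
    return [consumed_at[k] - published_at[k] for k in consumed_at if k in published_at]


def summarize(latencies):
    """Median, nearest-rank p99, and max of a non-empty list of latencies."""
    ordered = sorted(latencies)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }
```

Reporting the median alongside a high percentile is useful in a test that includes backlog recovery, since a backlog tends to stretch the tail of the distribution more than its middle.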

Technologies & Tools


Backend

Cloud Pub/Sub

Used as the reliable persistent queue for event delivery.

Backend

Kafka 0.8

Initially considered for the new event delivery system, but Spotify ran into stability problems with its Producer and Mirror Maker.

Backend

Apollo

Framework used to build the Event Delivery Service.

Backend

Helios

Orchestration platform used for deploying the Event Delivery Service.

Key Actionable Insights

1. Consider using managed services like Cloud Pub/Sub for event delivery to reduce operational overhead and improve reliability.
   Managed services can handle scaling and maintenance, allowing your team to focus on building features rather than managing infrastructure.

2. Implement thorough performance testing when transitioning to new cloud services to ensure they meet your scalability needs.
   Testing under expected loads can help identify potential bottlenecks and ensure that the new system can handle future growth.

3. Utilize a structured approach to event types by creating separate channels or topics for each event.
   This practice enhances the efficiency of real-time use cases and simplifies processing by reducing the complexity of handling multiple event types in a single stream.
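The third insight, one topic per event type, can be sketched as a small routing step. The naming scheme (`events.<type>`), the sanitizing rule, and the function names below are hypothetical and chosen only to illustrate the idea; a real system would follow the naming constraints of its queue.

```python
import re
from collections import defaultdict


def topic_for(event_type, prefix="events"):
    """Map an event type to a topic name, replacing unsupported characters with '-'."""
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", event_type).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a topic name from {event_type!r}")
    return f"{prefix}.{slug}"


def route(events):
    """Group events by topic so each consumer only reads the types it cares about."""
    topics = defaultdict(list)
    for event in events:
        topics[topic_for(event["type"])].append(event)
    return dict(topics)
```

With this split, a real-time consumer interested in a single event type subscribes to one topic instead of filtering a mixed stream.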

Common Pitfalls

1

Relying on self-hosted solutions like Kafka without adequate operational planning can lead to significant instability.

Spotify's experience with Kafka demonstrated that without proper deployment strategies and monitoring, the system could fail under production loads.

Related Concepts

Event-driven Architecture

Cloud Services

Performance Testing

Dataflow