Overview
This article discusses Spotify's transition to a new event delivery system built on Google Cloud managed services, focusing on the architecture and design choices made to improve reliability and efficiency. It highlights the challenges faced with the previous system and the decision-making process that led to the adoption of Cloud Pub/Sub over Kafka.
What You'll Learn
1. How to design a reliable event delivery system using Cloud Pub/Sub
2. Why to choose managed services over self-hosted solutions for event delivery
3. How to implement performance testing for cloud services
Prerequisites & Requirements
- Understanding of event-driven architecture and cloud services
- Experience with event delivery systems and cloud computing (optional)
Key Questions Answered
What are the main components of Spotify's new event delivery system?
The new event delivery system consists of four main components: the File Tailer, the Event Delivery Service, the Reliable Persistent Queue, and the ETL job. Each component plays a specific role in ensuring reliable transport and processing of events from clients to the central system.
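The four components can be pictured as stages in a pipeline. The following is a minimal in-memory sketch, not Spotify's implementation: the names `file_tailer`, `EventDeliveryService`, `InMemoryQueue`, and `etl_job`, as well as the "type payload" log line format, are assumptions made purely for illustration.

```python
from collections import defaultdict, deque


class InMemoryQueue:
    """Stand-in for the Reliable Persistent Queue (hypothetical; not Pub/Sub)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def drain(self):
        # Hand out messages in the order they were published.
        while self._messages:
            yield self._messages.popleft()


def file_tailer(lines):
    """Read raw log lines and yield parsed events (assumed 'type payload' format)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event_type, _, payload = line.partition(" ")
        yield {"type": event_type, "payload": payload}


class EventDeliveryService:
    """Accept events from tailers and forward them to the queue."""

    def __init__(self, queue):
        self.queue = queue

    def deliver(self, event):
        self.queue.publish(event)


def etl_job(queue):
    """Consume queued events and group payloads by event type."""
    buckets = defaultdict(list)
    for event in queue.drain():
        buckets[event["type"]].append(event["payload"])
    return dict(buckets)


log_lines = ["play track1", "skip track2", "", "play track3"]
queue = InMemoryQueue()
service = EventDeliveryService(queue)
for event in file_tailer(log_lines):
    service.deliver(event)
result = etl_job(queue)
print(result)
```

In a real deployment each stage would run as a separate process or service; the sketch only shows the direction in which events flow between them.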
How does Cloud Pub/Sub compare to Kafka for event delivery?
Cloud Pub/Sub offers global availability, a simple REST API, and operations handled by Google. That made it more attractive than Kafka 0.8, which had stability issues and required significant operational effort. Pub/Sub also retains undelivered data for 7 days and provides reliability through application-level acknowledgments.
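To illustrate what application-level acknowledgments and a retention window mean for a consumer, here is a toy queue in plain Python. It is a sketch under simplifying assumptions and does not use or mirror the Cloud Pub/Sub client API: the `AckQueue` class is hypothetical, time is passed in explicitly as `now`, and ack deadlines are ignored.

```python
RETENTION_SECONDS = 7 * 24 * 60 * 60  # 7-day retention, as stated in the text


class AckQueue:
    """Toy at-least-once queue (illustrative; not the Cloud Pub/Sub API)."""

    def __init__(self):
        self._next_id = 0
        self._pending = {}  # message id -> (publish_time, data)

    def publish(self, data, now):
        message_id = self._next_id
        self._next_id += 1
        self._pending[message_id] = (now, data)
        return message_id

    def pull(self, now):
        """Return unacknowledged messages still inside the retention window."""
        expired = [m for m, (t, _) in self._pending.items()
                   if now - t > RETENTION_SECONDS]
        for message_id in expired:
            del self._pending[message_id]
        return [(m, data) for m, (_, data) in self._pending.items()]

    def ack(self, message_id):
        # Acknowledged messages are never delivered again.
        self._pending.pop(message_id, None)


queue = AckQueue()
queue.publish("event-a", now=0)
queue.publish("event-b", now=0)

# The consumer acknowledges only the first message before "crashing".
batch = queue.pull(now=10)
queue.ack(batch[0][0])

# After a restart, the unacknowledged message is delivered again.
redelivered = queue.pull(now=20)

# Past the retention window, the still-unacknowledged message is dropped.
after_retention = queue.pull(now=RETENTION_SECONDS + 1)
```

The point of the sketch is that a message stays available until the consumer explicitly acknowledges it, so a consumer crash leads to redelivery rather than data loss, as long as recovery happens within the retention window.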
What performance metrics were achieved during the testing of Cloud Pub/Sub?
During testing, Spotify was able to publish 2 million messages per second without service degradation and observed almost no server errors from the Pub/Sub backend. This performance was achieved while maintaining low and consistent latency, confirming the system's capability to handle high loads.
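A load test of this kind boils down to publishing many messages while recording throughput, error counts, and latency percentiles. The harness below is a minimal single-threaded sketch, not the tooling Spotify used: `fake_publish` is a hypothetical stand-in for a real publish call, and the message count and payload size are arbitrary.

```python
import statistics
import time


def fake_publish(message):
    """Hypothetical stand-in for a real publish call; always succeeds here."""
    return len(message) >= 0


def run_load_test(publish, message_count, payload=b"x" * 100):
    """Publish message_count payloads and summarize errors and latency."""
    latencies = []
    errors = 0
    started = time.perf_counter()
    for _ in range(message_count):
        t0 = time.perf_counter()
        try:
            ok = publish(payload)
        except Exception:
            ok = False  # count exceptions as failed publishes
        latencies.append(time.perf_counter() - t0)
        if not ok:
            errors += 1
    elapsed = time.perf_counter() - started
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "sent": message_count,
        "errors": errors,
        "throughput_per_s": message_count / elapsed if elapsed > 0 else float("inf"),
        "p50_s": statistics.median(latencies),
        "p99_s": cuts[98],
    }


report = run_load_test(fake_publish, message_count=10_000)
print(report)
```

Reaching rates in the millions of messages per second would require many concurrent publishers across machines; the sketch only shows which measurements to collect per publisher.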
What challenges did Spotify face with Kafka 0.8?
Spotify encountered several challenges with Kafka 0.8, including instability with the Kafka Producer and Mirror Maker, which failed to reliably mirror data between data centers. These issues necessitated a reevaluation of their event delivery approach, ultimately leading to the decision to explore Cloud Pub/Sub.
Key Statistics & Figures
Peak event publishing rate
2 million events per second
This was the publishing load used to test how Cloud Pub/Sub behaves under heavy traffic.
Average event publishing rate during consumer stability test
800K messages per second
This rate was maintained during the test to mimic real-world load variations.
End-to-end latency during testing
20 seconds
This was the median latency measured, including backlog recovery.
Technologies & Tools
Backend
Cloud Pub/Sub
Used as the reliable persistent queue for event delivery.
Backend
Kafka 0.8
Initially considered for the event delivery system, but it proved unstable in Spotify's setup.
Backend
Apollo
Framework used to build the Event Delivery Service.
Backend
Helios
Orchestration platform used for deploying the Event Delivery Service.
Key Actionable Insights
1. Consider using managed services like Cloud Pub/Sub for event delivery to reduce operational overhead and improve reliability. Managed services can handle scaling and maintenance, allowing your team to focus on building features rather than managing infrastructure.
2. Implement thorough performance testing when transitioning to new cloud services to ensure they meet your scalability needs. Testing under expected loads can help identify potential bottlenecks and ensure that the new system can handle future growth.
3. Utilize a structured approach to event types by creating separate channels or topics for each event. This practice enhances the efficiency of real-time use cases and simplifies processing by reducing the complexity of handling multiple event types in a single stream.
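The topic-per-event-type idea from the third insight can be sketched as a small routing function. This is an illustration only: the `events.<type>` naming scheme, the `topic_for` and `route` helpers, and the sample event types are assumptions, not details from the original article.

```python
import re
from collections import defaultdict


def topic_for(event_type, prefix="events"):
    """Map an event type to its own topic name (naming scheme is assumed)."""
    slug = re.sub(r"[^a-z0-9]+", "-", event_type.lower()).strip("-")
    if not slug:
        raise ValueError("event type must contain at least one letter or digit")
    return f"{prefix}.{slug}"


def route(events):
    """Group events by the topic derived from their type."""
    topics = defaultdict(list)
    for event in events:
        topics[topic_for(event["type"])].append(event)
    return dict(topics)


events = [
    {"type": "SongPlayed", "user": "u1"},
    {"type": "Ad Clicked", "user": "u2"},
    {"type": "SongPlayed", "user": "u3"},
]
routed = route(events)
song_played = routed["events.songplayed"]
```

With one topic per type, a consumer interested only in one kind of event subscribes to that topic and never has to filter out unrelated events from a shared stream.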
Common Pitfalls
1. Relying on self-hosted solutions like Kafka without adequate operational planning can lead to significant instability. Spotify's experience with Kafka demonstrated that without proper deployment strategies and monitoring, the system could fail under production loads.
Related Concepts
Event-driven Architecture
Cloud Services
Performance Testing
Dataflow