Changing the Wheels on a Moving Bus — Spotify’s Event Delivery Migration

Flavio Santos (Data Infrastructure Engineer) and Robert Stephenson (Senior Product Manager)
14 min read · intermediate

Overview

This article discusses Spotify's migration of its Event Delivery Infrastructure (EDI) to Google Cloud Platform (GCP), detailing the challenges faced, solutions implemented, and the resulting improvements in data handling and operational efficiency. Key highlights include the increase in event traffic and data ingestion, as well as the strategies employed to address legacy system limitations.

What You'll Learn

1. How to improve data reliability in event-driven architectures
2. Why transitioning to cloud-managed services can enhance operational efficiency
3. When to implement deduplication strategies for event data
4. How to manage legacy systems during infrastructure migration

Prerequisites & Requirements

  • Understanding of event-driven architecture concepts
  • Familiarity with Google Cloud Platform services (optional)

Key Questions Answered

What challenges did Spotify face during the EDI migration?
Spotify encountered several challenges during the EDI migration, including data loss from mobile clients, a time-consuming control plane user experience, and the need to maintain backwards compatibility with legacy systems. These issues degraded productivity and necessitated a redesign of the infrastructure to improve data quality and operational efficiency.
How did Spotify handle the transition from legacy systems to the new EDI?
Spotify implemented a data transformation pipeline to manage the transition from legacy systems to the new EDI. This pipeline reads events from legacy clients, converts them, and feeds them into the new infrastructure, allowing for a gradual migration while maintaining data governance principles.
What improvements were made to the EDI's data handling capabilities?
The new EDI handles far more data than its predecessor: peak traffic grew from 1.5 million events per second to nearly 8 million, and the system ingests nearly 70 TB of data daily. It also introduced better deduplication strategies and client re-sends to enhance data reliability.
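The summary does not describe how deduplication is implemented. As a minimal sketch of the general idea, assume every event carries a unique identifier (the `event_id` field here is hypothetical) and that one batch fits in memory, so re-sent copies can be dropped by remembering which identifiers have already been seen:

```python
from typing import Iterable, Iterator


def deduplicate(events: Iterable[dict], key: str = "event_id") -> Iterator[dict]:
    """Yield each event once, keeping the first occurrence of every key value."""
    seen = set()
    for event in events:
        event_id = event[key]
        if event_id in seen:
            continue  # a repeated copy of an event that was already emitted
        seen.add(event_id)
        yield event


batch = [
    {"event_id": "a1", "type": "play"},
    {"event_id": "b2", "type": "pause"},
    {"event_id": "a1", "type": "play"},  # duplicate, e.g. caused by a client re-send
]
unique = list(deduplicate(batch))
```

At production scale the set of seen identifiers would have to be bounded, for example by deduplicating within a time window; that choice is outside what this summary covers.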
Why is client re-sending important in the new EDI?
Client re-sending is crucial in the new EDI to reduce event loss and improve reliability, especially given the challenges of unstable connections and offline usage. This strategy allows events to be temporarily stored on clients and re-sent according to a defined retry policy, ensuring data integrity.
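The retry policy itself is not specified in this summary. The sketch below shows one way a client-side buffer could hold events until they are acknowledged. The class name, the `send` callback, and the policy parameters (`max_attempts`, `base_delay_s`, `max_delay_s`) are all assumptions for illustration, not Spotify's actual client code:

```python
import random
from collections import deque
from typing import Callable


class ResendBuffer:
    """Holds events on the client until the backend acknowledges them."""

    def __init__(self, send: Callable[[dict], bool], max_attempts: int = 5,
                 base_delay_s: float = 1.0, max_delay_s: float = 60.0) -> None:
        self._send = send                  # returns True when the backend acknowledges
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._max_delay_s = max_delay_s
        self._pending = deque()            # items are (event, failed attempts so far)
        self.dropped = []                  # events that exhausted the retry policy

    @property
    def pending_count(self) -> int:
        return len(self._pending)

    def enqueue(self, event: dict) -> None:
        self._pending.append((event, 0))

    def backoff(self, attempts: int) -> float:
        """Suggested wait before the next flush: exponential with full jitter."""
        ceiling = min(self._max_delay_s, self._base_delay_s * 2 ** attempts)
        return random.uniform(0, ceiling)

    def flush(self) -> int:
        """Try every pending event once; keep failures for a later flush."""
        delivered = 0
        for _ in range(len(self._pending)):
            event, attempts = self._pending.popleft()
            if self._send(event):
                delivered += 1
            elif attempts + 1 < self._max_attempts:
                self._pending.append((event, attempts + 1))
            else:
                self.dropped.append(event)
        return delivered


attempts_seen = []

def flaky_send(event: dict) -> bool:
    """Hypothetical transport that fails twice, then succeeds."""
    attempts_seen.append(event["event_id"])
    return len(attempts_seen) >= 3

buffer = ResendBuffer(flaky_send)
buffer.enqueue({"event_id": "a1"})
while buffer.pending_count:
    buffer.flush()  # a real client would wait buffer.backoff(n) between flushes
```

Because a re-send can arrive after the original was in fact received, this kind of at-least-once delivery is usually paired with deduplication on the receiving side.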

Key Statistics & Figures

Peak traffic handling capacity
Nearly 8 million events per second
This was an increase from the previous capacity of 1.5 million events per second.
Daily data ingestion volume
Nearly 70 TB
This volume reflects the total data ingested daily after the migration to the new EDI.

Key Actionable Insights

1. Implement a data transformation pipeline to facilitate gradual migration from legacy systems to new infrastructure.
   This approach allows for a smoother transition while maintaining data governance and minimizing disruption to existing services.
2. Adopt client re-sending strategies to enhance event reliability in environments with unstable network connections.
   By allowing clients to temporarily store and resend events, you can significantly reduce data loss and improve the overall quality of data collected.
3. Regularly revisit design decisions and assumptions during infrastructure upgrades to identify potential issues early.
   This practice helps to ensure that the new systems are robust and can accommodate evolving user needs without compromising performance.
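The first insight above, a transformation pipeline that reads legacy events, converts them, and feeds them into the new infrastructure, can be illustrated with a small sketch. The legacy field names (`msg_type`, `msg_id`) and the `NewEvent` schema are hypothetical; the summary does not describe either format:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class NewEvent:
    """Hypothetical shape of an event in the new infrastructure."""
    event_type: str
    event_id: str
    payload: dict


def convert_legacy(record: dict) -> NewEvent:
    """Map one legacy record onto the new schema (field names are assumptions)."""
    reserved = ("msg_type", "msg_id")
    return NewEvent(
        event_type=record["msg_type"],
        event_id=record["msg_id"],
        payload={k: v for k, v in record.items() if k not in reserved},
    )


def transform(legacy_records: Iterable[dict]) -> Iterator[NewEvent]:
    """Lazily convert a stream of legacy records into new-format events."""
    for record in legacy_records:
        yield convert_legacy(record)


legacy = [{"msg_type": "SongPlayed", "msg_id": "42", "track": "abc"}]
converted = list(transform(legacy))
```

Keeping the conversion in one function means downstream consumers see only the new format, while legacy clients can be retired at their own pace.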

Common Pitfalls

1. Failing to account for the long tail problem during infrastructure migrations can lead to significant data loss and operational challenges.
   This occurs because newer versions of applications take time to gain widespread adoption, so older clients remain in use and legacy systems stay in operation longer than anticipated.

Related Concepts

Event-driven Architecture
Data Governance Principles
Cloud Infrastructure Management