Migrating Critical Traffic At Scale with No Downtime — Part 2

Netflix Technology Blog
12 min read · advanced

Overview

This article discusses the techniques employed by Netflix to migrate critical traffic at scale without downtime. It covers methods such as Sticky Canaries, A/B Testing, and Traffic Dialing, emphasizing their importance in ensuring a seamless customer experience during system migrations.

What You'll Learn

1. How to implement Sticky Canaries for controlled service deployment
2. Why A/B Testing is crucial for validating backend system changes
3. How to use Traffic Dialing to manage risk during migrations
4. When to perform clean-up after a migration to avoid technical debt

Prerequisites & Requirements

  • Understanding of system migration concepts
  • Familiarity with A/B Testing frameworks (optional)

Key Questions Answered

What is the purpose of Sticky Canaries in system migrations?
Sticky Canaries validate a change in a controlled manner by routing each customer consistently to either the canary or the baseline cluster for the duration of the experiment. Because the assignment is sticky, performance can be monitored and regressions detected across the entire request lifecycle rather than within a single service, so that migrations do not disrupt the customer experience.
How does A/B Testing help in backend migrations?
A/B Testing allows Netflix to limit exposure to new system changes by selectively routing a portion of traffic to the updated backend. This controlled approach helps in assessing the impact of changes on user experience and system performance, thus minimizing risks associated with migrations.
What is Traffic Dialing and how is it implemented?
Traffic Dialing is a method used to control the flow of traffic to a new system by sampling inbound requests. It allows for gradual migration by adjusting the percentage of traffic routed to the new path, enabling quick rollbacks if issues are detected during the migration process.
What strategies are used for migrating persistent stores?
Netflix employs an ETL-based dual-write strategy for migrating persistent stores, which includes an initial load of data, continuous dual-writes, verification of records, and a final cut-over to the new data store. This approach ensures data integrity and system consistency throughout the migration.
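
The four phases named in the last answer (initial load, continuous dual-writes, verification, cut-over) can be sketched as follows. This is a minimal illustration, not Netflix's implementation: the `DualWriteMigration` class and its method names are hypothetical, and plain in-memory dictionaries stand in for the old and new data stores.

```python
from typing import Dict, List, Optional


class DualWriteMigration:
    """Toy model of the phases: initial load, dual-write, verify, cut-over.

    Hypothetical sketch: dictionaries stand in for real data stores.
    """

    def __init__(self, old_store: Dict[str, str], new_store: Dict[str, str]) -> None:
        self.old_store = old_store
        self.new_store = new_store
        self.cut_over_done = False  # reads are served by old_store until cut-over

    def initial_load(self) -> None:
        # Phase 1: bulk-copy existing records. setdefault keeps any newer value
        # that a dual-write has already placed in the new store.
        for key, value in self.old_store.items():
            self.new_store.setdefault(key, value)

    def write(self, key: str, value: str) -> None:
        # Phase 2: before cut-over every write goes to both stores.
        if not self.cut_over_done:
            self.old_store[key] = value
        self.new_store[key] = value

    def read(self, key: str) -> Optional[str]:
        source = self.new_store if self.cut_over_done else self.old_store
        return source.get(key)

    def verify(self) -> List[str]:
        # Phase 3: list keys whose values are missing or differ between stores.
        keys = set(self.old_store) | set(self.new_store)
        return sorted(k for k in keys if self.old_store.get(k) != self.new_store.get(k))

    def cut_over(self) -> None:
        # Phase 4: switch reads to the new store only if verification is clean.
        mismatches = self.verify()
        if mismatches:
            raise RuntimeError(f"refusing cut-over, mismatched keys: {mismatches}")
        self.cut_over_done = True
```

Gating the cut-over on an empty verification result is a design choice of this sketch; a real migration would verify incrementally and tolerate in-flight writes.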

Key Actionable Insights

1. Implement Sticky Canaries to validate changes in production environments effectively.
   Using Sticky Canaries allows for a more comprehensive assessment of system performance beyond just service-level metrics, ensuring that the entire request lifecycle is monitored for regressions.
2. Utilize A/B Testing to manage risk during significant backend migrations.
   By controlling the percentage of users exposed to new changes, A/B Testing helps identify potential issues early, allowing for adjustments before a full rollout.
3. Adopt Traffic Dialing as a final step in the migration process to mitigate risks.
   Traffic Dialing enables real-time adjustments to the flow of traffic, ensuring that any unforeseen issues can be quickly addressed without affecting the overall user experience.
4. Conduct thorough clean-up post-migration to prevent technical debt.
   Removing migration-related code and configurations ensures that the system remains efficient and manageable, paving the way for smoother future migrations.
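
As a rough illustration of the Traffic Dialing insight above, the sketch below samples inbound requests and sends a configurable percentage to the new path. The `TrafficDial` class and the `handle` function are hypothetical names chosen for this example; the summary does not describe how the dial is actually implemented, so treat this as one possible shape under that assumption.

```python
import random
from typing import Callable, Optional


class TrafficDial:
    """Hypothetical dial: routes roughly `percent` of requests to the new path."""

    def __init__(self, percent: float = 0.0, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()
        self.percent = 0.0
        self.set_percent(percent)

    def set_percent(self, percent: float) -> None:
        # Adjusting the dial at runtime; setting it back to 0 is the rollback.
        if not 0.0 <= percent <= 100.0:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

    def use_new_path(self) -> bool:
        # random() is in [0.0, 1.0), so 0 never samples and 100 always does.
        return self._rng.random() * 100.0 < self.percent


def handle(
    request: str,
    dial: TrafficDial,
    old_path: Callable[[str], str],
    new_path: Callable[[str], str],
) -> str:
    """Send the request down the new path only when the dial samples it."""
    return new_path(request) if dial.use_new_path() else old_path(request)
```

Raising the value in small steps while watching metrics, and dropping it to 0 when something looks wrong, is what makes the rollback quick: no deployment is needed, only a configuration change.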

Common Pitfalls

1. Failing to monitor customer metrics during canary deployments can lead to undetected regressions.
   Track customer-facing metrics alongside broader system operational metrics, not just the metrics of the service under test, so that any negative impact on the customer experience is identified and addressed promptly.
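
To make the pitfall concrete, here is a small sketch of sticky assignment plus a customer-metric comparison. Everything in it is an assumption made for illustration: the function names, the hash-based 50/50 split, the 5% tolerance, and the choice of a metric where higher is worse (for example, a playback start delay) are not taken from the article.

```python
import hashlib
from statistics import mean
from typing import List


def assign_cluster(device_id: str, experiment: str) -> str:
    """Deterministically pin a device to 'canary' or 'baseline'.

    Hashing the experiment name with the device id keeps the assignment
    stable for the whole experiment, which is what makes the canary sticky.
    """
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode("utf-8")).digest()
    return "canary" if digest[0] % 2 == 0 else "baseline"


def regressed(baseline: List[float], canary: List[float], tolerance: float = 0.05) -> bool:
    """Return True when the canary mean exceeds the baseline mean by more
    than `tolerance` (relative), for a metric where higher is worse."""
    return mean(canary) > mean(baseline) * (1.0 + tolerance)
```

Because each device stays in the same group, a customer-level metric collected per group reflects the whole request lifecycle for that group, which is the signal a service-level dashboard alone would miss.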