In this post, we’ll outline the approach we took to reliably scale our data platform in preparation for Black Friday and Cyber Monday.
Overview
This article discusses how Shopify's Data Platform Engineering team scaled its data platform to handle the unprecedented data volume during Black Friday and Cyber Monday (BFCM). It outlines the team's strategies for ingesting, processing, and delivering data reliably so that merchants have uninterrupted access to critical information.
What You'll Learn
- How to prepare a data platform for high-volume events like Black Friday and Cyber Monday
- Why tiered services are essential for prioritizing data reliability and infrastructure budgets
- How to identify and manipulate service-specific levers to maintain data freshness and latency
- When to run load tests and what metrics to evaluate for service reliability
Prerequisites & Requirements
- Understanding of data ingestion and processing concepts
- Familiarity with cloud infrastructure and Kubernetes (optional)
Key Questions Answered
- What strategies did Shopify use to scale their data platform for BFCM?
- How does Shopify prioritize its data services?
- What is throughput risk and how does it affect data services?
- What metrics did Shopify track during their load tests?
Key Actionable Insights
1. Implement a tiered services approach to prioritize data reliability based on impact. By categorizing services into tiers, teams can allocate resources more effectively, ensuring that critical services receive the attention they need during high-demand periods.
2. Conduct regular load tests to simulate high-traffic scenarios and identify potential bottlenecks. Load testing helps teams understand how their systems will perform under pressure, allowing them to make informed decisions about scaling and resource allocation before peak events.
3. Establish clear service objectives and levers for each data service to maintain performance. Knowing what levers to adjust, such as job frequency or resource allocation, allows teams to respond quickly to changing demands and ensure data freshness and latency are maintained.
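
The three insights above can be tied together in a small sketch. This is an illustration only, not Shopify's implementation: the tier names, the `DataService` fields, the example service names, and the rule that maps a missed objective to a lever are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical tiers: a lower number means higher merchant impact."""
    CRITICAL = 1
    IMPORTANT = 2
    BEST_EFFORT = 3


@dataclass
class DataService:
    name: str
    tier: Tier
    freshness_objective_s: float  # maximum acceptable data age, in seconds
    job_interval_s: float         # lever: how often the processing job runs


def review_load_test(service, observed_freshness_s):
    """Compare a load-test measurement against the service objective."""
    if observed_freshness_s <= service.freshness_objective_s:
        return "ok"
    # Objective missed. Assumed policy: critical services get more resources,
    # lower tiers give up job frequency instead.
    if service.tier is Tier.CRITICAL:
        return "scale up"
    return "reduce job frequency"


def order_by_priority(services):
    """Sort so that the most critical tier is reviewed first."""
    return sorted(services, key=lambda s: (s.tier, s.name))
```

For instance, a hypothetical `DataService("merchant_analytics", Tier.CRITICAL, 60.0, 30.0)` that shows 90 seconds of data age during a load test would be flagged for scaling up, while a best-effort service missing its objective would be flagged for a lower job frequency. Which lever fits which tier is a judgment call for each team.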