How we prepare Shopify for BFCM

From March to October we simulated traffic tsunamis, injected chaos, and fixed every bottleneck before our merchants needed us most.

Kyle Petroski and Matthew Frail
9 min read · intermediate

Overview

The article details Shopify's extensive preparations for the Black Friday Cyber Monday (BFCM) weekend, emphasizing the importance of year-round resilience and proactive testing. Key strategies include capacity planning, chaos engineering through Game Days, and a comprehensive operational plan to ensure the platform can handle unprecedented traffic loads.

What You'll Learn

1. How to conduct chaos engineering exercises to test system resilience
2. Why continuous load testing is critical for infrastructure readiness
3. How to implement a Resiliency Matrix for documenting system vulnerabilities
4. When to perform scale testing to validate platform performance

Prerequisites & Requirements

  • Understanding of cloud infrastructure and load testing concepts
  • Familiarity with load testing and fault injection tools such as Genghis and Toxiproxy (optional)

Key Questions Answered

What strategies does Shopify use to prepare for BFCM traffic?
Shopify employs a comprehensive readiness program that includes capacity planning, infrastructure upgrades, chaos engineering through Game Days, and continuous load testing. These strategies ensure that the platform can handle extreme traffic loads and maintain performance during peak periods.
How does Shopify simulate production failures during testing?
Shopify uses Game Days to intentionally inject faults into its systems, simulating production failures at BFCM scale. This lets teams assess incident response and identify vulnerabilities in critical user paths such as checkout and payment processing.
What is the Resiliency Matrix and how is it used?
The Resiliency Matrix is a centralized documentation tool that tracks the operational state of critical services, failure scenarios, recovery procedures, and incident response playbooks. It is continuously updated to ensure teams are prepared for potential issues during high-traffic events.
What performance records did Shopify achieve during BFCM 2024?
During BFCM 2024, Shopify processed 57.3 PB of data, executed 10.5 trillion database queries, and peaked at 284 million requests per minute on edge servers. Shopify now treats that level of traffic as a regular day, which shows how far its infrastructure has scaled.

Key Statistics & Figures

Data processed during BFCM 2024
57.3 PB
This was part of the overall performance metrics achieved during the peak shopping weekend.
Database queries executed
10.5 trillion
This number reflects the scale of operations during BFCM 2024.
Peak requests per minute on edge servers
284 million
This peak was reached on Black Friday, showcasing the infrastructure's capability.
Peak requests per minute on app servers
80 million
This indicates the load handled by the application servers during peak traffic.
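
For a sense of scale, the per-minute peaks above can be converted to average per-second rates. The Python sketch below only does that arithmetic. The constant and function names are illustrative, and the result is an average over the peak minute, not a measured per-second peak.

```python
# Peak figures quoted above, in requests per minute.
EDGE_PEAK_PER_MIN = 284_000_000
APP_PEAK_PER_MIN = 80_000_000


def per_second(requests_per_minute: int) -> float:
    """Convert a per-minute request rate to an average per-second rate."""
    return requests_per_minute / 60


edge_rps = per_second(EDGE_PEAK_PER_MIN)
app_rps = per_second(APP_PEAK_PER_MIN)

print(f"edge: ~{edge_rps / 1e6:.2f} million requests/second")
print(f"app:  ~{app_rps / 1e6:.2f} million requests/second")
```

That works out to roughly 4.73 million requests per second at the edge and 1.33 million per second on app servers, averaged over the peak minute.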

Technologies & Tools

Cloud Infrastructure
Google Cloud
Used for capacity planning and multi-region strategy to ensure reliability during BFCM.
Messaging System
Kafka
Used for managing data streams and ensuring data freshness during traffic spikes.
Load Testing Tool
Genghis
Simulates user behavior and traffic patterns to identify breaking points in the infrastructure.
Network Simulation Tool
Toxiproxy
Injects network failures and partitions to test system resilience under adverse conditions.
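
The fault injection idea behind a tool like Toxiproxy can be illustrated with a minimal in-process sketch. This is not Toxiproxy's API (Toxiproxy sits between services as a network proxy). Everything here, including `with_faults`, `fetch_shipping_rate`, `checkout`, and the rates, is a hypothetical stand-in chosen to show the pattern: break a dependency on purpose, then check that the critical path degrades instead of failing.

```python
import random


class DependencyUnavailable(Exception):
    """Raised by the injected fault to mimic a failed network call."""


def with_faults(call, failure_rate, rng):
    """Wrap `call` so a fraction of invocations fail, like a flaky network link."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyUnavailable("injected fault")
        return call(*args, **kwargs)
    return wrapped


def fetch_shipping_rate(order_id):
    """Hypothetical downstream dependency on the checkout path."""
    return 4.99


def checkout(order_id, rate_lookup, fallback_rate=9.99):
    """Critical path under test: fall back to a flat rate instead of failing."""
    try:
        return {"order": order_id, "shipping": rate_lookup(order_id), "degraded": False}
    except DependencyUnavailable:
        return {"order": order_id, "shipping": fallback_rate, "degraded": True}


rng = random.Random(42)  # seeded so the drill is repeatable
flaky_lookup = with_faults(fetch_shipping_rate, failure_rate=0.3, rng=rng)

results = [checkout(i, flaky_lookup) for i in range(1000)]
degraded = sum(r["degraded"] for r in results)
print(f"{len(results)} checkouts completed, {degraded} served with the fallback rate")
```

The useful signal in a drill like this is not the failure count itself but whether every checkout still completes while the dependency is misbehaving.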

Key Actionable Insights

1. Implement chaos engineering practices to proactively identify system weaknesses before peak traffic periods.
   By simulating failures and testing critical user journeys, teams can build muscle memory for incident response and address vulnerabilities well in advance of high-traffic events.
2. Utilize a Resiliency Matrix to document and track system vulnerabilities and incident response procedures.
   This tool helps teams maintain awareness of potential issues and ensures that all members are prepared to respond effectively during incidents.
3. Conduct regular load testing to identify capacity limits and optimize infrastructure.
   By simulating user behavior and gradually ramping up traffic, teams can discover breaking points and make necessary adjustments before peak seasons.
4. Coordinate with cloud providers for capacity planning to avoid outages during high-traffic events.
   Submitting accurate traffic estimates to cloud providers ensures that sufficient resources are available to handle expected loads, preventing service disruptions.
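
The gradual ramp described in the third insight can be sketched as a loop that raises offered load step by step until an error budget is breached. This is a toy model, not how Genghis works. `simulated_service`, the 7,500 requests/second capacity, and the 1% error budget are all assumptions made for illustration, and in a real test `measure` would drive traffic and read error metrics.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    offered_rps: int
    error_rate: float


def simulated_service(offered_rps: int, capacity_rps: int) -> float:
    """Toy model: requests beyond capacity fail. Returns the error rate."""
    if offered_rps <= capacity_rps:
        return 0.0
    return (offered_rps - capacity_rps) / offered_rps


def ramp_until_break(start_rps, step_rps, max_rps, max_error_rate, measure):
    """Raise load one step at a time; stop at the first step over the error budget."""
    history = []
    rps = start_rps
    while rps <= max_rps:
        result = StepResult(rps, measure(rps))
        history.append(result)
        if result.error_rate > max_error_rate:
            break
        rps += step_rps
    return history


ERROR_BUDGET = 0.01  # assumed threshold: more than 1% errors counts as broken

history = ramp_until_break(
    start_rps=1_000,
    step_rps=1_000,
    max_rps=20_000,
    max_error_rate=ERROR_BUDGET,
    measure=lambda rps: simulated_service(rps, capacity_rps=7_500),
)

healthy = [s.offered_rps for s in history if s.error_rate <= ERROR_BUDGET]
last_healthy = max(healthy, default=None)
last_step = history[-1]  # in this run, the step that breached the budget

print(f"last healthy step: {last_healthy} rps")
print(f"stopped at: {last_step.offered_rps} rps ({last_step.error_rate:.1%} errors)")
```

Smaller steps narrow the gap between the last healthy step and the first failing one, at the cost of a longer test.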

Common Pitfalls

1. Failing to conduct thorough load testing can lead to unexpected system failures during peak traffic.
   Without identifying capacity limits in advance, teams may be unprepared for the actual load, leading to outages and poor user experiences.
2. Neglecting to update incident response procedures can result in inefficient handling of system failures.
   If teams do not regularly review and practice their incident response plans, they may struggle to respond effectively during critical incidents.
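
One way to guard against the second pitfall is to store a review date with each Resiliency Matrix entry and flag the ones that have gone stale. The sketch below is an assumption about how such a matrix could be modeled, not Shopify's actual format. The fields mirror what the matrix is described as tracking, while the services, playbook paths, dates, and the 90-day review window are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MatrixEntry:
    service: str
    operational_state: str
    failure_scenario: str
    recovery_procedure: str
    playbook: str
    last_reviewed: date


def stale_entries(matrix, today, max_age_days=90):
    """Return entries whose last review is older than the allowed age."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry for entry in matrix if entry.last_reviewed < cutoff]


# Hypothetical entries, for illustration only.
matrix = [
    MatrixEntry(
        service="checkout",
        operational_state="healthy",
        failure_scenario="payment provider timeout",
        recovery_procedure="fail over to secondary provider",
        playbook="playbooks/checkout-payments.md",
        last_reviewed=date(2025, 9, 1),
    ),
    MatrixEntry(
        service="search",
        operational_state="healthy",
        failure_scenario="index replica lag",
        recovery_procedure="serve cached results",
        playbook="playbooks/search-lag.md",
        last_reviewed=date(2025, 3, 15),
    ),
]

needs_review = stale_entries(matrix, today=date(2025, 10, 1))
for entry in needs_review:
    print(f"REVIEW NEEDED: {entry.service} / {entry.failure_scenario}")
```

Running a check like this on a schedule turns "keep the playbooks current" into something a team can see drifting before an incident exposes it.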

Related Concepts

Chaos Engineering
Load Testing Methodologies
Cloud Infrastructure Management
Incident Response Best Practices