Capacity Planning at Scale

We cover our approaches to capacity planning and how we rolled them out across the org to dozens of teams. We’ll also share how we validated our capacity plans with scalability tests to make sure they work.

Kathryn Tang
7 min read · intermediate

Overview

The article discusses Shopify's capacity planning strategies for handling increased traffic during the Black Friday and Cyber Monday (BFCM) shopping period. It outlines the collaboration with Google Cloud Platform, the importance of scalability testing, and the proactive measures taken to ensure system resilience and stability.

What You'll Learn

1. How to forecast traffic levels for capacity planning
2. Why scalability testing is crucial before high-traffic events
3. How to implement a master resourcing plan for cloud deployment

Prerequisites & Requirements

  • Understanding of cloud computing concepts
  • Experience with capacity planning in cloud environments (optional)

Key Questions Answered

How does Shopify prepare for increased traffic during BFCM?
Shopify prepares for increased traffic during BFCM by forecasting traffic levels with data scientists, creating a master resourcing plan for Google Cloud, and conducting scalability tests to identify potential bottlenecks. This proactive approach ensures that the platform can handle the expected surge in traffic and maintain stability.
What challenges does Shopify face during BFCM?
During BFCM, Shopify faces challenges such as ensuring clusters can handle double the virtual machines, avoiding limitations in network design, and managing increased traffic through logging pipelines. These challenges necessitate careful capacity planning and resource allocation.
What is the significance of scalability testing for Shopify?
Scalability testing is significant for Shopify as it helps identify limits within their tech stack before high-traffic events. By conducting tests like the 'Oktoberfest scale-up', Shopify can discover and resolve potential issues early, ensuring a smoother experience during actual peak times.
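
The idea of stepping load upward until something in the stack hits a limit can be sketched as follows. This is a simplified simulation, not Shopify's test tooling: the component names, the limit values, and both function names are hypothetical, and a real scalability test would drive actual traffic and observe real systems rather than compare against known limits.

```python
from typing import Dict, List, Optional, Tuple


def find_bottlenecks(limits: Dict[str, int], load: int) -> List[str]:
    """Return the components whose (assumed) limit is below the given load."""
    return sorted(name for name, limit in limits.items() if limit < load)


def first_limit_hit(
    limits: Dict[str, int], start: int, target: int, step: int
) -> Optional[Tuple[int, List[str]]]:
    """Ramp load from start to target and report the first level that exceeds a limit.

    Returns (load, exceeded_components), or None if the target is reached cleanly.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    load = start
    while load <= target:
        exceeded = find_bottlenecks(limits, load)
        if exceeded:
            return load, exceeded
        load += step
    return None


# Hypothetical limits, expressed in the same unit as load (e.g. virtual machines).
example_limits = {"cluster_vms": 2500, "network": 1500, "logging": 1800}
result = first_limit_hit(example_limits, start=1000, target=2000, step=250)
```

With these made-up numbers, the ramp stops at the first step above the network limit, which mirrors the point of testing early: the first limit found is often not the one the team expected.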

Key Statistics & Figures

Projected traffic levels for BFCM
2 times the normal traffic
This projection is based on historical data and forecasts, and it is used to ensure adequate resource allocation.
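
As a rough illustration of how a 2x traffic projection turns into a resource estimate, the sketch below scales a baseline count. The baseline figures, the headroom parameter, and the function name are hypothetical and do not come from the article. It also assumes resource needs grow linearly with traffic, which real systems often do not.

```python
import math


def required_units(baseline_units: int, traffic_multiplier: float, headroom: float = 0.0) -> int:
    """Estimate resource units needed at a projected traffic level.

    Assumes linear scaling with traffic; headroom is an extra safety
    fraction on top of the projection (0.25 means 25% more).
    """
    if baseline_units < 0 or traffic_multiplier < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil(baseline_units * traffic_multiplier * (1 + headroom))


# Hypothetical baseline of 100 VMs at normal traffic, projected to 2x.
vms_at_peak = required_units(100, 2.0)
vms_with_headroom = required_units(100, 2.0, headroom=0.25)
```

Rounding up with `math.ceil` avoids under-provisioning when the scaled figure is not a whole number.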

Technologies & Tools

Cloud Computing
Google Cloud Platform
Used for capacity planning and resource management during peak traffic events.
Container Orchestration
Google Kubernetes Engine
Facilitates scaling and managing containerized applications during high-demand periods.

Key Actionable Insights

1
Implement a master resourcing plan for cloud deployment to effectively manage capacity during peak traffic.
This plan should include detailed estimates of required resources such as CPUs and storage, allowing for flexibility in resource allocation and regional failover during high-demand periods.
2
Conduct regular scalability tests to identify and mitigate potential bottlenecks in your system.
By simulating peak traffic scenarios, teams can uncover hidden issues that may not be apparent until under load, thus enhancing overall system resilience.
3
Collaborate closely with cloud service providers to optimize capacity planning based on historical data and forecasts.
Engaging with providers like Google Cloud Platform can provide valuable insights and support in scaling resources effectively in anticipation of traffic surges.
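
One way to picture a master resourcing plan is as a table of per-team estimates rolled up by region, with a check on what each region would need if it had to absorb another region's load. The sketch below is only an illustration under stated assumptions: the team names, region names, figures, and the one-to-one failover pairing are all hypothetical and are not described in the article.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ResourceEstimate:
    team: str
    region: str
    cpus: int
    storage_tb: int


def totals_by_region(estimates: Iterable[ResourceEstimate]) -> Dict[str, Dict[str, int]]:
    """Sum CPU and storage estimates across teams for each region."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: {"cpus": 0, "storage_tb": 0})
    for e in estimates:
        totals[e.region]["cpus"] += e.cpus
        totals[e.region]["storage_tb"] += e.storage_tb
    return dict(totals)


def cpus_with_failover(
    totals: Dict[str, Dict[str, int]], absorbs: Dict[str, str]
) -> Dict[str, int]:
    """CPUs each region needs if it must also carry the region it absorbs.

    `absorbs` maps a region to the (assumed) single region it takes over from.
    """
    needed = {}
    for region, total in totals.items():
        partner = absorbs.get(region)
        partner_cpus = totals[partner]["cpus"] if partner in totals else 0
        needed[region] = total["cpus"] + partner_cpus
    return needed


# Hypothetical estimates submitted by teams.
plan = [
    ResourceEstimate("checkout", "us-east", cpus=400, storage_tb=10),
    ResourceEstimate("search", "us-east", cpus=200, storage_tb=5),
    ResourceEstimate("checkout", "us-central", cpus=300, storage_tb=8),
]
regional = totals_by_region(plan)
failover = cpus_with_failover(regional, {"us-east": "us-central", "us-central": "us-east"})
```

Keeping the estimates as individual records, rather than only the totals, makes it easier to trace a regional figure back to the team that requested it.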

Common Pitfalls

1
Failing to conduct scalability tests can lead to unexpected bottlenecks during peak traffic.
Many organizations only realize their limitations when it's too late, which can result in downtime and a poor experience for users.