How to Reliably Scale Your Data Platform for High Volumes

Arbab Ahmed

In this post, we’ll outline the approach we took to reliably scale our data platform in preparation for Black Friday and Cyber Monday.

Shopify


Apache, Apache Spark, Golang, Kubernetes, MySQL, SQL

Overview

This article discusses how Shopify's Data Platform Engineering team scaled their data platform to handle the unprecedented data volume during Black Friday and Cyber Monday (BFCM). It outlines their strategies for ingesting, processing, and delivering data reliably so that merchants have uninterrupted access to critical information.

What You'll Learn

1. How to prepare a data platform for high-volume events like Black Friday and Cyber Monday

2. Why tiered services are essential for prioritizing data reliability and infrastructure budgets

3. How to identify and manipulate service-specific levers to maintain data freshness and latency

4. When to run load tests and what metrics to evaluate for service reliability

Prerequisites & Requirements

Understanding of data ingestion and processing concepts
Familiarity with cloud infrastructure and Kubernetes (optional)

Key Questions Answered

What strategies did Shopify use to scale their data platform for BFCM?

Shopify's Data Platform Engineering team focused on preparing their ingestion and processing systems by identifying primary service objectives, pinpointing service-specific levers, running load tests, and confirming mitigation strategies. This proactive approach ensured their systems could handle the anticipated increase in data volume during BFCM.

How does Shopify prioritize its data services?

Shopify employs a tiered services taxonomy that categorizes data services based on their impact on merchants. Tier 1 services are critical externally, while Tier 2 services are critical internally. This approach helps allocate resources effectively and ensures reliability where it matters most.
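
A tiered taxonomy like the one described above can be sketched as a small data model. This is a minimal illustration, not Shopify's implementation: the `Tier` and `DataService` types, the `prioritize` helper, and both service names are hypothetical, and only the Tier 1 (critical externally) and Tier 2 (critical internally) definitions come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Lower number means higher priority."""
    EXTERNAL_CRITICAL = 1  # Tier 1: critical externally (merchant-facing)
    INTERNAL_CRITICAL = 2  # Tier 2: critical internally


@dataclass(frozen=True)
class DataService:
    name: str
    tier: Tier


def prioritize(services):
    """Return services ordered by tier, then by name for a stable result."""
    return sorted(services, key=lambda s: (s.tier, s.name))


# Hypothetical service names and tier assignments, for illustration only.
services = [
    DataService("internal-dashboard", Tier.INTERNAL_CRITICAL),
    DataService("merchant-analytics", Tier.EXTERNAL_CRITICAL),
]
ordered = [s.name for s in prioritize(services)]
print(ordered)
```

Sorting by tier first gives a simple order in which to spend reliability effort and infrastructure budget when both cannot cover every service.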

What is throughput risk and how does it affect data services?

Throughput risk refers to the potential reliability issues that arise as data throughput requirements increase, particularly affecting ingestion and processing systems. During high-volume events like BFCM, this risk necessitates careful planning and resource allocation to maintain service reliability.

What metrics did Shopify track during their load tests?

During load tests, Shopify tracked metrics such as service uptime, the ability of the underlying code to handle resource constraints, and the responsiveness of service alarms. This data helped them identify potential reliability risks and refine their mitigation strategies.
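
The three metrics named above could be checked after a load test along the following lines. This is a sketch under stated assumptions: the `LoadTestResult` fields, the `reliability_risks` function, and both thresholds (99.9 percent uptime, a 300-second alarm delay) are invented for illustration and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    uptime_ratio: float          # fraction of the test window the service was up
    survived_resource_cap: bool  # code kept working under constrained resources
    alarm_delay_seconds: float   # time from threshold breach to alarm firing


def reliability_risks(result, min_uptime=0.999, max_alarm_delay=300.0):
    """Return a list of risks found; both thresholds are assumed values."""
    risks = []
    if result.uptime_ratio < min_uptime:
        risks.append("uptime below target")
    if not result.survived_resource_cap:
        risks.append("code failed under resource constraints")
    if result.alarm_delay_seconds > max_alarm_delay:
        risks.append("alarms responded too slowly")
    return risks


# Hypothetical result from one load test run.
result = LoadTestResult(uptime_ratio=0.995,
                        survived_resource_cap=True,
                        alarm_delay_seconds=420.0)
print(reliability_risks(result))
```

Each risk returned by a check like this would feed back into the mitigation strategy for that service before the peak event.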

Key Statistics & Figures

Average throughput increase

150 percent

This increase was observed in the Shopify data platform during the BFCM sales weekend.

Monthly MySQL records processed

880 billion

This is the average volume of records processed by the Shopify data platform each month.

Monthly Kafka messages processed

1.75 trillion

This is the average volume of Kafka messages processed by the Shopify data platform each month.
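
To get a feel for these figures, the monthly volumes can be converted into rough per-second rates and scaled by the BFCM increase. The sketch below rests on assumptions that are not in the source: a 30-day month, and reading "150 percent increase" as 2.5 times the baseline. The source also does not say that the increase applies uniformly to MySQL records and Kafka messages, so treat the results as an order-of-magnitude estimate only.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def per_second(monthly_volume):
    """Average events per second for a given monthly volume."""
    return monthly_volume / SECONDS_PER_MONTH


def peak_rate(baseline_rate, increase_percent):
    """Apply a percentage increase on top of the baseline rate."""
    return baseline_rate * (1 + increase_percent / 100)


mysql_rate = per_second(880e9)     # 880 billion MySQL records per month
kafka_rate = per_second(1.75e12)   # 1.75 trillion Kafka messages per month

print(f"MySQL baseline: {mysql_rate:,.0f}/s, at +150%: {peak_rate(mysql_rate, 150):,.0f}/s")
print(f"Kafka baseline: {kafka_rate:,.0f}/s, at +150%: {peak_rate(kafka_rate, 150):,.0f}/s")
```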

Technologies & Tools

Processing Framework

Apache Spark

Used for processing data in batches to form business insights.

Data Modeling Tool

dbt

Utilized for enriching batches of data with models developed by data scientists.

Cloud Infrastructure

Google Kubernetes Engine

Hosts services like Longboat and Reportify, allowing for scalable resource management.

Database

Bigtable

Supports the internal collection of streaming and serving applications.

Database

Cloud SQL

Part of the backend infrastructure for merchant-facing analytics.

Key Actionable Insights

1. Implement a tiered services approach to prioritize data reliability based on impact.
By categorizing services into tiers, teams can allocate resources more effectively, ensuring that critical services receive the attention they need during high-demand periods.

2. Conduct regular load tests to simulate high-traffic scenarios and identify potential bottlenecks.
Load testing helps teams understand how their systems will perform under pressure, allowing them to make informed decisions about scaling and resource allocation before peak events.

3. Establish clear service objectives and levers for each data service to maintain performance.
Knowing what levers to adjust, such as job frequency or resource allocation, allows teams to respond quickly to changing demands and ensure data freshness and latency are maintained.
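
As an illustration of the job frequency lever, the sketch below shortens the interval between job runs when data is staler than a freshness objective. The function name, its parameters, the halving rule, and the 5-minute floor are all hypothetical. It also assumes that running a job more often improves freshness, which depends on the service: for some workloads the helpful direction may be the opposite.

```python
def adjust_job_interval(current_interval_min, freshness_lag_min,
                        freshness_target_min, min_interval_min=5):
    """Halve the interval between runs when data is staler than the target.

    All values are in minutes. The interval never drops below
    min_interval_min, an assumed floor that protects the scheduler.
    """
    if freshness_lag_min <= freshness_target_min:
        return current_interval_min  # objective met, leave the lever alone
    return max(min_interval_min, current_interval_min // 2)


# Hypothetical readings: data is 90 minutes stale against a 45-minute target.
new_interval = adjust_job_interval(60, freshness_lag_min=90, freshness_target_min=45)
print(new_interval)
```

Keeping the lever as a pure function of observed lag and target makes it easy to rehearse during a load test before relying on it in an incident.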

Common Pitfalls

1. Failing to run load tests before high-traffic events can lead to unexpected system failures.
Without load testing, teams may not identify bottlenecks or performance issues, resulting in poor user experiences during peak times.

2. Neglecting to update mitigation strategies can leave services vulnerable to outages.
If mitigation strategies are outdated, teams may struggle to respond effectively to incidents, leading to prolonged downtime and frustration for users.

Related Concepts

Data Ingestion Strategies

Data Processing Frameworks

Cloud Infrastructure Management

Service Reliability Engineering