Future Proofing Our Cloud Storage Usage

Andrew Louis

How we reduced error rates and dropped latencies across merchants’ flows

Shopify merchants trust that when they build their stores on our platform, we’ve got their back. They can focus on their business while we handle everything else. Any failure or degradation puts our promise of a sturdy, battle-tested platform at risk, so we need to ensure that the platform stays up and stays reliable.

Since 2016, Shopify has grown from 375,000 merchants to over 600,000. As of today, an average of 450,000 S3 operations per second are made through our platform. That rapid growth also came with an increased S3 error rate and increased read and write latencies.

While we use S3 at Shopify, this post may be useful if your application uses any flavor of cloud storage and that usage strongly correlates with the growth of your user base, whether it’s storing user or event data. I’m hoping it provides some insight into how to optimize your cloud storage!

Overview

The article discusses how Shopify optimized its cloud storage usage to reduce error rates and latencies, particularly focusing on S3 operations. It highlights the challenges faced due to rapid merchant growth and presents solutions that improved reliability and performance.

What You'll Learn

1. How to optimize cloud storage usage to reduce error rates
2. Why partitioning strategies are crucial for performance in cloud storage
3. When to implement randomness in asset naming to prevent rate limits

Key Questions Answered

How did Shopify reduce S3 error rates and latencies?

Shopify reduced S3 error rates and latencies by implementing a hashing strategy for asset naming, which distributed writes across multiple partitions. This approach minimized the likelihood of hitting throughput limits and significantly decreased the occurrence of SlowDown exceptions, leading to improved reliability.
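A minimal sketch of what hash-based asset naming could look like follows. The summary does not give Shopify's actual key layout, so the function name `hashed_key`, the `shop_id/filename` layout, the choice of SHA-256, and the four-character prefix are all assumptions made for illustration.

```python
import hashlib


def hashed_key(shop_id: int, filename: str, prefix_len: int = 4) -> str:
    """Build an object key that starts with a short hash-derived prefix.

    Hypothetical layout: "<hash prefix>/<shop_id>/<filename>".
    The prefix is derived from the natural key, so the same asset always
    maps to the same object key and can be found again without a lookup table.
    """
    natural_key = f"{shop_id}/{filename}"
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


key = hashed_key(1234, "products/red-shirt.png")
print(key)  # a 4-character hex prefix followed by "/1234/products/red-shirt.png"
```

Because the prefix is computed from the key itself rather than drawn at random, reads need no extra state: recomputing the hash yields the same location.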

What are SlowDown exceptions in AWS S3?

SlowDown exceptions in AWS S3 occur when a partition exceeds its request rate limit, causing all operations on that partition to fail temporarily. Shopify experienced these exceptions as their platform grew, impacting multiple merchants due to high request rates.
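One way to see which key prefixes are drawing these errors is to tally them from request logs. The sketch below assumes a hypothetical log shape of `(object_key, error_code)` pairs, where `error_code` is `None` on success; only the `SlowDown` error code itself comes from S3, and the function name and sample data are made up for illustration.

```python
from collections import Counter


def slowdown_counts_by_prefix(events, depth=1):
    """Count SlowDown errors per leading key prefix.

    `events` is an iterable of (object_key, error_code) pairs; this shape
    is an assumption, not a real log format. `depth` is how many
    "/"-separated segments of the key make up the prefix.
    """
    counts = Counter()
    for key, error_code in events:
        if error_code == "SlowDown":
            prefix = "/".join(key.split("/")[:depth])
            counts[prefix] += 1
    return counts


# Made-up sample events for illustration only.
events = [
    ("shop-1/a.png", "SlowDown"),
    ("shop-1/b.png", "SlowDown"),
    ("shop-2/c.png", None),
    ("shop-2/d.png", "SlowDown"),
    ("shop-1/e.png", "NoSuchKey"),
]
hot = slowdown_counts_by_prefix(events).most_common(1)
print(hot)  # [('shop-1', 2)]
```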

What impact did the changes have on latencies?

After implementing the hashing strategy, Shopify observed a 60% reduction in median latencies and a 25% reduction in the 95th percentile latencies for S3 operations. This improvement enhanced the overall performance of their cloud storage usage.

Key Statistics & Figures

S3 operations per second: 450,000. This is the average number of S3 operations being made through Shopify's platform.

Reduction in median latencies: 60%. This reduction was observed after implementing the hashing strategy for asset naming.

Reduction in 95th percentile latencies: 25%. This improvement was also a result of the changes made to asset naming strategies.

Technologies & Tools

Cloud Storage: AWS S3. Used for storing merchant-uploaded data such as product images and theme assets.

Key Actionable Insights

1. Implement a hashing strategy for asset naming to distribute writes across multiple partitions. This approach can help prevent hitting throughput limits in cloud storage, especially as user activity increases, ensuring smoother operations and reduced error rates.

2. Monitor S3 operations for SlowDown exceptions to identify potential performance bottlenecks. Understanding when these exceptions occur can help in adjusting strategies proactively, maintaining a reliable platform for users.

3. Consider partitioning strategies that allow for graceful splits as throughput increases. This can prevent abrupt failures in operations, ensuring that growth does not compromise service reliability.
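To illustrate why a hashed prefix spreads load, the small simulation below compares where keys land when grouped by their first character. This is a simplified stand-in: how S3 actually partitions a keyspace is internal to the service, and the shop ID, key layout, and helper names here are hypothetical.

```python
import hashlib
from collections import Counter


def first_char_histogram(keys):
    """Group keys by their first character, a crude proxy for a partition."""
    return Counter(key[0] for key in keys)


def with_hash_prefix(key, prefix_len=4):
    """Prepend a short hex prefix derived from the key (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


# Keys written by one busy (hypothetical) shop all share the same leading text.
plain_keys = [f"1234/assets/image-{i}.png" for i in range(10_000)]
hashed_keys = [with_hash_prefix(k) for k in plain_keys]

plain_hist = first_char_histogram(plain_keys)
hashed_hist = first_char_histogram(hashed_keys)

print(len(plain_hist))   # every plain key falls into a single bucket
print(len(hashed_hist))  # hashed keys spread over the hex digits
```

In this toy model, all of the shop's unhashed writes pile into one bucket, while the hashed keys divide roughly evenly, so no single bucket absorbs the shop's full write rate.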

Common Pitfalls

1. Failing to implement randomness in asset naming can lead to hitting S3 partition throughput limits. Without this randomness, writes from a single shop could overwhelm a partition, resulting in SlowDown exceptions that affect multiple merchants.