How we reduced error rates and dropped latencies across merchants’ flows
Reading Time: 6 Minutes
Shopify merchants trust that when they build their stores on our platform, we’ve got their back. They can focus on their business while we handle everything else. Any failure or degradation puts our promise of a sturdy, battle-tested platform at risk, so we need to ensure that the platform stays up and stays reliable.
Since 2016, Shopify has grown from 375,000 merchants to over 600,000. As of today, our platform makes an average of 450,000 S3 operations per second. However, that rapid growth also came with an increased S3 error rate and increased read and write latencies.
While we use S3 at Shopify, this post should be relevant if your application uses any flavor of cloud storage and that usage strongly correlates with the growth of your user base, whether it’s storing user or event data. I’m hoping it provides some insight into how to optimize your cloud storage!
Overview
The article discusses how Shopify optimized its cloud storage usage to reduce error rates and latencies, particularly focusing on S3 operations. It highlights the challenges faced due to rapid merchant growth and presents solutions that improved reliability and performance.
What You'll Learn
How to optimize cloud storage usage to reduce error rates
Why partitioning strategies are crucial for performance in cloud storage
When to add randomness to asset names to avoid hitting rate limits
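As a rough illustration of the asset-naming idea above, here is a minimal sketch that prepends a short hash to each object key so that sequentially named assets no longer share one leading prefix. The function name `hashed_key`, the 4-character prefix length, and the example key layout are assumptions made for illustration; the summary does not state the exact scheme Shopify used.

```python
import hashlib


def hashed_key(original_key: str, prefix_len: int = 4) -> str:
    """Return the key with a short, deterministic hash prefix (hypothetical helper).

    The prefix is derived from the key itself, so the same asset always maps
    to the same object key, while similar names get unrelated prefixes.
    """
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"


if __name__ == "__main__":
    # Hypothetical key layout: sequential asset names for a single shop.
    for i in range(3):
        print(hashed_key(f"shops/1/assets/{i}.png"))
```

Because the prefix is computed from the key rather than drawn at random, readers can recompute it from the asset name and do not need a separate lookup table.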
Key Questions Answered
How did Shopify reduce S3 error rates and latencies?
What are SlowDown exceptions in AWS S3?
What impact did the changes have on latencies?
Key Actionable Insights
1. Implement a hashing strategy for asset naming to distribute writes across multiple partitions. This approach can help prevent hitting throughput limits in cloud storage, especially as user activity increases, ensuring smoother operations and reduced error rates.
2. Monitor S3 operations for SlowDown exceptions to identify potential performance bottlenecks. Understanding when these exceptions occur can help in adjusting strategies proactively, maintaining a reliable platform for users.
3. Consider partitioning strategies that allow for graceful splits as throughput increases. This can prevent abrupt failures in operations, ensuring that growth does not compromise service reliability.
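The monitoring suggestion in the second insight could be sketched roughly as follows: wrap each storage call, count how often it fails with the `SlowDown` error code, and re-raise so the caller's normal error handling still applies. `StorageError`, `track_slowdowns`, and `slowdown_counts` are hypothetical names used only for this example. With a real client such as boto3, the error code is typically read from `err.response["Error"]["Code"]` on a `ClientError` instead.

```python
from collections import Counter

# Per-operation tally of SlowDown responses (hypothetical in-memory metric).
slowdown_counts = Counter()


class StorageError(Exception):
    """Hypothetical stand-in for a client error that carries an S3 error code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def track_slowdowns(operation, name: str):
    """Run a storage call, record SlowDown errors under `name`, then re-raise."""
    try:
        return operation()
    except StorageError as err:
        if err.code == "SlowDown":
            slowdown_counts[name] += 1
        raise
```

In a production setup the counter would more likely feed a metrics system than an in-memory `Counter`, so that a rising SlowDown rate can be seen per operation type over time.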