Overview
The article discusses Pinterest's annual holiday load testing for its Druid system, focusing on ensuring the platform can handle increased traffic and data ingestion during peak holiday periods. It outlines the testing methodologies, results, and learnings from the process.
What You'll Learn
1
How to set up a load testing environment that mirrors production for Druid
2
Why using real production queries is crucial for accurate load testing
3
How to evaluate system health using QPS and P99 latency metrics
4
When to increase broker pool size based on traffic predictions
Prerequisites & Requirements
- Understanding of Druid architecture and query processing
- Familiarity with load testing tools and metrics(optional)
Key Questions Answered
How does Pinterest ensure its Druid system can handle increased holiday traffic?
Pinterest conducts annual load testing on its Druid system to ensure it can manage increased query traffic, data ingestion, and data volume during the holiday season. This involves creating a testing environment that mirrors production and using real production queries to simulate expected traffic.
What metrics are used to evaluate the performance of the Druid system during load testing?
The main metrics used to evaluate Druid's performance include Queries Per Second (QPS) and P99 latency. These metrics help identify bottlenecks in the system and ensure it meets the Service Level Agreements (SLAs) required by clients.
What are the benefits of using real production queries for load testing?
Using real production queries allows Pinterest to replicate actual user behavior more accurately, ensuring that the load tests reflect how the system will perform under real conditions. This approach helps identify potential issues that might not be evident with generated queries.
When should Pinterest consider increasing the broker pool size?
Pinterest should consider increasing the broker pool size when traffic predictions indicate that the current setup may not handle the expected load. This proactive measure helps prevent performance degradation during peak traffic periods.
Key Statistics & Figures
P99 latency
Increased from 250-500ms to 400-800ms
This change was observed when the traffic was doubled during load testing.
Expected QPS during holiday traffic
300 QPS baseline
This baseline was established to determine how many threads to use for testing increased holiday traffic.
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Backend
Druid
Used for real-time analytics and query processing during load testing.
Database
Mysql
Utilized for copying data segments from the production environment to the test environment.
Message Broker
Kafka
Ingestion source for testing increased data ingestion rates.
Key Actionable Insights
1Establish a dedicated testing environment that closely replicates production to ensure accurate load testing results.This setup allows for realistic simulations of traffic and data ingestion, helping to identify potential bottlenecks before they impact users during peak periods.
2Capture and analyze real production queries to improve the accuracy of load tests.This practice helps ensure that the load tests reflect actual user behavior, leading to better preparedness for high traffic scenarios.
3Monitor QPS and P99 latency metrics closely during testing to identify system health and performance issues.These metrics provide critical insights into how the system will perform under load, allowing for timely adjustments to infrastructure as needed.
4Communicate with client teams about expected traffic increases to better prepare for load testing.Understanding client expectations can help align testing efforts with real-world scenarios, ensuring that the system can handle anticipated demands.
Common Pitfalls
1
Failing to accurately estimate the expected increase in traffic can lead to inadequate testing.
Without proper estimations, teams may not prepare the system sufficiently, leading to performance issues during peak traffic times.
2
Neglecting to monitor system health metrics during load testing can result in undetected bottlenecks.
Monitoring metrics like QPS and P99 latency is crucial for identifying performance issues that could affect user experience.
Related Concepts
Load Testing Methodologies
Performance Metrics In Backend Systems
Data Ingestion Strategies