Overview
The article discusses the ads indexing system at Pinterest, detailing its architecture, design, and implementation. It emphasizes the importance of real-time data processing for ad delivery, highlighting the system's ability to handle over 100 million documents with seconds-level end-to-end latency.
What You'll Learn
1. How to design a scalable incremental indexing system for ad delivery
2. Why data freshness and correctness are critical in ad indexing
3. How to implement a hybrid real-time and batch processing pipeline
4. When to use distributed transactions in data processing
Prerequisites & Requirements
- Understanding of data processing pipelines and indexing concepts
- Familiarity with Kafka and HBase (optional)
Key Questions Answered
How does Pinterest achieve seconds-level indexing latency?
Pinterest's ads indexing system combines a real-time incremental pipeline with a batch processing pipeline. The incremental path applies updates quickly so that ad documents stay fresh, which gives seconds-level end-to-end latency across more than 100 million documents.
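To make the hybrid idea concrete, here is a minimal sketch of overlaying incremental updates on a batch snapshot. It is not Pinterest's implementation: the `AdDoc` type, the `version` field, and the `merge_views` function are hypothetical names chosen for illustration, and the rule "the newer version wins" is an assumption about how the two paths could be reconciled.

```python
from dataclasses import dataclass


@dataclass
class AdDoc:
    doc_id: str
    version: int   # hypothetical: a monotonically increasing update timestamp
    payload: dict


def merge_views(batch, incremental):
    """Overlay incremental updates on a batch snapshot, keeping the newer version."""
    merged = dict(batch)
    for doc_id, doc in incremental.items():
        current = merged.get(doc_id)
        if current is None or doc.version > current.version:
            merged[doc_id] = doc
    return merged


# The batch snapshot covers every document but may be hours old.
batch = {
    "ad1": AdDoc("ad1", 10, {"budget": 100}),
    "ad2": AdDoc("ad2", 10, {"budget": 50}),
}
# The incremental path carries only recent changes.
incremental = {
    "ad2": AdDoc("ad2", 12, {"budget": 0}),   # fresher than the batch copy
    "ad3": AdDoc("ad3", 11, {"budget": 75}),  # not yet in the batch snapshot
}
served = merge_views(batch, incremental)
```

In this sketch the batch path provides full coverage and the incremental path provides freshness, so a stale batch copy can never overwrite a more recent update.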
What are the core components of the ads indexing system?
The core components include Gateway, Updater, Storage Repo, and Argus. Each component plays a distinct role in processing updates, managing data storage, and generating servable ad documents, ensuring efficient data flow and system scalability.
What are the responsibilities of the real-time incremental pipeline?
The real-time incremental pipeline supports distributed transactions, delivers push-based notifications for changes to ads control data, and maintains two logical pipelines, one high priority and one medium priority, to keep latency low and data consistent during processing.
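The effect of having high- and medium-priority pipelines can be sketched with a single priority queue. This is a simplification under stated assumptions: the `PriorityPipeline` class is hypothetical, the example update messages are invented, and which kinds of updates the real system routes to each priority is not stated in the summary.

```python
import heapq
import itertools

HIGH, MEDIUM = 0, 1  # a lower number is processed first


class PriorityPipeline:
    """Toy stand-in for two logical pipelines sharing one consumer."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def publish(self, priority, update):
        heapq.heappush(self._heap, (priority, next(self._seq), update))

    def drain(self):
        while self._heap:
            _, _, update = heapq.heappop(self._heap)
            yield update


pipeline = PriorityPipeline()
pipeline.publish(MEDIUM, "creative text changed")
pipeline.publish(HIGH, "campaign paused")
pipeline.publish(MEDIUM, "targeting refreshed")
order = list(pipeline.drain())
```

Here the high-priority update is processed before both medium-priority ones even though it was published second, while updates of equal priority keep their arrival order.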
What challenges does the ads indexing system face in production?
The system encounters challenges such as pipeline clogging due to spikes in updates, incorrect data introduced by releases, and drops in data quality or coverage. Strategies like shutting down real-time serving or rolling back to previous versions are employed to mitigate these issues.
Key Statistics & Figures
Ads control data update-to-serve p90 latency
< 60 seconds
The p90 latency stays under this bound 99.9% of the time, demonstrating the system's efficiency.
Ads control data update-to-serve max latency
< 24 hours
This upper bound holds for every update.
Daily number of dropped messages in incremental pipeline
Single-digit
This indicates a high level of reliability in the processing pipeline.
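As a reminder of what a p90 figure measures, the sketch below computes a nearest-rank percentile over a handful of update-to-serve latencies and checks it against the 60-second target quoted above. The sample values are invented for illustration, and the nearest-rank definition is one common choice; the summary does not say how Pinterest computes the metric.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical update-to-serve latencies in seconds.
latencies_s = [4, 6, 7, 9, 12, 15, 18, 22, 41, 95]
p90 = percentile(latencies_s, 90)
meets_target = p90 < 60
```

Note that a p90 target tolerates slow outliers: the 95-second sample does not break the 60-second target here, which is why the summary pairs it with a separate maximum-latency bound.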
Technologies & Tools
Backend
Kafka
Used for streaming data updates and notifications in the ads indexing system.
Database
HBase
Serves as the transactional NoSQL database for the Storage Repo.
Database
Apache Omid
Provides transaction management on top of HBase.
Key Actionable Insights
1. Implement a hybrid indexing system to balance real-time and batch processing needs. This approach allows for quick updates and ensures data integrity, which is crucial for maintaining advertiser trust and optimizing ad delivery performance.
2. Utilize distributed transactions to maintain data consistency during high-volume updates. This is particularly important in environments with large-scale concurrent processing, as it helps prevent data corruption and ensures reliable ad targeting.
3. Regularly monitor system health metrics to ensure data freshness and coverage. By keeping track of E2E latency and data consistency, teams can quickly identify and address potential issues, maintaining a high-quality ad delivery experience.
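The second insight, on transactional consistency, can be illustrated with a toy in-memory store that rejects write-write conflicts in the style of snapshot isolation. This is only a sketch of the general technique: `ToyStore`, `Txn`, and `ConflictError` are hypothetical names, the keys are invented, and none of this is the Apache Omid or HBase API.

```python
import itertools


class ConflictError(Exception):
    """Raised when another transaction committed a write to the same key first."""


class ToyStore:
    def __init__(self):
        self._clock = itertools.count(1)  # one logical clock for start and commit timestamps
        self._data = {}                   # key -> latest committed value
        self._committed_at = {}           # key -> commit timestamp of the latest write

    def begin(self):
        return Txn(self, next(self._clock))

    def get(self, key):
        return self._data.get(key)


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered until commit, so a failed commit changes nothing

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort if any key we wrote was committed by someone else after we started.
        for key in self.writes:
            if self.store._committed_at.get(key, 0) > self.start_ts:
                raise ConflictError(key)
        commit_ts = next(self.store._clock)
        for key, value in self.writes.items():
            self.store._data[key] = value
            self.store._committed_at[key] = commit_ts
        return commit_ts


store = ToyStore()
t1 = store.begin()
t2 = store.begin()            # starts before t1 commits
t1.put("ad1:budget", 100)
t1.put("ad1:status", "active")
t1.commit()                   # both keys become visible together

t2.put("ad1:budget", 0)
try:
    t2.commit()
    conflict = False
except ConflictError:
    conflict = True           # t2 must retry against the newer state
```

Because writes are buffered and applied only after the conflict check, a concurrent updater either sees all of a transaction's changes or none of them, which is the property that keeps an ad document from being assembled out of half-applied updates.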
Common Pitfalls
1. Failing to monitor the health of the real-time pipeline can lead to significant delays in ad updates.
Without proper monitoring, spikes in data updates can overwhelm the system, causing latency issues and impacting ad delivery performance.
2. Introducing incorrect data during releases can compromise the integrity of the ads indexing system.
To avoid this, it's crucial to implement robust staging and monitoring processes to catch issues before they reach production.
Related Concepts
Data Processing Pipelines
Incremental Indexing
Real-time Data Processing
Distributed Transactions