Overview
This article discusses the integration of batch and stream processing to create a near real-time dashboard for Recruiter usage statistics at LinkedIn. It highlights the challenges and solutions in maintaining data consistency and freshness while leveraging technologies like Apache Calcite and Apache Pinot.
What You'll Learn
1. How to leverage Apache Calcite to auto-generate streaming code from batch scripts
2. Why maintaining a single codebase for batch and stream processing is beneficial
3. How to implement event deduplication in streaming pipelines using Apache Samza
Prerequisites & Requirements
- Familiarity with batch and stream processing concepts
- Experience with Apache Kafka and Apache Samza (optional)
Key Questions Answered
What are the key requirements for the Recruiter usage statistics dashboard?
The dashboard requires data freshness with metrics updated every 10 minutes, backfill capability for past metrics, easy maintenance with a single codebase, consistency in metrics regardless of the engine used, and support for both additive and non-additive metrics.
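The difference between additive and non-additive metrics matters whenever partial results from several windows are combined. As a minimal sketch (the event layout and metric names below are hypothetical and not taken from the article), a count of actions can be summed across 10-minute windows, while a count of distinct users cannot:

```python
# Hypothetical events as (window, user_id) pairs; the names are illustrative only.
events = [
    ("10:00", "alice"),
    ("10:00", "bob"),
    ("10:10", "alice"),
    ("10:10", "carol"),
]


def per_window(events):
    """Group user ids by the window they occurred in."""
    windows = {}
    for window, user in events:
        windows.setdefault(window, []).append(user)
    return windows


windows = per_window(events)

# Additive metric: total actions. Summing per-window counts gives the true total.
actions_per_window = {w: len(users) for w, users in windows.items()}
total_actions = sum(actions_per_window.values())

# Non-additive metric: distinct users. Summing per-window distinct counts
# double-counts any user who is active in more than one window.
distinct_per_window = {w: len(set(users)) for w, users in windows.items()}
naive_distinct = sum(distinct_per_window.values())
true_distinct = len({user for _, user in events})

print(total_actions, naive_distinct, true_distinct)
```

Here the naive sum of per-window distinct counts overstates the number of users because "alice" appears in both windows, which is why non-additive metrics need to be computed over the full set of events (or with a mergeable sketch structure) rather than by adding partial results.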
How does the solution ensure data consistency between batch and streaming engines?
Apache Kafka runs in at-least-once mode, so the streaming pipeline can receive duplicate events, whereas Apache Gobblin deduplicates events on the batch side. Adding deduplication logic to the streaming pipeline closes this gap: with it in place, metrics from the two engines differ by 0.003% on average.
What challenges arise from using both batch and streaming engines?
Using both engines can lead to inconsistencies in metrics due to differences in event processing. For instance, duplicate events in streaming can cause discrepancies, while batch processing may deduplicate events, leading to different counts for the same metrics.
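The effect of duplicates can be illustrated with a small sketch. It assumes every event carries a unique id and keeps a bounded in-memory record of recently seen ids; the class and function names are hypothetical, and a real Samza job would more likely keep this state in a local key-value store than in a Python dictionary.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent `capacity` event ids and rejects repeats."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_new(self, event_id):
        if event_id in self._seen:
            return False
        self._seen[event_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True


def discrepancy_pct(streaming_count, batch_count):
    """Relative difference between the two engines, in percent."""
    return abs(streaming_count - batch_count) / batch_count * 100


# At-least-once delivery: "e2" and "e1" are delivered twice.
stream = ["e1", "e2", "e2", "e3", "e1"]

raw_count = len(stream)                      # streaming count without deduplication
dedup = Deduplicator()
deduped_count = sum(1 for e in stream if dedup.is_new(e))
batch_count = len(set(stream))               # batch side sees each event once

print(discrepancy_pct(raw_count, batch_count))
print(discrepancy_pct(deduped_count, batch_count))
```

The bounded capacity is a design trade-off: it keeps state small, but a duplicate that arrives after its id has been evicted would still be counted, which is one reason a small residual discrepancy can remain.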
Key Statistics & Figures
Average discrepancy in metrics
0.003%
This statistic reflects the effectiveness of the deduplication logic implemented in the streaming pipeline.
Metric update frequency
every 10 minutes
This ensures that customers have access to the most recent data on the dashboard.
Technologies & Tools
Backend
Apache Calcite
Used for auto-generating streaming Java code from batch Pig scripts.
Database
Apache Pinot
Serves as a real-time distributed OLAP datastore for querying metrics.
Messaging
Apache Kafka
Facilitates event streaming in at-least-once mode.
Stream Processing
Apache Samza
Processes streaming events and implements deduplication logic.
Data Integration
Apache Gobblin
Handles ETL processes and deduplication for batch processing.
Key Actionable Insights
1. Implement a single codebase for both batch and streaming processing to simplify maintenance and reduce operational overhead.
This approach minimizes the complexity of managing two separate codebases and ensures that all transformations are defined in one place, making it easier to update and maintain.
2. Utilize event deduplication techniques in streaming systems to enhance data accuracy.
Implementing deduplication logic can significantly reduce discrepancies in metrics, ensuring that the dashboard reflects accurate and reliable data for decision-making.
3. Leverage Apache Pinot for real-time OLAP capabilities to serve client requests efficiently.
Using Pinot allows for quick querying of metrics computed from both batch and streaming engines, ensuring users have access to the most recent data without significant delays.
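One common way to serve metrics from both engines in a Lambda-style setup is to prefer batch results for time windows the batch job has already covered and fall back to streaming results for newer windows. The sketch below illustrates that idea only; `merge_metrics`, the integer window keys, and the `time_boundary` parameter are assumptions made for the example and are not Pinot's query API or the article's implementation.

```python
def merge_metrics(batch, streaming, time_boundary):
    """Use batch values up to and including the boundary, streaming values after it."""
    merged = {}
    for window, value in batch.items():
        if window <= time_boundary:
            merged[window] = value
    for window, value in streaming.items():
        if window > time_boundary:
            merged[window] = value
    return merged


# Hypothetical per-window metric values keyed by window number.
batch = {1: 100, 2: 120}
streaming = {2: 121, 3: 90, 4: 95}

# The batch job has processed everything up to window 2.
merged = merge_metrics(batch, streaming, time_boundary=2)
print(merged)
```

For window 2, where both engines have a value, the batch result wins, so any small streaming discrepancy is replaced once batch processing catches up.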
Common Pitfalls
1. Neglecting the need for deduplication in streaming pipelines can lead to significant discrepancies in metrics.
Without proper deduplication, duplicate events processed by the streaming engine can inflate metrics, resulting in inaccurate reporting and poor decision-making.
2. Maintaining separate codebases for batch and streaming processing increases operational complexity.
This can lead to difficulties in updating logic consistently across both systems, increasing the risk of errors and inconsistencies in metric calculations.
Related Concepts
Lambda Architecture
Real-time Data Processing
Event-driven Architecture