Overview
The article discusses the upgrade of Pinterest's operational metrics system, detailing the transition from the deprecated Ostrich library to an in-house solution called Pinterest StatsCollector. This upgrade aims to enhance metrics accuracy, reduce storage requirements, and improve overall performance in monitoring and alerting systems.
What You'll Learn
1. How to implement an in-house metrics collector for Java services
2. Why accurate service-level aggregation of metrics is crucial for operational health
3. How to optimize metrics reporting using Thread-Local Stats
Prerequisites & Requirements
- Understanding of metrics collection and reporting in software systems
- Familiarity with Java programming language
Key Questions Answered
What are the main reasons for upgrading Pinterest's metrics system?
The upgrade is necessary due to the deprecation of the Ostrich library, which has created technical debt and hindered accurate service-level aggregation of metrics. The new system aims to provide more accurate metrics and reduce storage requirements.
How does Pinterest StatsCollector improve metrics collection?
Pinterest StatsCollector is designed to be thread-safe and allows for the collection of Counters, Gauges, and Histograms. It optimizes performance by caching synchronized hashmap lookups and implementing Thread-Local Stats for better efficiency.
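The thread-local idea can be illustrated with a minimal sketch. This is not the Pinterest StatsCollector code: the class name `ThreadLocalCounters` and its methods are hypothetical, and the sketch assumes that each thread writes to its own private map so that increments never contend, while a reporter sums the per-thread values on demand.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of thread-local counters; not the actual StatsCollector API.
final class ThreadLocalCounters {
    // Every thread's private map is registered here so a reporter can sum them.
    private final Queue<Map<String, AtomicLong>> allThreadMaps = new ConcurrentLinkedQueue<>();

    // Each thread lazily creates its own map; only that thread ever writes to it.
    private final ThreadLocal<Map<String, AtomicLong>> local = ThreadLocal.withInitial(() -> {
        Map<String, AtomicLong> m = new ConcurrentHashMap<>();
        allThreadMaps.add(m);
        return m;
    });

    // Hot path: no shared lock, and the AtomicLong has a single writer thread.
    void incr(String name) {
        local.get().computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
    }

    // Reporting path: sum the counter across all per-thread maps.
    long total(String name) {
        long sum = 0;
        for (Map<String, AtomicLong> m : allThreadMaps) {
            AtomicLong c = m.get(name);
            if (c != null) {
                sum += c.get();
            }
        }
        return sum;
    }

    // Runs `threads` workers that each increment "requests" `perThread` times.
    static long demo(int threads, int perThread) {
        ThreadLocalCounters stats = new ThreadLocalCounters();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    stats.incr("requests");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stats.total("requests");
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000)); // prints 4000
    }
}
```

The trade-off in this sketch is that reads become more expensive (one lookup per thread) in exchange for cheap writes, which suits metrics that are written far more often than they are reported. A production version would also need to retire the maps of threads that have exited.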
What is the impact of the MABS pipeline on metrics storage?
The MABS pipeline significantly reduces metrics storage requirements by aggregating metrics across services, resulting in up to 99% savings in metrics storage. This allows for more efficient data management and accurate monitoring.
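To make the storage saving concrete, here is a small sketch of aggregation before storage. It assumes, as one plausible reading of the summary, that per-host series are collapsed into a single series per service and metric; the `Point` class, the key format, and the sample values are all hypothetical and do not reflect the actual MABS implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: collapse the host dimension before writing to storage.
final class ServiceLevelAggregation {
    static final class Point {
        final String service;
        final String host;
        final String metric;
        final long value;

        Point(String service, String host, String metric, long value) {
            this.service = service;
            this.host = host;
            this.metric = metric;
            this.value = value;
        }
    }

    // Sums counter values per (service, metric), dropping the host tag.
    static Map<String, Long> aggregate(List<Point> points) {
        Map<String, Long> out = new TreeMap<>();
        for (Point p : points) {
            out.merge(p.service + "." + p.metric, p.value, Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Point> raw = Arrays.asList(
                new Point("feed", "host-1", "requests", 120),
                new Point("feed", "host-2", "requests", 80),
                new Point("feed", "host-3", "requests", 100));
        // Three per-host series become one service-level series.
        System.out.println(aggregate(raw)); // prints {feed.requests=300}
    }
}
```

With N hosts per service, the number of stored series for a counter drops by roughly a factor of N under this assumption, which is how large fleets can reach savings on the order the summary describes.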
What optimizations were made in the design of the Gauge API?
The Gauge API was improved by using WeakReference objects, allowing monitored objects to be garbage collected when no longer needed. This reduces memory footprint and ensures that monitoring does not negatively impact application performance.
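A gauge backed by a `WeakReference` might look like the following sketch. The names `WeakGaugeRegistry`, `register`, and `sample` are hypothetical and are not taken from the StatsCollector API; the sketch only shows the general pattern of holding the monitored object weakly and dropping the gauge once the object has been collected.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of a gauge registry that does not keep its targets alive.
final class WeakGaugeRegistry {
    private static final class Gauge<T> {
        final WeakReference<T> ref;       // does not prevent garbage collection
        final ToDoubleFunction<T> reader; // must not capture the target strongly

        Gauge(T target, ToDoubleFunction<T> reader) {
            this.ref = new WeakReference<>(target);
            this.reader = reader;
        }

        OptionalDouble read() {
            T target = ref.get();
            return target == null
                    ? OptionalDouble.empty()
                    : OptionalDouble.of(reader.applyAsDouble(target));
        }
    }

    private final Map<String, Gauge<?>> gauges = new ConcurrentHashMap<>();

    <T> void register(String name, T target, ToDoubleFunction<T> reader) {
        gauges.put(name, new Gauge<>(target, reader));
    }

    // Returns the current value, or empty if the gauge is unknown or its target is gone.
    OptionalDouble sample(String name) {
        Gauge<?> g = gauges.get(name);
        if (g == null) {
            return OptionalDouble.empty();
        }
        OptionalDouble value = g.read();
        if (!value.isPresent()) {
            gauges.remove(name, g); // target was collected, so drop the gauge too
        }
        return value;
    }

    public static void main(String[] args) {
        WeakGaugeRegistry registry = new WeakGaugeRegistry();
        List<String> queue = new ArrayList<>();
        queue.add("job-1");
        queue.add("job-2");
        registry.register("queue.size", queue, q -> q.size());
        System.out.println(registry.sample("queue.size").getAsDouble()); // prints 2.0
        System.out.println(queue.size()); // keeps `queue` strongly reachable until here
    }
}
```

The reader function receives the target as an argument instead of closing over it. If the lambda captured the target directly, the registry would hold a strong reference through the lambda and the weak reference would have no effect.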
Key Statistics & Figures
Monthly active users
300 million
This milestone highlights the scale at which Pinterest operates and the importance of reliable metrics for such a large user base.
Reduction in metrics storage
99%
The MABS pipeline implementation led to a drastic reduction in the dimensions of stored metrics, optimizing storage costs.
Technologies & Tools
Backend
Java
Used to develop the Pinterest StatsCollector for metrics collection and reporting.
Streaming
Kafka
Utilized as a streaming buffer in the MABS pipeline for metrics processing.
Data Processing
Spark
Employed for aggregating metrics in the MABS pipeline.
Database
Goku
Pinterest's in-house time series database for storing aggregated metrics.
Data Structure
T-digest
Used for backing histogram metrics in the new metrics collection system.
Key Actionable Insights
1. Transitioning to an in-house metrics collector can significantly enhance the accuracy of operational metrics. By implementing a custom solution like Pinterest StatsCollector, organizations can tailor their metrics collection to their specific needs, leading to better insights and operational health.
2. Utilizing Thread-Local Stats can optimize performance in high-throughput environments. This approach minimizes synchronization overhead, making it ideal for applications that require rapid metrics reporting without compromising on accuracy.
3. Adopting a language-agnostic metrics aggregation pipeline can streamline data processing across diverse services. This ensures that all services can report metrics consistently, improving overall data reliability and reducing the complexity of metrics management.
Common Pitfalls
1. Relying on deprecated libraries can lead to technical debt and hinder performance.
Using outdated libraries like Ostrich can create challenges in maintaining accurate metrics and may require significant refactoring to replace.
2. Incorrectly aggregating percentiles can lead to false alerts.
Averaging percentiles that were computed separately on each host does not yield the percentile of the service as a whole, because it ignores how many requests each host handled. The resulting value can mislead operational decisions and trigger alerts that do not reflect real user impact.
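A small worked example shows how far an average of percentiles can drift from the true value. The hosts, request counts, and latencies below are invented for illustration, and the nearest-rank percentile definition is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: compares an average of per-host p99 values with the p99 of all requests.
final class PercentileAveraging {
    // Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order.
    static long percentile(List<Long> samples, int p) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (p * sorted.size() + 99) / 100; // integer ceiling of p * n / 100
        return sorted.get(Math.max(rank, 1) - 1);
    }

    static List<Long> repeat(long value, int count) {
        return new ArrayList<>(Collections.nCopies(count, Long.valueOf(value)));
    }

    public static void main(String[] args) {
        List<Long> hostA = repeat(10L, 1000); // busy host: 1000 requests at 10 ms
        List<Long> hostB = repeat(1000L, 10); // nearly idle host: 10 requests at 1000 ms

        long averaged = (percentile(hostA, 99) + percentile(hostB, 99)) / 2;

        List<Long> merged = new ArrayList<>(hostA);
        merged.addAll(hostB);
        long actual = percentile(merged, 99);

        System.out.println("average of per-host p99: " + averaged + " ms"); // 505 ms
        System.out.println("p99 over all requests:   " + actual + " ms");   // 10 ms
    }
}
```

Here the slow host serves fewer than 1% of requests, so the service-level p99 is still 10 ms, yet the averaged value of 505 ms would cross almost any latency alert threshold. Mergeable sketches such as T-digest avoid this by combining the underlying distributions instead of the already-computed percentiles.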
Related Concepts
Metrics Collection And Reporting
Performance Optimization Techniques
Data Aggregation Methods