Scaling the collection of self-service metrics

Stephen Bisordi
11 min read · intermediate

Overview

The article discusses LinkedIn's transition to a more efficient self-service metrics collection system called Autometrics, detailing its growth, design challenges, and iterative improvements. It highlights the significant increase in metrics collected and the architectural changes made to support this scaling.

What You'll Learn

1. How to scale metrics collection systems effectively
2. Why using Elasticsearch can improve metrics discoverability
3. How to implement a new API for metrics aggregation

Prerequisites & Requirements

  • Understanding of metrics collection and monitoring systems
  • Familiarity with Elasticsearch and Couchbase (optional)

Key Questions Answered

What were the key improvements made to Autometrics?
Key improvements to Autometrics included the introduction of a new API that replaced the old inGraphs API, allowing for better performance and cross-data center requests. Additionally, a new metrics index using Elasticsearch and Couchbase was implemented to enhance metrics discoverability and reduce the time for new metrics to appear.
How did LinkedIn's metrics collection scale over time?
LinkedIn's metrics collection scaled from over 500,000 metrics collected per minute to around 320,000,000 metrics per minute. This growth was accompanied by an increase in disk usage from roughly 870GB to over 530TB, and the number of metrics-based alerts grew from a few thousand to over 600,000.
What challenges did LinkedIn face with the initial Autometrics design?
The initial design relied on NFS, which became problematic as the number of metrics increased. The result was added complexity and poor performance, particularly for data access and for aggregating metrics across multiple data centers.
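
To make the cross-data-center aggregation point concrete, here is a minimal sketch in Python. It is not the Autometrics API: the function name, the input shape (a mapping from data center name to a list of timestamp/value pairs), and the choice of summation as the aggregate are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_across_datacenters(series_by_dc):
    """Sum one metric's values per timestamp across data centers.

    series_by_dc is a hypothetical input shape: it maps a data center
    name to a list of (timestamp, value) pairs for the same metric.
    """
    totals = defaultdict(float)
    for points in series_by_dc.values():
        for timestamp, value in points:
            # Points that share a timestamp are combined into one total.
            totals[timestamp] += value
    # Return the merged series ordered by timestamp.
    return sorted(totals.items())


merged = aggregate_across_datacenters({
    "dc1": [(60, 1.0), (120, 2.0)],
    "dc2": [(60, 3.0)],
})
print(merged)
```

A timestamp that is missing in one data center simply contributes nothing to that minute's total. A real service would also have to decide how to handle late or partial data, which this sketch ignores.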

Key Statistics & Figures

Metrics collected per minute: 320,000,000 (up from over 500,000)
Disk space used: 530TB (up from roughly 870GB)
Metrics-based alerts: 600,000 (up from a few thousand)
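
The scale of these changes is easier to see as growth factors. The short calculation below is a rough sketch: it treats the quoted figures as exact even though the source gives them as approximate ("over", "roughly"), and it assumes 1TB = 1000GB. The alert count is left out because "a few thousand" gives no usable baseline.

```python
def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


# Metrics collected per minute: from over 500,000 to about 320,000,000.
metrics_growth = growth_factor(500_000, 320_000_000)

# Disk usage: from roughly 870GB to over 530TB (assuming 1TB = 1000GB).
disk_growth = growth_factor(870, 530 * 1000)

print(f"metrics per minute: ~{metrics_growth:.0f}x")  # ~640x
print(f"disk usage: ~{disk_growth:.0f}x")             # ~609x
```

On these figures, both metric volume and disk usage grew by roughly 600 times.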

Key Actionable Insights

1. Implement an iterative design approach for scaling systems to ensure flexibility and adaptability.
   This approach allows for gradual improvements without significant downtime, as seen with Autometrics' ability to redesign components without service interruption.
2. Utilize Elasticsearch for indexing metrics to enhance search capabilities and reduce retrieval times.
   By implementing a metrics index with Elasticsearch, LinkedIn improved the speed at which new metrics became available and facilitated easier discovery across large datasets.
3. Consider replacing outdated data serving methods with modern APIs to improve performance.
   The transition from NFS to a new Autometrics API allowed for better data handling and improved performance, showcasing the importance of modernizing infrastructure.
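
The idea behind a metrics index can be illustrated with a small in-memory inverted index. This is a stand-in written in plain Python, not Elasticsearch and not LinkedIn's implementation: the class name, the tokenization rule, and the sample metric names are all hypothetical.

```python
import re
from collections import defaultdict


def _tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class MetricsIndex:
    """Toy inverted index mapping name tokens to full metric names."""

    def __init__(self):
        self._tokens = defaultdict(set)

    def add(self, metric_name):
        # Register the metric under every token found in its name.
        for token in _tokenize(metric_name):
            self._tokens[token].add(metric_name)

    def search(self, query):
        # Return the metrics whose names contain every query token.
        tokens = _tokenize(query)
        if not tokens:
            return []
        candidates = (self._tokens.get(t, set()) for t in tokens)
        return sorted(set.intersection(*candidates))


index = MetricsIndex()
index.add("frontend.requests.latency_ms")  # hypothetical metric names
index.add("backend.requests.count")
print(index.search("frontend latency"))
```

Because a lookup touches only the token sets for the query terms, search cost does not depend on walking every stored metric, which is the property that makes an index useful for discovery at this scale.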

Common Pitfalls

1. Relying on outdated data serving methods can lead to performance bottlenecks.
   As the number of metrics increased, the reliance on NFS became problematic, highlighting the need for modern APIs and data handling techniques.
2. Failing to implement a robust indexing system can hinder metrics discoverability.
   Without an effective indexing solution, searching for and managing metrics became impractical, demonstrating the importance of using tools like Elasticsearch.

Related Concepts

Metrics Collection Systems
Monitoring And Alerting Frameworks
Data Indexing And Search Technologies