Scaling the collection of self-service metrics

Stephen Bisordi


Elasticsearch, Haystack

Overview

The article discusses LinkedIn's transition to a more efficient self-service metrics collection system called Autometrics, detailing its growth, design challenges, and iterative improvements. It highlights the significant increase in metrics collected and the architectural changes made to support this scaling.

What You'll Learn

1. How to scale metrics collection systems effectively
2. Why using Elasticsearch can improve metrics discoverability
3. How to implement a new API for metrics aggregation

Prerequisites & Requirements

Understanding of metrics collection and monitoring systems
Familiarity with Elasticsearch and Couchbase (optional)

Key Questions Answered

What were the key improvements made to Autometrics?

Key improvements to Autometrics included the introduction of a new API that replaced the old inGraphs API, allowing for better performance and cross-data center requests. Additionally, a new metrics index using Elasticsearch and Couchbase was implemented to enhance metrics discoverability and reduce the time for new metrics to appear.
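The cross-data-center aggregation mentioned above can be pictured with a small sketch. The article summary does not describe the Autometrics API's internals, so everything here is an assumption for illustration: the data center names, the in-memory `DATA_CENTERS` store, and the `fetch_series` and `aggregate_sum` helpers are hypothetical. It only shows the general fan-out-then-merge pattern, where one request goes to each data center and the results are summed per timestamp.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-data-center stores: {metric_name: {timestamp: value}}.
# Real backends would be remote services; plain dicts stand in for them here.
DATA_CENTERS = {
    "dc-a": {"requests": {0: 10.0, 60: 12.0}},
    "dc-b": {"requests": {0: 5.0, 60: 7.0}},
    "dc-c": {"requests": {60: 1.0}},
}


def fetch_series(dc, metric):
    """Return the {timestamp: value} series for one metric in one data center."""
    return DATA_CENTERS[dc].get(metric, {})


def aggregate_sum(metric, dcs):
    """Fan out one request per data center, then sum values per timestamp."""
    # max(1, ...) keeps the pool valid even when the list of data centers is empty.
    with ThreadPoolExecutor(max_workers=max(1, len(dcs))) as pool:
        all_series = list(pool.map(lambda dc: fetch_series(dc, metric), dcs))

    totals = defaultdict(float)
    for series in all_series:
        for ts, value in series.items():
            totals[ts] += value
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    print(aggregate_sum("requests", ["dc-a", "dc-b", "dc-c"]))
```

A data center that has no point for a given timestamp simply contributes nothing to that sum; a production API would need an explicit policy for such gaps.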

How did LinkedIn's metrics collection scale over time?

LinkedIn's metrics collection scaled from over 500,000 metrics collected per minute to around 320,000,000 metrics per minute. This growth was accompanied by an increase in disk usage from roughly 870GB to over 530TB, and the number of metrics-based alerts grew from a few thousand to over 600,000.

What challenges did LinkedIn face with the initial Autometrics design?

The initial design faced challenges such as reliance on NFS, which became problematic as the number of metrics increased. This led to complexity and performance issues, particularly in data access and the ability to aggregate metrics across multiple data centers.

Key Statistics & Figures

Metrics collected per minute: around 320,000,000 (up from over 500,000)

Disk space used: over 530TB (up from roughly 870GB)

Metrics-based alerts: over 600,000 (up from a few thousand)
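To put these figures in proportion, the growth factors can be worked out directly. This is a rough calculation only: the source values are approximate ("over", "roughly", "around"), the TB-to-GB conversion assumes decimal units (1 TB = 1000 GB), and the alert count is left out because "a few thousand" gives no precise baseline.

```python
def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


# Figures quoted in the article; both endpoints are approximate.
METRICS_PER_MIN_BEFORE = 500_000
METRICS_PER_MIN_AFTER = 320_000_000

DISK_BEFORE_GB = 870
DISK_AFTER_GB = 530 * 1000  # 530 TB, assuming decimal units

metrics_growth = growth_factor(METRICS_PER_MIN_BEFORE, METRICS_PER_MIN_AFTER)
disk_growth = growth_factor(DISK_BEFORE_GB, DISK_AFTER_GB)

if __name__ == "__main__":
    print(f"metrics per minute grew about {metrics_growth:.0f}x")
    print(f"disk usage grew about {disk_growth:.0f}x")
```

Both quantities grew by a factor of roughly 600 to 640, so disk usage rose approximately in step with the metric rate.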

Technologies & Tools


Elasticsearch (search engine): Used for creating a metrics index to enhance discoverability and search capabilities.

Couchbase (database): Handles ingestion of metrics and populates the Elasticsearch index.

Autometrics API (API): Replaced the inGraphs API to improve metrics aggregation and performance.

Key Actionable Insights

1. Implement an iterative design approach for scaling systems to ensure flexibility and adaptability. This approach allows for gradual improvements without significant downtime, as seen with Autometrics' ability to redesign components without service interruption.

2. Utilize Elasticsearch for indexing metrics to enhance search capabilities and reduce retrieval times. By implementing a metrics index with Elasticsearch, LinkedIn improved the speed at which new metrics became available and facilitated easier discovery across large datasets.

3. Consider replacing outdated data serving methods with modern APIs to improve performance. The transition from NFS to a new Autometrics API allowed for better data handling and improved performance, showcasing the importance of modernizing infrastructure.
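The indexing idea in the second insight can be illustrated without a running search cluster. The sketch below is not Elasticsearch and not LinkedIn's implementation; it is a toy in-memory inverted index, with a hypothetical `MetricIndex` class and made-up metric names, that shows why tokenizing metric names makes them discoverable by keyword instead of by scanning every name.

```python
import re
from collections import defaultdict


class MetricIndex:
    """Toy inverted index: token -> set of metric names containing that token."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase, then split on any run of non-alphanumeric characters,
        # so "web.login.latency_ms" yields web, login, latency, ms.
        return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

    def add(self, name):
        """Index a metric name under each of its tokens."""
        for token in self._tokens(name):
            self._postings[token].add(name)

    def search(self, query):
        """Return the sorted metric names that contain every query token."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        candidates = [self._postings.get(t, set()) for t in tokens]
        return sorted(set.intersection(*candidates))


if __name__ == "__main__":
    index = MetricIndex()
    # Hypothetical metric names, for illustration only.
    for metric in ("web.login.latency_ms", "web.login.errors", "db.query.latency_ms"):
        index.add(metric)
    print(index.search("login latency"))
```

A lookup touches only the posting sets for the query tokens, which is the property that keeps search practical as the number of metric names grows.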

Common Pitfalls

1. Relying on outdated data serving methods can lead to performance bottlenecks. As the number of metrics increased, the reliance on NFS became problematic, highlighting the need for modern APIs and data handling techniques.

2. Failing to implement a robust indexing system can hinder metrics discoverability. Without an effective indexing solution, searching for and managing metrics became impractical, demonstrating the importance of using tools like Elasticsearch.

Related Concepts

Metrics Collection Systems

Monitoring And Alerting Frameworks

Data Indexing And Search Technologies