Overview
The article introduces Atlas, Netflix's primary telemetry platform designed for time-series data monitoring and analysis. It discusses the challenges faced with previous systems, the goals for Atlas, and the architectural decisions made to support scalability, resilience, and performance.
What You'll Learn
1. How to implement a scalable telemetry platform for time-series data
2. Why dimensionality is crucial for effective data monitoring
3. How to use a stack language for embedding and linking telemetry data
4. When to prioritize resilience over data completeness in monitoring systems
Prerequisites & Requirements
- Understanding of time-series data and telemetry concepts
- Familiarity with monitoring tools and data visualization (optional)
Key Questions Answered
What challenges did Netflix face with their previous telemetry system?
Netflix's previous telemetry system struggled to scale to two million distinct time series. That capacity became inadequate as global expansion and growing platform diversity pushed the monitoring requirement toward 20 million metrics, which prompted the development of Atlas to handle the growing volume of data.
How does Atlas handle dimensionality in time-series metrics?
Atlas allows metrics to be defined as arbitrary unique sets of key-value pairs, significantly improving dimensionality over previous systems. This flexibility enables users to specify keys relevant to their use cases, supporting essentially unlimited unique values for any key.
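As a rough illustration of this model (a minimal sketch, not Atlas's actual storage format), the Python below identifies each time series by its full set of key-value tags and sums every series that matches a partial tag filter. The tag keys (`name`, `app`, `region`), the sample values, and the helper names `series_id` and `sum_matching` are hypothetical, chosen only for this example.

```python
from typing import Dict, FrozenSet, Tuple

Tags = Dict[str, str]
SeriesId = FrozenSet[Tuple[str, str]]


def series_id(tags: Tags) -> SeriesId:
    """Identify a series by its set of key-value pairs rather than a name string."""
    return frozenset(tags.items())


def sum_matching(store: Dict[SeriesId, float], query: Tags) -> float:
    """Sum the value of every series whose tags include all pairs in `query`."""
    wanted = frozenset(query.items())
    return sum(value for sid, value in store.items() if wanted <= sid)


# Hypothetical sample data: one current value per series.
store = {
    series_id({"name": "requests", "app": "api", "region": "us-east-1"}): 120.0,
    series_id({"name": "requests", "app": "api", "region": "eu-west-1"}): 80.0,
    series_id({"name": "requests", "app": "web", "region": "us-east-1"}): 40.0,
}

# Any combination of keys can be used to slice the data.
total_api = sum_matching(store, {"name": "requests", "app": "api"})
total_us = sum_matching(store, {"name": "requests", "region": "us-east-1"})
```

Because the identity is a set, the order in which tags are supplied does not matter, and a new key can be added to some series without renaming the others.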
What is the purpose of the query layer in Atlas?
The query layer in Atlas provides a common API that allows for flexibility in backend implementations and enables merged views across different data sources. It facilitates efficient query and aggregation operations across regional and global deployments.
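The idea of a merged view can be sketched as follows. This is an assumption-laden toy, not the Atlas API: the `Backend` type, the `merged_query` helper, the two regional functions, and the placeholder query string are all hypothetical. It shows one common interface being sent to several backends and the per-timestamp results being summed into a global view.

```python
from typing import Callable, Dict, List

# A backend is anything that answers a query with {timestamp: value}.
# This alias and the sample backends below are hypothetical.
Backend = Callable[[str], Dict[int, float]]


def merged_query(backends: List[Backend], query: str) -> Dict[int, float]:
    """Send the same query to every backend and sum the results per timestamp."""
    merged: Dict[int, float] = {}
    for backend in backends:
        for timestamp, value in backend(query).items():
            merged[timestamp] = merged.get(timestamp, 0.0) + value
    return merged


def us_east(query: str) -> Dict[int, float]:
    # Stand-in for a regional deployment; returns canned data.
    return {60: 10.0, 120: 12.0}


def eu_west(query: str) -> Dict[int, float]:
    # Stand-in for a second region with one extra data point.
    return {60: 5.0, 120: 7.0, 180: 3.0}


global_view = merged_query([us_east, eu_west], "requests-per-second")
```

Summing per timestamp is only correct for sum-style aggregations; merging an average or a maximum across backends would need a different combining step.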
What engineering practices are emphasized for performance in Atlas?
Atlas is designed to support high-performance queries over dimensional time-series data, which often require large aggregations. The system is engineered to handle billions of data points per minute while keeping query response times low.
Key Statistics & Figures
Distinct time series supported by Atlas
1.2 billion
During failover exercises, Atlas sustained greater than 1.2 billion time series, showcasing its scalability.
Metrics volume growth
20 million metrics
Netflix needed to scale their telemetry system to handle up to 20 million metrics due to global expansion and increased platform diversity.
Data points published per minute
billions
Atlas is capable of publishing billions of data points per minute, demonstrating its high throughput capabilities.
Technologies & Tools
Backend
Atlas
Primary telemetry platform for monitoring time-series data at Netflix.
Cloud Infrastructure
AWS
Used for hosting and scaling Atlas deployments across different regions.
Data Processing
Hadoop
Utilized for processing historical data and generating reports.
Key Actionable Insights
1. Implement a flexible dimensionality model in your telemetry systems to enhance data granularity. By allowing metrics to be defined with key-value pairs, you can capture more meaningful data and improve analysis capabilities, similar to how Atlas allows for arbitrary dimensions.
2. Prioritize resilience in your monitoring systems to ensure operational insight during failures. Prioritizing restoration of service over data completeness can lead to better operational outcomes, as demonstrated by Atlas's design principles.
3. Use a stack language to express telemetry queries in your applications. Because a query is a compact string, visualizations are easy to embed, share, and link, which makes it simpler for teams to collaborate and analyze data effectively.
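To make the stack-language idea concrete, here is a toy postfix evaluator in Python. The comma-separated token style follows the way Atlas expressions are written, but the evaluator itself, the three operators it supports, the sample data, the `atlas.example.com` host, and the URL path are assumptions for illustration, not Atlas's implementation.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

Tags = Dict[str, str]


def evaluate(expr: str, series: List[Tuple[Tags, float]]) -> float:
    """Evaluate a comma-separated postfix expression against tagged values."""
    stack: list = []
    for token in expr.split(","):
        if token == ":eq":
            # Pop a value and a key, push a predicate over tags.
            value = stack.pop()
            key = stack.pop()
            stack.append(lambda tags, k=key, v=value: tags.get(k) == v)
        elif token == ":and":
            # Pop two predicates, push their conjunction.
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lambda tags, a=lhs, b=rhs: a(tags) and b(tags))
        elif token == ":sum":
            # Pop a predicate, push the sum of all matching values.
            predicate = stack.pop()
            stack.append(sum(v for tags, v in series if predicate(tags)))
        else:
            # Anything that is not an operator is pushed as a literal.
            stack.append(token)
    return stack.pop()


# Hypothetical sample data.
series = [
    ({"name": "requests", "app": "api"}, 120.0),
    ({"name": "requests", "app": "api"}, 80.0),
    ({"name": "requests", "app": "web"}, 40.0),
]

expr = "name,requests,:eq,app,api,:eq,:and,:sum"
total = evaluate(expr, series)

# The whole query fits in a URL parameter, so a graph can be linked or embedded.
link = "https://atlas.example.com/api/v1/graph?" + urlencode({"q": expr}, safe=",:")
```

The design point is that the expression needs no brackets or quoting, so it survives being pasted into a link, a dashboard definition, or a chat message unchanged.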
Common Pitfalls
1. Overcomplicating metric names can lead to confusion and inefficiency in data querying.
When metrics are defined with overly complex names, users may struggle to extract meaningful insights, resorting to complex regular expressions. Simplifying metric definitions can enhance usability and data accessibility.
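The contrast can be shown with a small, hypothetical example. The dotted naming scheme, the tag keys, and the values below are invented for illustration and are not taken from the article. The first query has to know the positional layout of the name and express it as a regular expression; the second asks the same question with plain equality checks on tags.

```python
import re

# Hypothetical flattened names that encode dimensions by position.
flat = {
    "prod.api.us-east-1.requests": 120.0,
    "prod.api.eu-west-1.requests": 80.0,
    "prod.web.us-east-1.requests": 40.0,
}

# "All api requests" requires a regex that mirrors the name layout.
pattern = re.compile(r"^prod\.api\.[^.]+\.requests$")
api_total_regex = sum(v for name, v in flat.items() if pattern.match(name))

# The same data with explicit tags.
tagged = [
    ({"env": "prod", "app": "api", "region": "us-east-1", "name": "requests"}, 120.0),
    ({"env": "prod", "app": "api", "region": "eu-west-1", "name": "requests"}, 80.0),
    ({"env": "prod", "app": "web", "region": "us-east-1", "name": "requests"}, 40.0),
]

# The query is a readable equality check and does not depend on key order.
api_total_tags = sum(
    v for tags, v in tagged
    if tags["app"] == "api" and tags["name"] == "requests"
)
```

Both queries return the same total here, but the regex version breaks as soon as a new segment is added to the name, while the tagged version is unaffected by extra keys.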
Related Concepts
Time-series Data Monitoring
Telemetry Systems Design
Scalability In Data Processing
Resilience Engineering