The Billion Data Point Challenge: Building a Query Engine for High Cardinality Time Series Data

Benjamin Raskin, Nikunj Aggarwal

Uber

•

Benjamin Raskin, Nikunj Aggarwal

•15 min read•advanced•

--

•View Original

GrafanaPrometheusSQL

Overview

The article discusses Uber's development of an in-house query engine for high cardinality time series data, addressing challenges faced in managing and querying vast amounts of metrics data. It highlights the architecture, performance metrics, and design choices made to optimize the system for scalability and efficiency.

What You'll Learn

1

How to design a scalable query engine for high cardinality time series data

2

Why memory utilization is critical in query performance

3

How to implement downsampling to improve query performance

4

When to use lazy evaluation in query execution

Prerequisites & Requirements

Understanding of time series databases and query languages
Familiarity with Grafana and Prometheus(optional)

Key Questions Answered

What challenges did Uber face in building a query engine for time series data?

Uber faced challenges related to scalability, memory utilization, and query performance as the volume of metrics data increased. They needed a system that could handle high cardinality and support multiple query languages while ensuring efficient resource management.

How does the M3 query engine architecture work?

The M3 query engine architecture consists of three phases: parsing, execution, and data retrieval. It utilizes a directed acyclic graph (DAG) format for query representation, allowing for flexible execution and support for multiple query languages.

What is the purpose of downsampling in the M3 query engine?

Downsampling is implemented to reduce the number of data points retrieved during queries, improving performance and preventing UI tools like Grafana from lagging. It helps maintain the integrity of the data while optimizing resource usage.

How does M3QL differ from traditional query languages?

M3QL is a tag-based, pipe-based query language designed for better discoverability and ease of use compared to traditional path-based languages like Graphite. It simplifies querying by allowing users to specify metrics using tags rather than exact paths.

Key Statistics & Figures

Queries handled per second

2,500 queries per second

This reflects the capacity of the M3 metrics query engine as of November 2018.

Data points returned per second

8.5 billion data points per second

This indicates the volume of data processed by the M3 query engine.

Network traffic handled

35 Gbps

This shows the throughput capability of the M3 metrics query engine.

Initial memory limit for a single query

3.5 GB

This limit was set to manage resource utilization effectively.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

M3

An open source metrics platform built by Uber to handle high cardinality time series data.

Frontend

Grafana

A visualization tool used for monitoring and displaying metrics data.

Backend

Prometheus

An open source monitoring system that integrates with the M3 query engine.

Key Actionable Insights

1
Implementing a lazy evaluation strategy can significantly reduce memory usage during query execution.
By delaying the allocation of intermediary representations until absolutely necessary, you can optimize resource consumption and improve overall query performance.

2
Utilizing downsampling techniques can enhance the user experience in data visualization tools.
By reducing the number of data points returned in queries, you can prevent UI tools from becoming unresponsive, especially when dealing with large datasets.

3
Adopting a DAG representation for queries allows for greater flexibility in supporting multiple query languages.
This approach decouples language parsing from execution, making it easier to integrate new query languages as needed.

Common Pitfalls

1

Overloading the query engine with large queries can lead to memory exhaustion.

This happens when multiple large queries are executed simultaneously, causing the system to run out of memory. Implementing limits on query sizes and optimizing memory usage can help mitigate this issue.

2

Failing to cancel long-running queries can compound performance issues.

When users refresh dashboards, they may inadvertently trigger additional queries that pile up. Implementing a notifier to cancel unnecessary queries can alleviate this problem.

Related Concepts

Time Series Databases

Query Optimization Techniques

Metrics Monitoring And Visualization