Improving Efficiency Of Goku Time Series Database at Pinterest (Part — 1)

Pinterest Engineering

•

Pinterest Engineering

•16 min read•intermediate•

--

•View Original

AWS

Overview

This article discusses the improvements made to the Goku time series database at Pinterest, focusing on enhancing efficiency and user experience. Key changes include a shift from a push-based to a pull-based ingestion model, which significantly reduced recovery times and improved query handling.

What You'll Learn

1

How to implement a pull-based ingestion model for time series data

2

Why reducing recovery times is crucial for database performance

3

When to apply shard-aware routing for efficient query handling

Prerequisites & Requirements

Understanding of time series databases and ingestion models
Familiarity with Kafka and RocksDB(optional)

Key Questions Answered

How did Pinterest reduce the recovery time of GokuS clusters?

Pinterest reduced the recovery time of GokuS clusters from 90-120 minutes to under 40 minutes by switching from EFS to local disk for persistent data and implementing a pull-based ingestion model. This change minimized the need for additional computation during recovery, allowing for faster data access and processing.

What are the main components of the Goku time series database?

The Goku time series database consists of several components: Goku Short Term (GokuS) for in-memory storage, Goku Long Term (GokuL) for SSD and HDD storage, Goku Compactor for data aggregation, and Goku Root for smart query routing. Each component plays a critical role in managing time series data efficiently.

What challenges did Pinterest face with the initial GokuS architecture?

Pinterest faced challenges such as longer recovery times during host replacements and deployments, as well as a single point of failure due to health inference at the cluster level. These issues were primarily caused by the architecture's reliance on EFS and the push-based ingestion model.

What improvements were made to the Goku ingestion model?

The ingestion model was changed from a push-based system to a pull-based shard-aware model. This allowed GokuS to directly pull data from Kafka partitions, reducing recovery times and improving the efficiency of data ingestion and query handling.

Key Statistics & Figures

Recovery time reduction

from 90-120 minutes to under 40 minutes

This improvement was achieved by changing the storage method and ingestion model.

Throughput limit of EFS

1024 Megabytes per second

This limit affected recovery times when multiple hosts were recovering simultaneously.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Messaging

Kafka

Used for handling data points during the ingestion process.

Database

Rocksdb

Utilized for time series data storage in GokuL.

Storage

AWS Efs

Initially used for persistent storage before transitioning to local disk.

Key Actionable Insights

1
Transitioning to a pull-based ingestion model can significantly enhance data recovery times.
This approach minimizes the computational overhead during recovery, allowing for faster access to time series data and improving overall system performance.

2
Implementing shard-aware routing can optimize query handling in distributed databases.
By routing queries based on shard health, systems can reduce unnecessary load on replicas, leading to improved response times and resource utilization.

3
Utilizing local disk storage instead of network-based solutions like EFS can enhance performance.
Local storage reduces latency and increases throughput during data recovery, which is crucial for maintaining high availability in time-sensitive applications.

Common Pitfalls

1

Relying on a single storage solution like EFS can create bottlenecks during recovery.

This can lead to increased recovery times, especially under high load conditions. Transitioning to local storage can mitigate these issues.

2

Not implementing shard-aware routing can lead to inefficient query handling.

Without this, queries may unnecessarily overload certain replicas, causing delays and performance degradation.

Related Concepts

Time Series Databases

Data Ingestion Models

Distributed Systems