Open-sourcing Terrapin: A serving system for batch generated data

Pinterest Engineering
7 min readadvanced
--
View Original

Overview

The article discusses the open-sourcing of Terrapin, a serving system designed to handle large data sets generated by Hadoop jobs. It highlights the challenges faced with Apache HBase and how Terrapin provides low latency access, data locality, and improved performance for various online applications at Pinterest.

What You'll Learn

1

How to implement Terrapin for serving batch generated data

2

Why data locality is crucial for performance in distributed systems

3

When to use HDFS over HBase for large data sets

Prerequisites & Requirements

  • Understanding of Hadoop and MapReduce concepts
  • Familiarity with HDFS and HBase(optional)

Key Questions Answered

What are the main advantages of using Terrapin over HBase?
Terrapin provides low latency random key-value access, improved data locality, and better performance for large data sets compared to HBase, which struggles with larger data sizes and incurs high costs for writes. Terrapin's architecture allows it to serve immutable data sets efficiently, leveraging HDFS's strengths.
How does Terrapin ensure data locality?
Terrapin achieves data locality by deploying serving shards on the same nodes that store the HDFS blocks. This design choice minimizes latency and improves performance, as data is accessed locally rather than being distributed across the cluster.
What features does Terrapin offer for managing data?
Terrapin supports features such as filesets for namespacing data, live swaps for seamless updates, and the ability to ingest data from S3, HDFS, or directly from Hadoop jobs. It also allows for easy adjustment of output shards and supports multiple versions for rollback capabilities.
What is the performance of Terrapin in production?
Terrapin has been running in production at Pinterest, serving 1.5 million queries per second (QPS) with a server-side 99th percentile latency of less than 5 milliseconds. It currently stores approximately 180 terabytes of data across about 100 filesets.

Key Statistics & Figures

Queries per second (QPS)
1.5 million
This metric reflects the performance of Terrapin in production at Pinterest.
Server-side 99th percentile latency
< 5 ms
Indicates the responsiveness of Terrapin under load.
Data stored
180 TB
The total amount of data managed by Terrapin across its deployment.
Number of filesets
100
Represents the organization of data within the Terrapin system.

Technologies & Tools

Storage
Hdfs
Used as the underlying storage system for Terrapin, providing elasticity and fault tolerance.
Database
Hbase
Previously used for serving batch-generated data before transitioning to Terrapin.
Coordination
Zookeeper
Used for managing the cluster state and coordination between Terrapin servers.
Coordination
Apache Helix
Utilized for ZooKeeper-based cluster coordination.

Key Actionable Insights

1
Implementing Terrapin can significantly reduce latency in serving batch-generated data, making it suitable for high-performance applications.
This is particularly beneficial for companies dealing with large data sets and requiring quick access times, as Terrapin's architecture is optimized for low-latency operations.
2
Utilizing HDFS as the underlying storage for Terrapin enhances elasticity and fault tolerance.
This choice allows for seamless integration with existing Hadoop workflows, making it easier to manage large-scale data processing and serving.
3
Terrapin's support for live swaps and versioning can help maintain service continuity during data updates.
This feature is crucial for applications that cannot afford downtime, allowing for quick rollbacks in case of issues with new data loads.

Common Pitfalls

1
Failing to ensure data locality can lead to increased latency and performance issues.
This often occurs in distributed systems where data is not stored close to where it is processed, leading to longer access times and inefficiencies.
2
Overcomplicating the architecture by not leveraging existing tools like HDFS and ZooKeeper.
Using overly complex solutions can hinder scalability and maintainability, especially when simpler, well-integrated options are available.

Related Concepts

Hadoop And Mapreduce Workflows
Data Locality In Distributed Systems
Batch Processing Vs. Real-time Processing