Metacat: Making Big Data Discoverable and Meaningful at Netflix

Netflix Technology Blog
8 min read · intermediate

Overview

The article discusses Metacat, a metadata service developed by Netflix to enhance the discoverability and management of big data across various data sources. It outlines the architecture, objectives, and functionalities of Metacat, emphasizing its role in providing a unified interface for metadata access and facilitating data interoperability.

What You'll Learn

1. How to implement a federated metadata access layer for big data
2. Why data abstraction is crucial for interoperability among different data processing engines
3. How to utilize Elasticsearch for full-text search in data discovery
4. When to use business and user-defined metadata for enhanced data management

Prerequisites & Requirements

  • Understanding of big data concepts and metadata management
  • Familiarity with data processing engines like Spark, Hive, and Presto (optional)

Key Questions Answered

What is Metacat and how does it function?
Metacat is a federated service that provides a unified REST/Thrift interface for accessing metadata from various data stores. Schema metadata stays in the source systems; Metacat itself stores only business and user-defined metadata. This arrangement supports data discovery and interoperability across different processing engines.
How does Metacat enhance data discovery at Netflix?
Metacat publishes schema and business/user-defined metadata to Elasticsearch, allowing for full-text search capabilities. This enables users to easily browse datasets, utilize auto-suggest features in SQL editors, and categorize data effectively using tags.
What optimizations have been made to the Hive metastore in Metacat?
Metacat's Hive connector talks directly to the backend RDS when reading and writing partitions. This avoids the timeouts that occurred with the original Hive metastore APIs under high load.
What are the main objectives of Metacat?
Metacat aims to provide federated views of metadata systems, a unified API for dataset metadata, and storage for arbitrary business and user metadata. These objectives help streamline data management and enhance accessibility across various data sources.
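
The federated design described in these answers can be illustrated with a minimal sketch. It is not Metacat's actual code or API: the class names (`InMemoryConnector`, `MetadataService`), the method names, and the `catalog/database/table` key format are all assumptions made for illustration. The point it shows is the split of responsibilities: schema is always fetched from the source system through a connector, while business metadata lives in the service itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TableInfo:
    name: str                      # hypothetical key format: catalog/database/table
    columns: dict[str, str]        # column name -> type, owned by the source system
    business_metadata: dict[str, str] = field(default_factory=dict)


class Connector(Protocol):
    """What the service needs from each backing store (assumed interface)."""

    def get_columns(self, database: str, table: str) -> dict[str, str]: ...


class InMemoryConnector:
    """Stand-in for a real connector to Hive, RDS, and so on."""

    def __init__(self, tables: dict[tuple[str, str], dict[str, str]]):
        self._tables = tables

    def get_columns(self, database: str, table: str) -> dict[str, str]:
        return dict(self._tables[(database, table)])


class MetadataService:
    """Routes schema lookups to connectors; stores only business metadata."""

    def __init__(self, connectors: dict[str, Connector]):
        self._connectors = connectors
        self._business: dict[str, dict[str, str]] = {}

    def set_business_metadata(self, catalog: str, database: str, table: str,
                              metadata: dict[str, str]) -> None:
        key = f"{catalog}/{database}/{table}"
        self._business.setdefault(key, {}).update(metadata)

    def get_table(self, catalog: str, database: str, table: str) -> TableInfo:
        # Schema is read from the source on every call, never cached here.
        columns = self._connectors[catalog].get_columns(database, table)
        key = f"{catalog}/{database}/{table}"
        return TableInfo(key, columns, dict(self._business.get(key, {})))
```

A caller would register one connector per catalog, for example `MetadataService({"hive": InMemoryConnector({("prod", "plays"): {"title_id": "bigint"}})})`, and then use the same `get_table` call regardless of which store holds the table.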

Key Actionable Insights

1. Implement a unified metadata access layer to streamline data operations across multiple data sources.
This approach reduces complexity and improves interoperability: different processing engines can access and use the same datasets without compatibility issues.
2. Index metadata in Elasticsearch to improve data discovery and search.
With metadata indexed in Elasticsearch, teams get full-text search and auto-suggest features, which makes datasets easier to find and use.
3. Adopt a push notification system for data change events to make data pipelines more responsive.
Dependent jobs can then react to data updates as they happen, which improves the efficiency of ETL processes and overall data management.
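
The third insight can be sketched as a small in-process publish/subscribe mechanism. This is an illustration only: the `TableUpdated` event, the `EventBus` class, and its methods are hypothetical names, and a production system would publish through a real messaging service rather than call handlers directly.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TableUpdated:
    """Hypothetical change event: a partition of a table was written."""
    table: str
    partition: str


Handler = Callable[[TableUpdated], None]


class EventBus:
    """Pushes each event to the handlers registered for that table."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, table: str, handler: Handler) -> None:
        self._subscribers[table].append(handler)

    def publish(self, event: TableUpdated) -> int:
        # .get avoids creating empty entries for tables nobody subscribed to.
        handlers = self._subscribers.get(event.table, [])
        for handler in handlers:
            handler(event)
        return len(handlers)  # number of handlers notified
```

A downstream ETL job would call `bus.subscribe("prod.plays", run_job)` once, and `run_job` would then be invoked whenever the metadata layer publishes a `TableUpdated` event for that table.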

Common Pitfalls

1. Over-reliance on the Hive metastore APIs can lead to performance issues under high load.
This often results in timeouts and slow operations. Instead, directly interfacing with the backend database for partition management can significantly improve performance.
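
As a rough sketch of what "directly interfacing with the backend database" can look like, the example below batches partition writes into one transaction and pages partition reads with a single query. It uses an in-memory SQLite database as a stand-in for the real backend, and the schema (`tbls`, `partitions`, and their columns) is a heavily simplified assumption, not the actual metastore schema or Metacat's queries.

```python
import sqlite3

# Hypothetical, simplified schema standing in for the metastore's backing
# database; these table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE tbls (
    tbl_id   INTEGER PRIMARY KEY,
    db_name  TEXT NOT NULL,
    tbl_name TEXT NOT NULL,
    UNIQUE (db_name, tbl_name)
);
CREATE TABLE partitions (
    part_id   INTEGER PRIMARY KEY,
    tbl_id    INTEGER NOT NULL REFERENCES tbls (tbl_id),
    part_name TEXT NOT NULL,
    location  TEXT NOT NULL,
    UNIQUE (tbl_id, part_name)
);
"""


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def create_table(conn: sqlite3.Connection, db_name: str, tbl_name: str) -> int:
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO tbls (db_name, tbl_name) VALUES (?, ?)",
            (db_name, tbl_name),
        )
    return cur.lastrowid


def add_partitions(conn, tbl_id, partitions):
    """Write many partitions in one transaction rather than one call each."""
    with conn:
        conn.executemany(
            "INSERT INTO partitions (tbl_id, part_name, location) VALUES (?, ?, ?)",
            [(tbl_id, name, location) for name, location in partitions],
        )


def list_partitions(conn, tbl_id, limit=100, offset=0):
    """Read one page of partitions, ordered by name, with a single query."""
    rows = conn.execute(
        "SELECT part_name, location FROM partitions "
        "WHERE tbl_id = ? ORDER BY part_name LIMIT ? OFFSET ?",
        (tbl_id, limit, offset),
    )
    return rows.fetchall()
```

The trade-off is tighter coupling to the database schema in exchange for fewer round trips, so a design like this is usually kept behind the connector so callers never see the SQL.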

Related Concepts

Metadata Management
Data Interoperability
Big Data Architecture
Data Discovery Techniques