Metacat: Making Big Data Discoverable and Meaningful at Netflix

Netflix Technology Blog

Apache, Elasticsearch, SQL, Thrift

Overview

The article discusses Metacat, a metadata service developed by Netflix to enhance the discoverability and management of big data across various data sources. It outlines the architecture, objectives, and functionalities of Metacat, emphasizing its role in providing a unified interface for metadata access and facilitating data interoperability.

What You'll Learn

1. How to implement a federated metadata access layer for big data
2. Why data abstraction is crucial for interoperability among different data processing engines
3. How to utilize Elasticsearch for full-text search in data discovery
4. When to use business and user-defined metadata for enhanced data management

Prerequisites & Requirements

Understanding of big data concepts and metadata management
Familiarity with data processing engines like Spark, Hive, and Presto (optional)

Key Questions Answered

What is Metacat and how does it function?

Metacat is a federated service that provides a unified REST/Thrift interface for accessing metadata from various data stores. The source systems remain the source of truth for schema metadata, so Metacat does not store it; it stores only business and user-defined metadata. This supports data discovery and interoperability across different processing engines.
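The federated design can be sketched in a few lines. This is a minimal illustration, not Metacat's actual API: `Connector`, `InMemoryConnector`, `FederatedCatalog`, and the `catalog/database/table` naming scheme are all hypothetical names chosen for the example. It shows the key idea that schema lookups are delegated to the owning source while only the business and user-defined metadata is kept in the service itself.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Connector(Protocol):
    """Hypothetical adapter over one metadata source (for example Hive)."""

    def get_schema(self, database: str, table: str) -> dict[str, str]: ...


@dataclass
class InMemoryConnector:
    """Stand-in for a real source system; holds schemas in a dict."""

    schemas: dict[tuple[str, str], dict[str, str]]

    def get_schema(self, database: str, table: str) -> dict[str, str]:
        return self.schemas[(database, table)]


@dataclass
class FederatedCatalog:
    """Routes schema lookups to the owning source; stores only extra metadata."""

    connectors: dict[str, Connector]
    user_metadata: dict[str, dict[str, str]] = field(default_factory=dict)

    def tag(self, qualified_name: str, **metadata: str) -> None:
        # Business and user-defined metadata is the only state kept here.
        self.user_metadata.setdefault(qualified_name, {}).update(metadata)

    def get_table(self, qualified_name: str) -> dict:
        catalog, database, table = qualified_name.split("/")
        # Schema is fetched on demand from the source of truth.
        schema = self.connectors[catalog].get_schema(database, table)
        return {
            "name": qualified_name,
            "schema": schema,
            "user_metadata": self.user_metadata.get(qualified_name, {}),
        }


hive = InMemoryConnector({("prod", "plays"): {"title_id": "bigint", "ts": "timestamp"}})
catalog = FederatedCatalog(connectors={"hive": hive})
catalog.tag("hive/prod/plays", owner="data-eng", retention="90d")
table_info = catalog.get_table("hive/prod/plays")
print(table_info)
```

Adding another data store in this sketch means registering one more connector; callers keep using the same `get_table` call.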

How does Metacat enhance data discovery at Netflix?

Metacat publishes schema and business/user-defined metadata to Elasticsearch, allowing for full-text search capabilities. This enables users to easily browse datasets, utilize auto-suggest features in SQL editors, and categorize data effectively using tags.
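To make the indexing step concrete, the sketch below builds the kind of document that could be published for one table, plus a full-text query body in Elasticsearch's query DSL. The field names (`name`, `columns`, `tags`, `description`) and both helper functions are assumptions for illustration; the summary does not describe Metacat's real index mapping. Only plain dictionaries are built here, so no cluster is needed to run it.

```python
import json


def build_search_document(name, schema, user_metadata):
    """Flatten one table's metadata into a single indexable document."""
    return {
        "name": name,
        "columns": sorted(schema),  # column names only
        "tags": sorted(user_metadata.get("tags", [])),
        "description": user_metadata.get("description", ""),
    }


def build_full_text_query(text):
    """Query DSL body that searches several text fields at once."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["name", "columns", "tags", "description"],
            }
        }
    }


doc = build_search_document(
    "hive/prod/plays",
    {"title_id": "bigint", "ts": "timestamp"},
    {"tags": ["playback", "core"], "description": "One row per play event"},
)
query = build_full_text_query("playback")
print(json.dumps(doc, indent=2))
print(json.dumps(query, indent=2))
```

Keeping schema fields and tags in the same document is what lets one search box match on a column name, a tag, or a description.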

What optimizations have been made to the Hive metastore in Metacat?

Metacat has improved the Hive connector to interact directly with the backend RDS for reading and writing partitions, eliminating timeout issues that occurred with the original Hive metastore APIs under high load.
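The idea of talking to the metastore's backing database directly can be sketched as follows. This is an illustration under stated assumptions, not Metacat's connector code: an in-memory SQLite database stands in for RDS, the three tables are a heavily simplified version of a Hive metastore schema, and `add_partitions` and `list_partitions` are hypothetical helpers.

```python
import sqlite3

# In-memory SQLite stands in for the RDS instance behind the metastore.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT);
    CREATE TABLE PARTITIONS (
        PART_ID INTEGER PRIMARY KEY, TBL_ID INTEGER, PART_NAME TEXT
    );
    INSERT INTO DBS VALUES (1, 'prod');
    INSERT INTO TBLS VALUES (10, 1, 'plays');
    """
)


def add_partitions(conn, database, table, part_names):
    """Write many partitions with one batched statement."""
    (tbl_id,) = conn.execute(
        "SELECT t.TBL_ID FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID "
        "WHERE d.NAME = ? AND t.TBL_NAME = ?",
        (database, table),
    ).fetchone()
    conn.executemany(
        "INSERT INTO PARTITIONS (TBL_ID, PART_NAME) VALUES (?, ?)",
        [(tbl_id, name) for name in part_names],
    )
    conn.commit()


def list_partitions(conn, database, table, prefix=""):
    """Read partition names with a single join, filtered by name prefix."""
    rows = conn.execute(
        "SELECT p.PART_NAME FROM PARTITIONS p "
        "JOIN TBLS t ON p.TBL_ID = t.TBL_ID "
        "JOIN DBS d ON t.DB_ID = d.DB_ID "
        "WHERE d.NAME = ? AND t.TBL_NAME = ? AND p.PART_NAME LIKE ? "
        "ORDER BY p.PART_NAME",
        (database, table, prefix + "%"),
    ).fetchall()
    return [row[0] for row in rows]


add_partitions(
    conn, "prod", "plays",
    ["dateint=20180601", "dateint=20180602", "dateint=20180701"],
)
june = list_partitions(conn, "prod", "plays", prefix="dateint=201806")
print(june)
```

The trade-off of this approach is that the connector becomes coupled to the metastore's table layout, which the metastore APIs would otherwise hide.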

What are the main objectives of Metacat?

Metacat aims to provide federated views of metadata systems, a unified API for dataset metadata, and storage for arbitrary business and user metadata. These objectives help streamline data management and enhance accessibility across various data sources.

Technologies & Tools

Search Engine

Elasticsearch

Used for full-text search capabilities and metadata indexing in Metacat.

Storage

Amazon S3

Primary data storage solution for Netflix's data warehouse.

Data Processing

Hive

Used for ad-hoc querying and for ETL processes.

Data Processing

Spark

One of the compute engines supported by Metacat for processing datasets.

Data Processing

Presto

Another compute engine that integrates with Metacat for data access.

Database

RDS

Backend database for the Hive metastore.

Key Actionable Insights

1. Implement a unified metadata access layer to streamline data operations across multiple data sources.
This approach reduces complexity and improves data interoperability, allowing different processing engines to access and use the same datasets without compatibility issues.

2. Index metadata in Elasticsearch to improve data discovery and search capabilities.
Indexing metadata in Elasticsearch gives teams full-text search and auto-suggest features, making it easier for users to find and use datasets.

3. Adopt a push notification system for data change events to make data pipelines more responsive.
Dependent jobs can then react to data updates as they arrive, improving the efficiency of ETL processes and overall data management.
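The push-notification insight can be illustrated with a small in-process publish/subscribe sketch. Everything here is hypothetical (`PartitionAdded`, `ChangeNotifier`, and the table name); a production system would publish to a message bus rather than call handlers directly. It shows only the pattern of a dependent job reacting to a change event instead of polling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionAdded:
    """Hypothetical change event emitted when a table gains a partition."""

    table: str
    partition: str


class ChangeNotifier:
    """Delivers each event to the handlers subscribed to its table."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, table, handler):
        self._subscribers[table].append(handler)

    def publish(self, event):
        # .get avoids creating empty entries for tables nobody watches.
        for handler in self._subscribers.get(event.table, []):
            handler(event)


triggered = []
notifier = ChangeNotifier()
# A downstream job registers interest in one upstream table.
notifier.subscribe(
    "hive/prod/plays",
    lambda event: triggered.append(f"aggregate {event.partition}"),
)

notifier.publish(PartitionAdded("hive/prod/plays", "dateint=20180601"))
notifier.publish(PartitionAdded("hive/prod/other", "dateint=20180601"))
print(triggered)
```

Only the first event reaches the handler, because the second concerns a table with no subscribers.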

Common Pitfalls

1. Over-reliance on the Hive metastore APIs can lead to performance issues under high load.

Under heavy load, calls through these APIs often time out or run slowly. Interfacing directly with the backend database for partition reads and writes can significantly improve performance.

Related Concepts

Metadata Management

Data Interoperability

Big Data Architecture

Data Discovery Techniques