DataHub: Popular metadata architectures explained

Shirshanka Das

22 min read • advanced

Apache, Elasticsearch, Flask, GraphQL, Monolith, MySQL, Neo4j, Oracle

Overview

The article discusses the evolution of metadata architectures, focusing on three generations of data discovery tools. It emphasizes the importance of selecting the right architecture to enhance data discovery and management in organizations, detailing the strengths and weaknesses of each generation.

What You'll Learn

1. How to evaluate different data catalog architectures for your organization

2. Why metadata freshness is critical for effective data management

3. When to implement a push-based architecture for metadata ingestion

Prerequisites & Requirements

Understanding of data management concepts
Experience with data discovery tools (optional)

Key Questions Answered

What are the different generations of metadata architectures?

The article outlines three generations of metadata architectures: first-generation, which is monolithic and pull-based; second-generation, which introduces a service API with push mechanisms; and third-generation, which is event-sourced and allows for real-time metadata updates. Each generation has its own strengths and weaknesses.
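To make the event-sourced idea concrete, here is a minimal sketch in which an append-only log of change events is the source of truth and the searchable view is derived from it. The names `ChangeEvent` and `EventSourcedCatalog`, and the `dataset:orders` identifier, are hypothetical and chosen only for illustration; they are not taken from DataHub's API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical event: one change to one aspect of one entity."""
    entity: str
    aspect: str
    value: Any


@dataclass
class EventSourcedCatalog:
    """Hypothetical catalog: the log is authoritative, the view is derived."""
    log: list = field(default_factory=list)   # append-only event log
    view: dict = field(default_factory=dict)  # derived index, rebuildable

    @staticmethod
    def _apply(view: dict, event: ChangeEvent) -> None:
        # The latest event for an (entity, aspect) pair wins.
        view.setdefault(event.entity, {})[event.aspect] = event.value

    def emit(self, event: ChangeEvent) -> None:
        # Record the event first, then update the derived view immediately.
        self.log.append(event)
        self._apply(self.view, event)

    def rebuild(self) -> dict:
        # Replaying the full log reproduces the view from scratch.
        fresh: dict = {}
        for event in self.log:
            self._apply(fresh, event)
        return fresh


catalog = EventSourcedCatalog()
catalog.emit(ChangeEvent("dataset:orders", "owner", "alice"))
catalog.emit(ChangeEvent("dataset:orders", "owner", "bob"))
catalog.emit(ChangeEvent("dataset:orders", "description", "Order facts"))
```

Because the view can always be rebuilt by replaying the log, a new index (for example, one for search or lineage) could in principle be added later without re-collecting metadata from the sources.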

Why is metadata freshness important in data catalogs?

Metadata freshness is crucial because outdated metadata can lead to diminished trust in the data catalog. As organizations rely on accurate and timely metadata for data discovery and governance, stale metadata can hinder productivity and decision-making.
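One simple way to reason about freshness is to compare each entry's last-updated timestamp against a maximum acceptable age. The sketch below assumes a hypothetical `stale_entries` helper, made-up dataset names, and an arbitrary one-day threshold; none of these come from the article.

```python
from datetime import datetime, timedelta, timezone


def stale_entries(last_updated: dict, now: datetime, max_age: timedelta) -> list:
    """Return the sorted names of entries whose metadata is older than max_age."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


# Illustrative data only: when each dataset's metadata was last refreshed.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
last_updated = {
    "orders": datetime(2024, 1, 9, 23, 0, tzinfo=timezone.utc),  # 1 hour old
    "customers": datetime(2024, 1, 1, tzinfo=timezone.utc),      # 9 days old
}

stale = stale_entries(last_updated, now, timedelta(days=1))
```

A check like this could feed a dashboard or alert so that stale entries are noticed before users lose trust in the catalog.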

How does a push-based architecture improve metadata ingestion?

A push-based architecture allows metadata producers to send updates directly to the metadata catalog, ensuring that the data is current and reducing the operational burden associated with crawling data sources. This leads to more reliable and timely access to metadata.
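The difference between pull and push can be sketched with a toy model. Here `Catalog`, `Producer`, and `crawl` are hypothetical names for illustration, and the in-memory dictionary stands in for whatever storage a real catalog would use.

```python
class Catalog:
    """Hypothetical in-memory catalog mapping dataset name to column list."""

    def __init__(self):
        self.schemas = {}

    def push(self, dataset, columns):
        # Called by a producer at the moment its schema changes.
        self.schemas[dataset] = list(columns)


class Producer:
    """Hypothetical metadata producer that optionally pushes its changes."""

    def __init__(self, name, columns, catalog=None):
        self.name = name
        self.columns = list(columns)
        self.catalog = catalog

    def add_column(self, column):
        self.columns.append(column)
        if self.catalog is not None:
            self.catalog.push(self.name, self.columns)


def crawl(catalog, producers):
    # Pull model: the catalog only learns about changes when a crawl runs.
    for producer in producers:
        catalog.schemas[producer.name] = list(producer.columns)


# Pull: a change made after the last crawl is invisible until the next one.
pull_catalog = Catalog()
pulled = Producer("orders", ["id"])
crawl(pull_catalog, [pulled])
pulled.add_column("amount")
stale_view = pull_catalog.schemas["orders"]

# Push: the producer notifies the catalog as part of making the change.
push_catalog = Catalog()
pushed = Producer("orders", ["id"], catalog=push_catalog)
pushed.add_column("amount")
fresh_view = push_catalog.schemas["orders"]
```

In the pull case the catalog stays out of date until the crawler runs again, while in the push case the catalog reflects the change as soon as the producer makes it.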

What are the common use cases for data catalogs?

Common use cases for data catalogs include search and discovery of datasets, access control management, data lineage tracking, compliance with data regulations, and data quality monitoring. Each use case requires specific metadata to be effective.

Key Statistics & Figures

Entity and relationship change events handled daily

Over ten million

This statistic highlights the scale at which DataHub operates, demonstrating its capacity to manage extensive metadata efficiently.

Entities and relationships indexed

More than five million

This figure illustrates the extensive coverage of DataHub's metadata management capabilities, supporting a wide range of data discovery needs.

Technologies & Tools

Metadata Management

DataHub

Used as a third-generation metadata architecture to support diverse data discovery and governance use cases.

Data Ingestion

Apache Gobblin

Utilized for managing data ingestion processes driven by metadata from DataHub.

Key Actionable Insights

1. Evaluate your organization's needs before selecting a data catalog architecture.
Understanding the specific use cases and metadata requirements of your organization will help you choose an architecture that supports your data discovery goals effectively.

2. Implement a push-based metadata ingestion system to enhance data freshness.
By allowing metadata producers to push updates, you can maintain a more accurate and timely catalog, which is essential for effective data governance and operational efficiency.

3. Consider the long-term implications of your data catalog choice.
Because data catalogs are hard to replace once adopted and take time to integrate, selecting the right architecture upfront can save significant time and resources later.

Common Pitfalls

1. Relying on a crawling-based ingestion system can lead to stale metadata.

Crawling systems often face operational challenges, such as network issues or configuration changes, which can result in outdated metadata and reduced trust in the data catalog.

2. Centralized metadata teams may struggle to keep up with diverse use cases.

A centralized approach can create bottlenecks, limiting the ability to adapt to the evolving needs of different teams within the organization.

Related Concepts

Data Governance

Metadata Management

Data Discovery Tools

Data Lineage Tracking