Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Kerem Sahin


Overview

The article discusses LinkedIn's open sourcing of DataHub, a metadata search and discovery platform, detailing its development journey from WhereHows to DataHub. It covers the challenges faced in maintaining separate internal and open-source versions, the new automated solutions for syncing codebases, and the architecture of the open-source DataHub.

What You'll Learn

1. How to set up and run the open-source DataHub using Docker
2. Why maintaining separate codebases for open source and internal use is necessary
3. How to automate syncing between internal and open-source repositories

Key Questions Answered

What challenges did LinkedIn face when open sourcing DataHub?

LinkedIn faced challenges in maintaining two separate codebases for DataHub: one for internal use and another for open source. This was due to internal dependencies and features that were not applicable to a broader audience, making synchronization difficult and leading to stale open-source versions.

How does LinkedIn automate contributions to the open-source DataHub?

LinkedIn developed tooling that automatically syncs the internal codebase with the open-source repository. This includes features like syncing code, generating license headers, auto-generating commit logs, and performing dependency testing to prevent breaking changes in the open-source build.
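
The syncing, license-header, and commit-log steps named above can be illustrated with a minimal Python sketch. This is not LinkedIn's tooling: the header text, the `internal/` path prefix, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical header text; the article does not give the real one.
LICENSE_HEADER = "# Licensed under the Apache License, Version 2.0\n"

# Hypothetical prefixes marking internal-only code that must not be published.
INTERNAL_PREFIXES = ("internal/",)


def add_license_header(source: str, header: str = LICENSE_HEADER) -> str:
    """Prepend the header unless the file already starts with it."""
    if source.startswith(header):
        return source
    return header + source


def sync_to_open_source(internal_files: Dict[str, str]) -> Dict[str, str]:
    """Return the files to publish: internal-only paths dropped, headers added."""
    published = {}
    for path, source in internal_files.items():
        if path.startswith(INTERNAL_PREFIXES):
            continue  # skip code that depends on internal infrastructure
        published[path] = add_license_header(source)
    return published


def commit_log(published: Dict[str, str]) -> str:
    """Build a simple commit message listing the synced paths."""
    lines = ["Sync from internal repository", ""]
    lines += [f"- {path}" for path in sorted(published)]
    return "\n".join(lines)
```

Keeping each step a pure function over a path-to-content mapping makes the sketch easy to test; a real tool would read from and write to the two repositories instead.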

What are the main differences between open-source DataHub and LinkedIn's production version?

The production version of DataHub has dependencies on internal code not available in the open-source version, such as LinkedIn's Offspring framework. Additionally, the open-source version uses a single Generalized Metadata Store (GMS) for ease of use, while the production version utilizes multiple GMS instances.

Technologies & Tools

Containerization

Docker

Used for deploying and distributing applications in the open-source DataHub.

Stream Processing

Kafka

Utilized for consuming metadata change events and building search indexes.

Search Engine

Elasticsearch

Serves the search index as one of DataHub's infrastructure components.

Database

MySQL

Stores metadata as part of the DataHub architecture.
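
To make the Kafka-to-search-index flow listed above more concrete, here is a small Python sketch of a consumer that applies metadata change events to an index. A plain list stands in for the Kafka topic and an in-memory inverted index stands in for Elasticsearch; the event fields (`urn`, `name`, `deleted`) are assumptions for illustration, not DataHub's actual event schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class SearchIndex:
    """Tiny in-memory inverted index standing in for a search engine."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}                      # urn -> indexed name
        self._postings: Dict[str, Set[str]] = defaultdict(set)  # token -> urns

    @staticmethod
    def _tokens(name: str) -> List[str]:
        return name.lower().replace("_", " ").split()

    def delete(self, urn: str) -> None:
        old = self._names.pop(urn, None)
        if old is None:
            return
        for token in self._tokens(old):
            self._postings[token].discard(urn)

    def upsert(self, urn: str, name: str) -> None:
        self.delete(urn)  # drop stale tokens before re-indexing
        self._names[urn] = name
        for token in self._tokens(name):
            self._postings[token].add(urn)

    def search(self, term: str) -> Set[str]:
        return set(self._postings.get(term.lower(), set()))


def consume(events: Iterable[dict], index: SearchIndex) -> None:
    """Apply each change event in order, as a stream consumer would."""
    for event in events:
        if event.get("deleted"):
            index.delete(event["urn"])
        else:
            index.upsert(event["urn"], event["name"])
```

Because each event carries the full new state for its entity, replaying the same events in the same order always produces the same index.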

Key Actionable Insights

1. Implementing an automated syncing tool can significantly reduce the overhead of maintaining separate codebases for open-source and internal projects. This approach not only streamlines the development process but also ensures that the open-source community receives timely updates and improvements, fostering better collaboration.

2. Using Docker for microservices in DataHub simplifies deployment and distribution, making it easier for developers to set up their environments. This is especially beneficial for new users who may not have extensive experience with complex infrastructure setups, allowing them to focus on using the platform effectively.

Common Pitfalls

1. Failing to maintain synchronization between open-source and internal codebases can lead to stale versions and missed updates. This often occurs when teams do not have automated processes in place, resulting in increased manual effort and potential for errors.
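
One way to catch the staleness described above is an automated drift check. The following Python sketch compares content hashes between an internal export and the open-source tree; the function names and the path-to-content mapping are hypothetical, and the article does not say that LinkedIn's tooling works this way.

```python
import hashlib
from typing import Dict, List


def digest(source: str) -> str:
    """Content hash, so trees can be compared without shipping full files."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def find_stale(internal: Dict[str, str], open_source: Dict[str, str]) -> List[str]:
    """Paths whose open-source copy is missing or differs from the internal one."""
    stale = []
    for path, source in internal.items():
        published = open_source.get(path)
        if published is None or digest(published) != digest(source):
            stale.append(path)
    return sorted(stale)
```

Running a check like this on a schedule turns a silent drift into a visible list of files to sync.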

Related Concepts

Metadata Management

Data Discovery

Microservices Architecture

Continuous Integration and Deployment