Evolving LinkedIn’s analytics tech stack

LinkedIn Engineering Team
10 min readintermediate
--
View Original

Overview

The article discusses LinkedIn's transition from a proprietary analytics tech stack to an open-source big data technology stack, detailing the challenges faced and the improvements made during the migration. It highlights the benefits of this transition, including cost savings and enhanced control over data management.

What You'll Learn

1

How to effectively plan a large-scale data migration using dataset lineage

2

Why transitioning to an open-source tech stack can improve data management and reduce costs

3

How to automate user migration and dataset deprecation processes

Prerequisites & Requirements

  • Understanding of data migration concepts and practices
  • Familiarity with Hadoop and data pipeline tools(optional)

Key Questions Answered

What were the main challenges faced during LinkedIn's analytics tech stack migration?
The migration faced challenges such as maintaining dataset lineage, managing user migration, and ensuring minimal disruption to existing services. The team had to navigate the complexities of transitioning from a proprietary system to an open-source framework while ensuring data integrity and performance.
How did LinkedIn improve its data ecosystem during the migration?
LinkedIn took the opportunity to enhance its data models and pipeline performance, consolidating 1424 datasets down to 450, which cut the migration workload by approximately 70%. This re-evaluation allowed for better data management and streamlined analytics.
What tools did LinkedIn develop to assist with dataset lineage and usage?
LinkedIn created a data product to provide upstream and downstream dependencies of datasets and built data pipelines to extract usage information. This was essential for planning the migration and ensuring efficient data management.
What benefits did LinkedIn gain from transitioning to an open-source tech stack?
The transition allowed for democratization of data access, enabling more teams to develop data pipelines. It also provided the freedom to innovate and enhance the tech stack without the constraints of proprietary systems, leading to significant cost savings.

Key Statistics & Figures

Number of datasets consolidated
1424 to 450
This consolidation reduced the migration workload by approximately 70%.
Number of users migrated
2100+
The migration included orchestrating the transition of over 2100 users from the old system.
Improvement in read speed after migrating to ORC
10-1000x
The migration to ORC format significantly enhanced read performance compared to the previous Avro format.
Reduction in runtime after migrating to Spark
~80%
Migrating poorly performing Hive/Pig flows to Spark resulted in substantial runtime reductions.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Hadoop
Used as the foundation for LinkedIn's new analytics tech stack.
Data Format
Orc
Migrated from Avro to ORC to improve read performance.
Data Processing
Spark
Replaced poorly performing Hive/Pig flows to enhance processing efficiency.
Database
Mysql
Used as the operational datastore for coordinating user migration and dataset deprecation.

Key Actionable Insights

1
Leverage dataset lineage tools to streamline data migration processes.
Understanding the dependencies and usage of datasets can significantly reduce complexity during migrations and help prioritize which datasets to move first.
2
Consider the long-term benefits of transitioning to an open-source tech stack.
While the initial transition may require significant effort, the flexibility and cost savings in the long run can greatly enhance your organization's data capabilities.
3
Automate user migration and dataset deprecation to minimize manual errors.
Automation can save time and reduce the risk of human error, making the migration process more efficient and less costly.

Common Pitfalls

1
Failing to account for dataset lineage can complicate migration efforts.
Without understanding the dependencies between datasets, migrations can lead to data integrity issues and increased complexity, making it harder to manage the transition.
2
Underestimating the resources needed for a technology transition.
Transitioning from a proprietary system to an open-source stack often requires significant engineering efforts to match performance and features, which can be overlooked during planning.

Related Concepts

Data Migration Strategies
Open-source Technology Benefits
Data Pipeline Optimization Techniques