Building the Contacts Platform at LinkedIn

Ravneet Singh Khalsa

14 min read • advanced


Java, MySQL, Oracle, Python

Overview

The article discusses the re-architecture of LinkedIn's contacts and calendar ecosystem, focusing on the migration to a single source of truth for contact data. It highlights the challenges faced during this transition, including maintaining service availability and ensuring data privacy.

What You'll Learn

1. How to migrate large datasets with zero downtime
2. Why a single source of truth is critical for data integrity
3. How to implement personal data routing to reduce costs
4. How to design a scalable NoSQL data model

Prerequisites & Requirements

Understanding of database architectures and data migration strategies
Familiarity with the Espresso and Hadoop ecosystems (optional)
Experience with ETL processes and data modeling

Key Questions Answered

What were the challenges faced during the migration to a new contacts platform?

The migration involved challenges such as maintaining service availability, ensuring data privacy, and migrating hundreds of terabytes of data with zero downtime. The team had to keep both legacy and new systems in sync while managing the complexities of dual-writing data.
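
A minimal sketch of the dual-write idea described above, not LinkedIn's implementation. It assumes the legacy store stays authoritative during the migration and that a failed write to the new store is recorded for a later repair pass; the `DualWriter` class, its `write_new` hook, and the in-memory dicts are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class DualWriter:
    """Writes to the legacy store first, then mirrors the write to the new store."""
    legacy: dict = field(default_factory=dict)        # stand-in for the legacy database
    new: dict = field(default_factory=dict)           # stand-in for the new platform
    failed_keys: list = field(default_factory=list)   # keys that still need a resync

    def write_new(self, key, value):
        # Hypothetical hook: a real system would call the new platform's client here.
        self.new[key] = value

    def put(self, key, value):
        # Assumption: the legacy system is the source of truth while migrating,
        # so its write happens first and is never skipped.
        self.legacy[key] = value
        try:
            self.write_new(key, value)
        except Exception:
            # Do not fail the caller; remember the key so it can be repaired later.
            self.failed_keys.append(key)

    def repair(self):
        # Re-copy every key whose mirrored write failed earlier.
        pending, self.failed_keys = self.failed_keys, []
        for key in pending:
            try:
                self.write_new(key, self.legacy[key])
            except Exception:
                self.failed_keys.append(key)


writer = DualWriter()
writer.put(("member-1", "contact-a"), {"email": "a@example.com"})
```

Keeping the failure list separate from the write path means an outage in the new store does not surface to callers, at the cost of a window in which the two stores disagree until `repair` runs.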

How did LinkedIn ensure data privacy during the migration?

LinkedIn worked closely with the security team to secure member data during the migration. They implemented workflows to ensure that if a member deleted their contacts, the data would also be deleted from the new system, reinforcing their commitment to data privacy.
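
The deletion workflow can be pictured roughly as follows. This is an illustrative sketch only: the function names and the `(member_id, contact_id)` key layout are assumptions, and the summary does not describe how the real workflow is built.

```python
def delete_member_contacts(member_id, legacy_store, new_store):
    """Remove every contact owned by member_id from both stores.

    Both stores are plain dicts keyed by (member_id, contact_id); this layout
    is an assumption made for the example. Returns the number of records removed.
    """
    removed = 0
    for store in (legacy_store, new_store):
        doomed = [key for key in store if key[0] == member_id]
        for key in doomed:
            del store[key]
        removed += len(doomed)
    return removed


def leftover_keys(member_id, *stores):
    """Return any keys for member_id that survived deletion, for verification."""
    return [key for store in stores for key in store if key[0] == member_id]
```

Running a check such as `leftover_keys` after the delete gives a simple way to confirm that no copy of the member's data remains in either system.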

What technology was chosen for the new contacts platform and why?

Espresso was chosen as the database platform for its fault-tolerant and distributed NoSQL capabilities, which met the scalability requirements and provided quick lookups for member data without the need for global indexes.
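
To see why member lookups need no global index, consider a store partitioned by member ID: all of a member's contacts land in one partition, so a read touches only that partition. The sketch below illustrates the general technique with in-memory dicts. It is not the Espresso API, and `NUM_PARTITIONS`, `partition_for`, and `PartitionedContacts` are hypothetical names.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8  # illustrative value, not taken from the article


def partition_for(member_id: str) -> int:
    """Map a member ID to a partition with a stable hash."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


class PartitionedContacts:
    def __init__(self):
        # Layout: partition -> member_id -> contact_id -> record
        self.partitions = [defaultdict(dict) for _ in range(NUM_PARTITIONS)]

    def put(self, member_id, contact_id, record):
        self.partitions[partition_for(member_id)][member_id][contact_id] = record

    def contacts_of(self, member_id):
        # Only the member's own partition is consulted; no cross-partition index.
        return dict(self.partitions[partition_for(member_id)].get(member_id, {}))
```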

What is personal data routing and how was it utilized?

Personal data routing (PDR) was used to write members' contacts data only in their primary and secondary data centers, significantly reducing hardware costs while ensuring disaster recovery. This approach minimized the need for data replication across all data centers.
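
A rough sketch of the routing idea: each member maps to a primary and a secondary data center, and writes go only to those two. The summary does not say how LinkedIn assigns data centers, so the hash-based assignment, the data center names, and the function names here are assumptions for illustration.

```python
import hashlib

DATA_CENTERS = ["dc-a", "dc-b", "dc-c", "dc-d"]  # hypothetical names


def home_data_centers(member_id):
    """Return (primary, secondary) for a member. Assumption: stable hash assignment."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % len(DATA_CENTERS)
    secondary = (primary + 1) % len(DATA_CENTERS)  # always differs from primary
    return DATA_CENTERS[primary], DATA_CENTERS[secondary]


def route_write(member_id, record, stores):
    """Write the record only to the member's two home data centers."""
    targets = home_data_centers(member_id)
    for dc in targets:
        stores[dc][member_id] = record
    return targets


stores = {dc: {} for dc in DATA_CENTERS}
targets = route_write("member-42", {"contacts": 3}, stores)
# Count how many data centers hold a copy: two, rather than all four.
copies = sum("member-42" in store for store in stores.values())
```

With four data centers, this halves the stored copies compared with replicating everywhere, while the secondary copy still covers the loss of the primary.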

Key Statistics & Figures

Data migration size

hundreds of terabytes

This was the scale of data that needed to be migrated to the new contacts platform.

Data accuracy achieved

99.8%

This was the accuracy level reached after validating the dual-write process between the legacy and new systems.

Number of client services impacted

40+

These services needed to be migrated to the new system as part of the re-architecture.
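
An accuracy figure such as the 99.8% above implies some comparison between what the legacy system holds and what the new system holds. The summary does not say how that number was computed; the sketch below shows one plausible definition (the share of keys whose records match in both snapshots), and the function name and sample data are hypothetical.

```python
_MISSING = object()  # sentinel so a missing key never equals a stored None


def dual_write_accuracy(legacy, new):
    """Return (accuracy, mismatched_keys) for two key -> record snapshots.

    Accuracy is the fraction of keys, across both snapshots, whose records
    are identical. Keys are assumed to be sortable (for example, strings).
    """
    all_keys = set(legacy) | set(new)
    if not all_keys:
        return 1.0, []
    mismatched = sorted(
        key for key in all_keys
        if legacy.get(key, _MISSING) != new.get(key, _MISSING)
    )
    return 1 - len(mismatched) / len(all_keys), mismatched


legacy_snapshot = {"k1": "a", "k2": "b", "k3": "c", "k4": "d"}
new_snapshot = {"k1": "a", "k2": "b", "k3": "c", "k4": "stale"}
accuracy, mismatched = dual_write_accuracy(legacy_snapshot, new_snapshot)
# Three of four keys match, so accuracy is 0.75 and "k4" is reported.
```

Returning the mismatched keys alongside the ratio gives the team a concrete list to investigate rather than only a percentage.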

Technologies & Tools


Database

Espresso

Used as the primary database for the new contacts platform due to its scalability and integration within the LinkedIn ecosystem.

Data Processing

Hadoop

Utilized for offline data analysis and ETL processes during the migration.

Messaging

Kafka

Employed for asynchronous data processing in the new architecture.

Workflow Management

Azkaban

Used to schedule and run MapReduce scripts for the data migration.

Key Actionable Insights

1. Implement a dual-write strategy during data migrations to ensure data consistency across legacy and new systems.
   This approach keeps data available in real time while transitioning to a new architecture, minimizing service disruptions.

2. Leverage personal data routing to optimize data storage and reduce costs in distributed systems.
   By storing data only in primary and secondary data centers, organizations can significantly cut hardware expenses while maintaining data accessibility.

3. Conduct thorough performance analysis during system migrations to identify and resolve bottlenecks.
   Regular performance evaluations help pinpoint issues that arise during client ramps, ensuring a smoother transition to the new system.

Common Pitfalls

1. Failing to monitor dual-write metrics can lead to data inconsistencies during migrations.
   Without proper monitoring, discrepancies may arise between legacy and new systems, complicating the verification process.

2. Not optimizing data models for performance can result in slow query responses.
   When migrating to a NoSQL database, it's crucial to redesign schemas to fit the new architecture to avoid performance bottlenecks.
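
For the first pitfall, monitoring can start as simply as counting dual-write outcomes and alerting when too many diverge. The sketch below is one hypothetical shape for such a counter; the class name, fields, and the 0.2% default threshold are arbitrary choices for illustration, not values from the article.

```python
from dataclasses import dataclass


@dataclass
class DualWriteMetrics:
    attempts: int = 0
    new_store_failures: int = 0   # mirrored write raised or timed out
    mismatches: int = 0           # write succeeded but stored values differ

    def record(self, new_write_ok: bool, values_match: bool) -> None:
        """Record the outcome of one dual write."""
        self.attempts += 1
        if not new_write_ok:
            self.new_store_failures += 1
        elif not values_match:
            self.mismatches += 1

    def divergence_rate(self) -> float:
        """Fraction of attempts that left the two systems out of sync."""
        if self.attempts == 0:
            return 0.0
        return (self.new_store_failures + self.mismatches) / self.attempts

    def should_alert(self, threshold: float = 0.002) -> bool:
        # The threshold is an arbitrary example value; tune it per system.
        return self.divergence_rate() > threshold
```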

Related Concepts

Data Migration Strategies

NoSQL Database Design

Distributed Systems Architecture