Building the Activity Graph, Part 2

Vivek N.

12 min read • intermediate

Apache, Java, REST API

Overview

This article discusses the storage and retrieval mechanisms of the Activity Graph at LinkedIn, focusing on the FollowFeed system. It highlights the challenges of efficient data management and the implementation of the EntityFeatureStore (EFS) for optimized performance.

What You'll Learn

1. How to implement a normalized storage system for feature data in distributed applications
2. Why bloom filters can improve lookup performance when data is sparse
3. How to optimize caching strategies to reduce latency in data retrieval

Prerequisites & Requirements

Understanding of caching mechanisms and distributed systems
Familiarity with RocksDB and caching libraries like Caffeine (optional)

Key Questions Answered

How does FollowFeed manage data for efficient retrieval?

FollowFeed manages data by indexing all Activities in a time-ordered list for each member and company, ensuring that frequently accessed data is kept in memory. This design allows for quick responses to complex queries while minimizing the need for disk access, which can slow down performance.

What role do bloom filters play in the EntityFeatureStore?

Bloom filters in the EntityFeatureStore help optimize memory usage by indicating whether a key may exist in the cache before a lookup is performed. Because a negative answer from a bloom filter is definitive, lookups for absent keys can be skipped, which reduces unnecessary disk access and improves performance, especially when the data is sparse.
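To make the "may exist" check concrete, here is a minimal sketch of a bloom filter placed in front of a slow store. It is not the EntityFeatureStore implementation: the class name `BloomGuardedStore`, the double-hashing scheme, and the in-memory map standing in for on-disk storage are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Illustrative bloom filter guarding a slow store; every name here is hypothetical. */
public class BloomGuardedStore {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    // Stand-in for on-disk storage; a map keeps the sketch runnable.
    private final Map<String, String> slowStore = new HashMap<>();
    private int slowLookups = 0;

    public BloomGuardedStore(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void put(String key, String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
        slowStore.put(key, value);
    }

    /** false: the key is definitely absent; true: the key may be present. */
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public String get(String key) {
        if (!mightContain(key)) {
            return null; // skip the slow lookup entirely
        }
        slowLookups++;
        return slowStore.get(key);
    }

    /** Number of lookups that actually reached the slow store. */
    public int slowLookups() {
        return slowLookups;
    }
}
```

The filter can return a false positive, in which case the slow store is consulted and simply reports a miss, but it never returns a false negative. That asymmetry is why it pays off when most queried keys have no data.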

What challenges arise from de-normalized feature storage?

De-normalized feature storage can lead to excessive memory consumption and added complexity, because data shared across multiple TimelineRecords is copied into each of them. It can also degrade performance: whenever a single piece of shared data changes, every copy has to be updated.
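The contrast with a normalized layout can be sketched as follows. This is an assumed, simplified model rather than the article's actual schema: `TimelineRecord` here holds only an activity ID and a timestamp, feature values live in one shared map, and the feature name `likes` in the usage note is a made-up example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative normalized layout; class, field, and method names are hypothetical. */
public class NormalizedFeatures {
    /** A timeline entry that references an activity instead of copying its features. */
    static final class TimelineRecord {
        final long activityId;
        final long timestampMillis;

        TimelineRecord(long activityId, long timestampMillis) {
            this.activityId = activityId;
            this.timestampMillis = timestampMillis;
        }
    }

    // One copy of the features per activity, shared by every record that references it.
    private final Map<Long, Map<String, Long>> featuresByActivity = new HashMap<>();
    // Per-member timelines that hold references only.
    private final Map<Long, List<TimelineRecord>> timelines = new HashMap<>();

    public void addToTimeline(long memberId, long activityId, long timestampMillis) {
        timelines.computeIfAbsent(memberId, k -> new ArrayList<>())
                 .add(new TimelineRecord(activityId, timestampMillis));
    }

    /** A single write updates the feature for every timeline that references the activity. */
    public void setFeature(long activityId, String name, long value) {
        featuresByActivity.computeIfAbsent(activityId, k -> new HashMap<>()).put(name, value);
    }

    /** Resolves the feature through the shared map at read time; missing values read as 0. */
    public long feature(long memberId, int index, String name) {
        TimelineRecord record = timelines.get(memberId).get(index);
        Map<String, Long> features = featuresByActivity.get(record.activityId);
        if (features == null) {
            return 0L;
        }
        Long value = features.get(name);
        return value == null ? 0L : value;
    }
}
```

If activity 100 appears in the timelines of members 1 and 2, one call such as `setFeature(100L, "likes", 6L)` is visible through both timelines. In a de-normalized layout the same change would require locating and rewriting every copy.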

Technologies & Tools

Database

RocksDB

Used for on-disk storage in the EntityFeatureStore.

Cache

Caffeine

Utilized for caching feature data to improve retrieval speeds.

Key Actionable Insights

1. Implement a normalized storage system for feature data to improve memory efficiency.
By storing feature data in a normalized form, you can reduce redundancy and simplify updates, which is crucial for maintaining performance in large-scale applications.

2. Utilize bloom filters to enhance lookup performance in sparse datasets.
Bloom filters can drastically reduce the number of disk accesses required during data retrieval, making them essential for applications that frequently query large datasets.

3. Adopt a write-through caching strategy to ensure data consistency.
This approach helps maintain data integrity across your caching layers, ensuring that updates are reflected immediately in both the cache and the underlying data store.
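As a rough illustration of the write-through idea, the sketch below updates a backing store and an in-memory cache in the same call. It is an assumption-laden stand-in, not the article's setup: a plain `Map` plays the role of the durable store, an access-ordered `LinkedHashMap` plays the role of a cache library such as Caffeine, and the class name `WriteThroughCache` is hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative write-through cache with LRU eviction; all names are hypothetical. */
public class WriteThroughCache {
    private final Map<String, String> backingStore; // stand-in for the durable store
    private final Map<String, String> cache;

    public WriteThroughCache(Map<String, String> backingStore, final int maxEntries) {
        this.backingStore = backingStore;
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Write-through: the store is updated first, then the cache, in the same call. */
    public void put(String key, String value) {
        backingStore.put(key, value);
        cache.put(key, value);
    }

    /** Reads hit the cache first and fall back to the store on a miss. */
    public String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = backingStore.get(key);
            if (value != null) {
                cache.put(key, value); // repopulate for later reads
            }
        }
        return value;
    }

    public boolean isCached(String key) {
        return cache.containsKey(key);
    }
}
```

Because every write reaches the store before it reaches the cache, an evicted entry can always be reloaded with its latest value. The cost is that each write pays the latency of the slower layer.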

Common Pitfalls

1. Storing features in a de-normalized manner can lead to high memory usage and complex updates.
This occurs because every instance of a feature must be updated individually, which can introduce errors and slow down the system. To avoid this, consider using normalized storage strategies.

Related Concepts

Caching Strategies

Distributed Systems

Data Normalization

Bloom Filters