Using set cover algorithm to optimize query latency for a large scale distributed graph

Rui Wang
6 min read · advanced

Overview

The article discusses the implementation of a greedy set cover algorithm to optimize query latency in a large-scale distributed graph system at LinkedIn. It highlights the challenges faced in query routing and the enhancements made to the algorithm that resulted in significant performance improvements.

What You'll Learn

1. How to apply a greedy set cover algorithm to optimize query latency in distributed systems
2. Why caching second-degree connections can improve graph query performance
3. When to utilize partitioned graph databases for scalable applications

Prerequisites & Requirements

  • Understanding of distributed systems and graph databases
  • Familiarity with caching mechanisms (optional)

Key Questions Answered

How does the greedy set cover algorithm reduce query latency in distributed graphs?
The greedy set cover algorithm reduces the number of GraphDB nodes accessed during second-degree cache computation, which in turn reduces latency. At each step it selects the node that covers the most first-degree connections not yet covered, so fewer remote calls are needed to reach all of them and overall query performance improves.
What improvements were observed after implementing the set cover algorithm?
After implementing the set cover algorithm, second-degree cache creation time dropped by 38% at the 99th percentile, and 99th-percentile latency for graph distance queries decreased by 25%.
What are the main components of LinkedIn's distributed graph system?
LinkedIn's distributed graph system consists of three main components: GraphDB, which is a partitioned and replicated graph database; the Network Cache Service (NCS), which stores a member's network; and an API layer that serves as the access point for front-ends. These components work together to handle high query volumes efficiently.
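The greedy selection described in the first answer above can be sketched as follows. This is a minimal illustration of the textbook greedy set cover heuristic, not LinkedIn's production code: the node names, the member IDs, and the function name `greedy_set_cover` are hypothetical, and the sketch assumes we already know which first-degree connections each GraphDB node can serve.

```python
from typing import Dict, List, Set


def greedy_set_cover(universe: Set[int], candidates: Dict[str, Set[int]]) -> List[str]:
    """Greedily pick candidate names until every element of `universe` is covered."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered elements.
        best = max(
            candidates,
            key=lambda name: len(candidates[name] & uncovered),
            default=None,
        )
        if best is None or not (candidates[best] & uncovered):
            raise ValueError(f"elements not covered by any candidate: {uncovered!r}")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Hypothetical layout: the first-degree connections (member IDs) each node can serve.
nodes = {
    "node_a": {1, 2, 3},
    "node_b": {3, 4},
    "node_c": {4, 5, 6},
    "node_d": {1, 6},
}
first_degree = {1, 2, 3, 4, 5, 6}
selected = greedy_set_cover(first_degree, nodes)
print(selected)
```

Each selected node corresponds to one remote call, so a shorter result list means fewer calls. The greedy heuristic is an approximation: it usually returns a small cover but does not guarantee the smallest possible one.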

Key Statistics & Figures

Second-degree cache creation time reduction
38%
Observed in the 99th percentile after implementing the set cover algorithm.
Decrease in 99th-percentile latency for graph distance queries
25%
This improvement was noted following the application of the set cover algorithm.
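To make the percentile figures concrete, the sketch below shows one common way to compute a 99th-percentile latency (the nearest-rank method) and the relative reduction between two measurements. The sample latencies are synthetic and invented purely for illustration; they are not LinkedIn's data, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after` (0.25 means a 25% drop)."""
    return (before - after) / before


# Synthetic latencies in milliseconds, made up for this example only.
before_ms = [float(x) for x in range(1, 101)]
after_ms = [x * 0.75 for x in before_ms]

p99_before = percentile_nearest_rank(before_ms, 99)
p99_after = percentile_nearest_rank(after_ms, 99)
print(f"p99 reduction: {relative_reduction(p99_before, p99_after):.0%}")
```

A reduction quoted at the 99th percentile describes the slowest 1% of requests, which is why tail-latency improvements can matter even when the median barely moves.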

Technologies & Tools

Database
GraphDB
Used as a partitioned and replicated graph database to store member connections.
Backend
Network Cache Service (NCS)
Serves as a caching layer to store second-degree connections and optimize query performance.

Key Actionable Insights

1. Implementing a caching layer for second-degree connections can significantly reduce query latency.
By caching frequently accessed data, such as second-degree connections, systems can avoid costly remote calls, leading to faster response times for users.
2. Utilizing a greedy set cover algorithm can optimize resource usage in distributed systems.
This approach keeps the number of nodes accessed to fulfill a query close to the minimum, which is crucial for maintaining performance as the scale of data and user requests increases.
3. Regularly evaluate and enhance algorithms based on system architecture to maintain low latency.
As systems evolve and grow, the initial algorithms may become less efficient. Continuous assessment and adaptation can help sustain optimal performance.
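The caching idea in the first insight can be illustrated with a small in-memory sketch. It assumes an adjacency map stands in for the remote GraphDB lookups, and the class name `SecondDegreeCache`, its methods, and the sample graph are all hypothetical. Cache invalidation and eviction, which a real service would need, are left out.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]


class SecondDegreeCache:
    """Caches each member's second-degree connections after the first computation."""

    def __init__(self, graph: Graph) -> None:
        self._graph = graph              # stand-in for remote GraphDB lookups
        self._cache: Dict[int, Set[int]] = {}
        self.builds = 0                  # how many times a cache entry was computed

    def second_degree(self, member: int) -> Set[int]:
        if member not in self._cache:
            first = self._graph.get(member, set())
            reachable: Set[int] = set()
            for friend in first:
                reachable |= self._graph.get(friend, set())
            # Exclude the member and their direct connections.
            self._cache[member] = reachable - first - {member}
            self.builds += 1
        return self._cache[member]

    def distance(self, source: int, target: int) -> int:
        """Return 0, 1, or 2; 3 stands for 'three or more hops, or unreachable'."""
        if source == target:
            return 0
        if target in self._graph.get(source, set()):
            return 1
        if target in self.second_degree(source):
            return 2
        return 3


# Hypothetical undirected graph stored as adjacency sets.
graph: Graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 4, 5}, 4: {2, 3, 6}, 5: {3}, 6: {4}}
cache = SecondDegreeCache(graph)
print(cache.distance(1, 4), cache.distance(1, 6), cache.builds)
```

Repeated distance queries for the same source member reuse the cached set, so the fan-out over first-degree connections happens only once per member.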

Common Pitfalls

1. Relying solely on classic greedy algorithms without considering system-specific optimizations can lead to latency issues.
Classic implementations may not account for unique system architectures, resulting in unnecessary delays during node discovery.

Related Concepts

Distributed Systems
Graph Databases
Caching Strategies
Algorithm Optimization