Text analytics on LinkedIn Talent Insights using Apache Pinot

Overview

The article discusses the implementation of keyword search functionality in LinkedIn Talent Insights (LTI) using Apache Pinot. It highlights the challenges faced with existing taxonomy-based searches and how the new text analytics capabilities enhance user experience by allowing more flexible and comprehensive searches across LinkedIn profiles.

What You'll Learn

  1. How to implement keyword search functionality in Apache Pinot
  2. Why using a text index improves query performance in large datasets
  3. When to apply optimizations to reduce latency in text search queries

Prerequisites & Requirements

  • Understanding of SQL queries and text indexing concepts
  • Familiarity with Apache Pinot and its architecture (optional)

Key Questions Answered

How does keyword search enhance LinkedIn Talent Insights?
Keyword search allows users to search for any text present on LinkedIn profiles, overcoming limitations of the existing taxonomy-based search. This flexibility enables users to generate a wider variety of talent pool metrics, making it easier to analyze emerging skills and technologies.

What are the performance improvements achieved with text indexing in Pinot?
The implementation of text indexing in Apache Pinot resulted in query performance improvements by multiple orders of magnitude, significantly reducing latency for text search queries. This enhancement is crucial given the scale of data, with average raw text column sizes reaching 400GB per table.

What optimizations were made to improve text search query performance?
Optimizations included reducing heap overhead by implementing a custom Collector interface for Lucene, pruning stop words to decrease index size, and mapping Lucene docIds to Pinot docIds to streamline query execution. These changes resulted in a 40-50x improvement in query performance.
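
Two of the optimizations named above, stop-word pruning and docId mapping, can be illustrated with a toy inverted index. This is a minimal sketch in plain Python, not Pinot's or Lucene's actual implementation: the stop-word list, the sample documents, the function names, and the `LUCENE_TO_PINOT` array are all hypothetical and exist only to show why each idea helps.

```python
from collections import defaultdict

# Hypothetical stop-word list; real analyzers ship their own defaults.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "with"})


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_index(docs, stop_words=frozenset()):
    """Build term -> list of internal doc ids, skipping any stop words."""
    index = defaultdict(list)
    for internal_id, text in enumerate(docs):
        for term in sorted(set(tokenize(text)) - stop_words):
            index[term].append(internal_id)
    return dict(index)


def posting_count(index):
    """Total number of (term, doc) postings: a rough proxy for index size."""
    return sum(len(postings) for postings in index.values())


def search(index, term, id_map):
    """Look up a term, then translate internal ids with one array read per hit."""
    return [id_map[internal_id] for internal_id in index.get(term.lower(), [])]


# Hypothetical sample data.
docs = [
    "Experience with the Apache Pinot engine",
    "Worked in machine learning and the search stack",
    "Machine learning with Pinot",
]

full_index = build_index(docs)
pruned_index = build_index(docs, STOP_WORDS)

# Assumed mapping from the text index's internal ids to the datastore's ids,
# computed once so that no document has to be fetched at query time.
LUCENE_TO_PINOT = [10, 11, 12]

matches = search(pruned_index, "Pinot", LUCENE_TO_PINOT)
```

In this sketch, pruning removes the postings for very common words, which shrinks the index, and the precomputed array replaces a per-hit document fetch with a constant-time lookup. The custom Collector optimization is specific to Lucene's search API and is not modeled here.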

Key Statistics & Figures

  • Average raw text column size: 400GB per Pinot table, indicating the scale of data being handled.
  • Performance improvement factor: 40-50x improvement in query performance after optimizations were implemented.
  • Percentage of searches utilizing keyword search: 20% since the launch of the keyword search feature in LTI.

Technologies & Tools

  • Database: Apache Pinot. Used as the OLAP datastore for computing talent metrics and supporting keyword search functionality.
  • Search library: Apache Lucene. Evaluated for text search capabilities and integrated into Pinot for enhanced text indexing.
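
As a rough illustration of how the Lucene-backed text index surfaces in Pinot, the sketch below builds a table-config fragment that enables a text index on a column and a SQL query that uses Pinot's `TEXT_MATCH` predicate with a Lucene phrase query. The table name `profiles`, the column name `profile_text`, and the helper functions are hypothetical, and the field-config keys follow the form commonly shown in Pinot's text index documentation; check them against the Pinot version you run.

```python
import json


def text_index_field_config(column):
    """Field-config entry enabling a text index on a raw (non-dictionary) column."""
    return {"name": column, "encodingType": "RAW", "indexType": "TEXT"}


def keyword_count_query(table, column, phrase):
    """Build a COUNT query matching an exact phrase via TEXT_MATCH.

    The phrase is wrapped in double quotes (Lucene phrase syntax), and single
    quotes are doubled so the value stays a valid SQL string literal.
    """
    lucene_phrase = '"' + phrase.replace('"', '\\"') + '"'
    sql_literal = lucene_phrase.replace("'", "''")
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE TEXT_MATCH({column}, '{sql_literal}')"
    )


# Hypothetical table and column names, for illustration only.
config_fragment = {"fieldConfigList": [text_index_field_config("profile_text")]}
query = keyword_count_query("profiles", "profile_text", "machine learning")

print(json.dumps(config_fragment, indent=2))
print(query)
```

Building the query from user input this way is only a sketch; a production service would validate table and column names and handle the full Lucene query syntax rather than phrases alone.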

Key Actionable Insights

1. Implementing text indexing in your data analytics platform can drastically enhance search capabilities.
   By allowing users to perform keyword searches, you can provide more relevant results and insights, which is particularly beneficial in talent analytics where understanding emerging skills is critical.
2. Regularly optimize your query performance by analyzing execution plans and adjusting indexing strategies.
   As data volumes grow, maintaining performance becomes challenging. Continuous monitoring and optimization can help ensure that your system meets user expectations for speed and responsiveness.

Common Pitfalls

1. Neglecting to optimize query performance can lead to significant latency issues as data volume increases.
   As seen in the article, without proper optimizations, the performance of existing queries can degrade, impacting user experience and system efficiency.

Related Concepts

Text Analytics
Keyword Search Functionality
Data Performance Optimization
Apache Pinot Architecture