Open sourcing Feathr – LinkedIn’s feature store for productive machine learning

David Stein

8 min read • intermediate

Azure, Machine Learning

Overview

This article covers the open sourcing of Feathr, the feature store LinkedIn built to simplify machine learning feature management and improve developer productivity. It describes the difficulty of scaling feature pipelines that each team maintains separately, and presents Feathr as a way to share features across projects and, in reported cases, to run feature processing faster.

What You'll Learn

1. How to define and register features using Feathr
2. Why using a feature store like Feathr can improve ML productivity
3. How to implement point-in-time correct feature computation for model training

Prerequisites & Requirements

Understanding of machine learning feature management concepts
Familiarity with GitHub for accessing Feathr (optional)

Key Questions Answered

What is Feathr and how does it improve machine learning workflows?

Feathr is a feature store that provides a common namespace for defining and managing machine learning features. It simplifies the process of feature sharing across projects, reduces the complexity of individual feature pipelines, and enhances productivity by allowing teams to focus on feature engineering rather than pipeline maintenance.
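To make the idea of a common feature namespace concrete, here is a minimal sketch in plain Python. It is not Feathr's API: `FeatureDef`, `FeatureRegistry`, and the `ctr` and `ctr_pct` features are hypothetical names chosen for illustration. The sketch assumes only that each feature name maps to exactly one definition and that a feature may be built from raw columns or from other registered features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical names throughout; this is an illustration, not Feathr's API.
Row = Dict[str, float]


@dataclass(frozen=True)
class FeatureDef:
    name: str
    inputs: Tuple[str, ...]        # raw column names or other feature names
    compute: Callable[..., float]  # receives the input values in order


class FeatureRegistry:
    """One shared namespace: each feature name maps to exactly one definition."""

    def __init__(self) -> None:
        self._defs: Dict[str, FeatureDef] = {}

    def register(self, feature: FeatureDef) -> None:
        # Reject a second definition under the same name.
        if feature.name in self._defs:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._defs[feature.name] = feature

    def resolve(self, name: str, raw: Row) -> float:
        # A name is either a registered feature or a raw column.
        if name in self._defs:
            feature = self._defs[name]
            args = [self.resolve(dep, raw) for dep in feature.inputs]
            return feature.compute(*args)
        return raw[name]


registry = FeatureRegistry()
# A feature defined on raw data.
registry.register(
    FeatureDef("ctr", ("clicks", "impressions"), lambda c, i: c / i if i else 0.0)
)
# A feature defined on an existing feature.
registry.register(FeatureDef("ctr_pct", ("ctr",), lambda ctr: 100.0 * ctr))

row = {"clicks": 5.0, "impressions": 20.0}
ctr_pct = registry.resolve("ctr_pct", row)
```

Any project that imports `ctr_pct` by name gets the same computation, which is the property that lets teams share features instead of rebuilding them in separate pipelines.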

What challenges did LinkedIn face with feature pipelines?

LinkedIn's teams faced challenges with maintaining redundant feature preparation pipelines, which increased costs and complexity. Each team had its own pipeline, making it difficult to share features and leading to inefficiencies in managing machine learning models at scale.

How does Feathr enable point-in-time correct feature computation?

Feathr allows features to be defined and registered based on raw data or existing features. When features are imported into model workflows, their definitions are replayed over historical time-series data, ensuring that features are computed correctly for both training and inference contexts.
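The replay described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Feathr's implementation: the function `clicks_last_7d`, the event list, and the labels are all hypothetical. The point is that one definition takes an `as_of` timestamp, so training evaluates it at each label's own time and inference evaluates it at request time.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[str, datetime]  # (entity id, event time); hypothetical raw data


def clicks_last_7d(events: List[Event], entity: str, as_of: datetime) -> int:
    """Count the entity's events in the 7 days before `as_of` (exclusive)."""
    start = as_of - timedelta(days=7)
    return sum(1 for e, ts in events if e == entity and start <= ts < as_of)


events = [
    ("u1", datetime(2022, 4, 1)),
    ("u1", datetime(2022, 4, 5)),
    ("u1", datetime(2022, 4, 8)),  # occurs after the first label below
    ("u1", datetime(2022, 4, 9)),  # occurs after the first label below
]

# Training: replay the definition at each label's own timestamp, so a row
# never sees events that happened after its label was observed.
labels = [("u1", datetime(2022, 4, 6), 1), ("u1", datetime(2022, 4, 10), 0)]
training_rows = [(clicks_last_7d(events, e, ts), y) for e, ts, y in labels]

# Inference: the same definition, evaluated at request time.
serving_value = clicks_last_7d(events, "u1", datetime(2022, 4, 13))
```

The first training row counts only the two events before April 6, even though later events already exist in the log when the training set is built.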

What performance improvements have been observed with Feathr?

Feathr has been reported to run as much as 50% faster than the custom feature processing pipelines used previously. Separately, teams report that the engineering time needed to add and experiment with new features dropped from weeks to days.

Key Statistics & Figures

Performance improvement

50%

Feathr has been shown to run as much as 50% faster than custom feature processing pipelines.

Reduction in engineering time

From weeks to days

Using Feathr has allowed teams to significantly decrease the time required to add and experiment with new features.

Technologies & Tools

Feature Store

Feathr

Used for managing and serving machine learning feature data.

Version Control

GitHub

Hosts Feathr's source code and documentation for access and collaboration.

Key Actionable Insights

1. Use Feathr to streamline your machine learning feature management processes.
By adopting Feathr, teams can reduce the time spent on maintaining custom feature pipelines, allowing them to focus on innovation and improving their applications.

2. Leverage the shared feature namespace in Feathr to enhance collaboration across teams.
This collaboration can lead to better feature reuse and improved performance metrics, as teams can easily access and integrate features developed by others.

3. Implement point-in-time correct feature computation to ensure model accuracy.
This approach is crucial for avoiding training-serving skew, which can degrade model performance if features are not consistently computed.
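One way features end up inconsistent between training and serving is a join that picks the latest stored value regardless of when the label was observed. The sketch below contrasts that with an as-of lookup. It uses only Python's standard `bisect` module, and the `history` data and function names are hypothetical, not taken from Feathr.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical feature history for one entity: (timestamp, value) pairs,
# sorted by timestamp.
history: List[Tuple[int, float]] = [(10, 0.2), (20, 0.5), (30, 0.9)]


def latest_value(history: List[Tuple[int, float]]) -> float:
    """Naive join: always take the newest value, whatever the label time."""
    return history[-1][1]


def value_as_of(history: List[Tuple[int, float]], ts: int) -> Optional[float]:
    """Most recent value recorded at or before `ts`, or None if none exists."""
    times = [t for t, _ in history]
    idx = bisect_right(times, ts)
    return history[idx - 1][1] if idx else None


label_time = 25
leaky = latest_value(history)               # value written at t=30, after the label
correct = value_as_of(history, label_time)  # value written at t=20
```

A model trained on `leaky` sees a value that was not yet available at time 25, so the training data no longer matches what the model will see when serving.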

Common Pitfalls

1. Failing to standardize feature definitions across teams can lead to inconsistencies and inefficiencies.

Without a common abstraction for features, teams may struggle with feature reuse and face increased complexity in their machine learning workflows.

Related Concepts

Feature Engineering

Machine Learning Pipelines

Data Management In ML