Model health assurance platform at LinkedIn

Rajeev Kumar


Overview

The article discusses LinkedIn's Model Health Assurance (HA) platform, which is part of their centralized machine learning platform, Pro-ML. It highlights the importance of monitoring the health of AI models during both training and inference phases to ensure optimal performance and productivity for AI engineers.

What You'll Learn

1. How to monitor the health of AI models during training and inference phases
2. Why data drift monitoring is crucial for maintaining model performance
3. How to implement real-time feature distribution monitoring using InGraphs
4. When to alert AI engineers about significant changes in model performance

Prerequisites & Requirements

Understanding of machine learning model lifecycle
Familiarity with monitoring tools like ThirdEye and InGraphs (optional)

Key Questions Answered

What is the purpose of the Model Health Assurance platform at LinkedIn?

The Model Health Assurance platform at LinkedIn aims to provide AI engineers with tools and systems to quickly identify issues with productionized models. It helps detect symptoms and causes of underperforming models, ensuring a healthy AI ecosystem.

How does LinkedIn monitor data drift in AI models?

LinkedIn monitors data drift by comparing the distributions of input features and prediction variables over time. When a distribution changes significantly, AI engineers are alerted so they can investigate potential issues affecting model performance.
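As a rough illustration of comparing a feature's distribution across two time windows, the sketch below computes a Population Stability Index (PSI). PSI is a common drift score chosen here for illustration only; the summary does not say which statistic LinkedIn uses, and the function names, bin count, and smoothing constant are all assumptions.

```python
import math
from bisect import bisect_right


def histogram_fractions(values, edges):
    """Return the fraction of values in each bin defined by sorted inner edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between a baseline sample and a current sample of one feature.

    Bins are equal-width over the baseline range (an assumption of this
    sketch); values outside that range fall into the first or last bin.
    eps avoids log(0) for empty bins.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    edges = [lo + i * width for i in range(1, n_bins)]
    p = histogram_fractions(baseline, edges)
    q = histogram_fractions(current, edges)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))


# Hypothetical data: last week's feature values vs. a shifted sample from today.
last_week = [i / 100 for i in range(1000)]
today = [x + 5 for x in last_week]
drift_score = population_stability_index(last_week, today)
```

A score of zero means the two binned distributions are identical, and larger values indicate a larger shift. What counts as "significant" is a per-feature tuning decision rather than something the source specifies.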

What metrics are used to monitor model inference latency?

Model inference latency is monitored using mean, 50th, 75th, 90th, and 99th percentile latencies. These metrics help identify performance bottlenecks and ensure that models meet service level agreements.
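To make the latency metrics concrete, here is a minimal sketch that summarizes a batch of latency samples into the mean and the 50th, 75th, 90th, and 99th percentiles. It uses the nearest-rank percentile definition, which is an assumption; production metric systems often use interpolated or approximate percentiles, and the function names here are hypothetical.

```python
import math
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Summarize latency samples (in milliseconds) into mean and percentiles."""
    ordered = sorted(latencies_ms)
    summary = {"mean": mean(ordered)}
    for pct in (50, 75, 90, 99):
        summary[f"p{pct}"] = percentile(ordered, pct)
    return summary


# Hypothetical samples: 100 requests with latencies of 1 ms through 100 ms.
summary = latency_summary(range(1, 101))
```

Tracking the high percentiles alongside the mean matters because a small share of slow requests can breach a service level agreement while the mean still looks healthy.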

When should AI engineers be alerted about changes in model performance?

AI engineers should be alerted when significant drifts in model output variables or input feature variables are observed. This proactive monitoring helps maintain model performance and reliability in production environments.
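One simple way to decide when a change is "significant" is to compare the latest value of a monitored statistic against its recent history. The sketch below flags a value that lies more than a chosen number of standard deviations from the trailing mean. This z-score rule, the threshold of 3.0, and the function name are illustrative assumptions, not a description of how ThirdEye detects anomalies.

```python
from statistics import mean, stdev


def should_alert(history, latest, z_threshold=3.0):
    """Return True if `latest` deviates strongly from the trailing history.

    `history` holds recent values of one monitored statistic, for example
    the daily mean prediction score of a model.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a constant history is notable
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily mean prediction scores for the past six days.
recent_scores = [0.50, 0.51, 0.49, 0.50, 0.52, 0.48]
alert_small_change = should_alert(recent_scores, 0.505)
alert_large_change = should_alert(recent_scores, 0.80)
```

A fixed threshold like this ignores seasonality and trends, so a real alerting system would typically need a more robust baseline to avoid noisy alerts.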

Technologies & Tools


Platform

Pro-ML

Centralized machine learning platform hosting AI models at LinkedIn

Monitoring

ThirdEye

In-house alerting and monitoring system for tracking model performance

Visualization

InGraphs

Internal tool for visualizing feature distributions and monitoring metrics

Data Storage

Pinot

Used for storing and querying feature distribution statistics

Messaging

Kafka

Used for event streaming and metric aggregation

Stream Processing

Samza

Used for aggregating metrics from different hosts

Key Actionable Insights

1. Implement a centralized monitoring system for AI models to enhance productivity.
Centralizing monitoring reduces the effort AI engineers spend on developing individual systems, allowing them to focus on model improvement and deployment.

2. Utilize dark canary testing to identify potential issues before full deployment.
Testing models in a controlled environment helps catch problems early, ensuring that only robust models are released into production.

3. Regularly review and adjust monitoring configurations based on model performance.
As models evolve, their monitoring needs may change. Keeping configurations updated ensures that relevant metrics are tracked effectively.

Common Pitfalls

1. Failing to monitor data drift can lead to significant model performance degradation.
As production data changes over time, models may become less effective if not regularly monitored for drift, leading to poor user experiences.

2. Overloading the monitoring system with too many metrics can cause performance issues.
Collecting excessive metrics can lead to inefficiencies. It's essential to focus on key performance indicators to maintain system responsiveness.

Related Concepts

Machine Learning Lifecycle

Monitoring And Alerting Systems

Data Drift And Model Performance

Real-time Analytics
