Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning

Netflix Technology Blog

Fine-tuning, Reinforcement Learning, RLHF

Overview

Netflix introduces Advantage-Weighted Supervised Fine-Tuning (A-SFT), a novel post-training algorithm for generative recommender systems that addresses the unique challenges of applying reinforcement learning techniques to recommendations. The method combines supervised fine-tuning with advantage reweighting to handle noisy reward models, lack of counterfactual data, and unknown logging policies — outperforming PPO, DPO, IPO, and CQL baselines across multiple recommendation quality metrics.

What You'll Learn

1

Why traditional RLHF methods like PPO and DPO are poorly suited for recommendation systems due to lack of counterfactual data and noisy rewards

2

How to apply Advantage-Weighted Supervised Fine-Tuning (A-SFT) to post-train generative recommender models without requiring inverse propensity scoring

3

When to choose between online RL, offline RL, and behavior cloning based on reward model accuracy and generalization ability

4

How reward models in recommendation systems differ from language model reward models in terms of noise and generalization

5

How to benchmark post-training algorithms using NDCG@k, HR@k, MRR, and reward model ensemble evaluation

Prerequisites & Requirements

Understanding of transformer architectures and sequential modeling
Familiarity with reinforcement learning concepts including PPO, DPO, reward models, and advantage functions
Understanding of recommendation system fundamentals including ranking, retrieval, and user behavior modeling
Knowledge of supervised fine-tuning and behavior cloning techniques
Experience with large-scale ML training at production scale (millions of users, billions of tokens) (optional)

Key Questions Answered

Why can't traditional RLHF methods like PPO and DPO be directly applied to recommendation systems?

Traditional RLHF methods require counterfactual feedback — evaluating alternative recommendations a user didn't see. In recommendation systems, user sequences span weeks or years of real-time activity, making it impractical to collect feedback on hypothetical experiences. Additionally, reward models in recommendations are much noisier than in language tasks because user behavior exhibits permutation invariance and lacks the structural rules of language.

What is Advantage-Weighted Supervised Fine-Tuning (A-SFT) and how does it work for recommender systems?

A-SFT is a post-training algorithm that combines supervised fine-tuning with advantage reweighting from reinforcement learning. Despite individual reward estimates having high uncertainty, A-SFT leverages directional signals between high-reward and low-reward events. It does not require inverse propensity scoring or knowledge of the logging policy, and includes a tunable parameter to control policy deviation, making it robust against noisy reward models.
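As a rough illustration of the idea, the sketch below weights each logged event's negative log-likelihood by a function of its estimated advantage. The exact A-SFT weighting is not given in this summary, so the exponentiated-advantage form (borrowed from advantage-weighted regression), the temperature `beta`, and the clip `max_weight` are all assumptions made for illustration, not the article's formula.

```python
import math

def advantage_weighted_nll(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Advantage-weighted negative log-likelihood over logged events.

    log_probs  : log-probability of each logged item under the current model
    advantages : estimated advantage of each logged item
    beta       : hypothetical temperature; a large beta pushes all weights
                 toward 1, i.e. toward plain supervised fine-tuning
    max_weight : hypothetical clip so one noisy estimate cannot dominate
    """
    if not log_probs or len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must be non-empty and equal length")
    log_cap = math.log(max_weight)
    # Clip in log space first so a very large advantage cannot overflow exp().
    weights = [math.exp(min(a / beta, log_cap)) for a in advantages]
    total = sum(weights)
    # Weighted average of the per-event negative log-likelihoods.
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / total
```

In this sketch `beta` plays the role of the tunable parameter that controls policy deviation: as `beta` grows the loss approaches ordinary SFT on the logged data, and as it shrinks the loss concentrates on high-advantage events.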

How do reward models in recommendation systems compare to reward models in language models?

Reward models in recommendation systems have significantly higher uncertainty and noise compared to language model counterparts. Ablation studies show that a reward model trained on millions of users and billions of tokens does not significantly outperform simple baselines like predicting average user reward or average title reward. This is because users only interact with a small subset of available titles, making generalization to unexplored content extremely difficult.
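The baseline comparison described above can be sketched as follows. The data layout (tuples of user, title, reward), the function names, and the use of mean squared error are assumptions for illustration; the article's actual evaluation setup is not described in this summary.

```python
from collections import defaultdict

def mean_baseline_predictor(train, key_index):
    """Predict the mean training reward per user (key_index=0) or per title
    (key_index=1), falling back to the global mean for unseen keys.

    train: list of (user, title, reward) tuples.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for row in train:
        key, reward = row[key_index], row[2]
        sums[key] += reward
        counts[key] += 1
    global_mean = sum(r for _, _, r in train) / len(train)

    def predict(user, title):
        key = (user, title)[key_index]
        return sums[key] / counts[key] if key in counts else global_mean

    return predict

def mse(predict, test):
    """Mean squared error of a predictor over (user, title, reward) tuples."""
    return sum((predict(u, t) - r) ** 2 for u, t, r in test) / len(test)

# Toy data: a learned reward model would be scored with the same mse() call.
train = [("u1", "a", 5.0), ("u1", "b", 4.0), ("u2", "a", 2.0), ("u2", "c", 1.0)]
test = [("u1", "c", 4.0), ("u2", "b", 2.0)]
user_mse = mse(mean_baseline_predictor(train, 0), test)
title_mse = mse(mean_baseline_predictor(train, 1), test)
```

If a learned reward model's error on held-out data is not clearly below both baseline errors, that is a sign its absolute estimates should not be optimized directly.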

How does A-SFT compare to PPO, DPO, IPO, and CQL for post-training generative recommenders?

A-SFT outperforms all baselines across recommendation metrics (NDCG@k, HR@k, MRR) and reward scores. PPO, IPO, and DPO achieve good reward scores but overfit to the reward model. CQL achieves more robust improvements but doesn't fully capture potential reward signals. A-SFT strikes the best balance by leveraging directional reward signals while being less dependent on reward accuracy.

What are the three main challenges of applying post-training to recommendation systems?

The three challenges are: (1) lack of counterfactual observations: users cannot evaluate alternative recommendation histories; (2) noisy reward models: user behavior is more random than language, and permutation invariance makes reward prediction difficult; (3) unknown logging policy: the policy that generated the historical data is unknown, so inverse propensity scoring is unreliable and yields high-variance estimates.
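To make the third challenge concrete, the sketch below shows how sensitive an inverse propensity scoring estimate is to the assumed logging probability. The function names and the toy numbers are hypothetical; this is a generic IPS estimator, not code from the article.

```python
def ips_weight(target_prob, logging_prob):
    """Importance weight: target-policy probability over logging-policy probability."""
    if logging_prob <= 0.0:
        raise ValueError("logging probability must be positive")
    return target_prob / logging_prob

def ips_estimate(samples):
    """IPS value estimate from (reward, target_prob, logging_prob) tuples."""
    return sum(r * ips_weight(p, q) for r, p, q in samples) / len(samples)

# Same logged events, two different guesses for how often the logging
# policy showed a rarely surfaced title (0.01 versus 0.002).
guess_a = [(5.0, 0.2, 0.01), (3.0, 0.5, 0.5)]
guess_b = [(5.0, 0.2, 0.002), (3.0, 0.5, 0.5)]
estimate_a = ips_estimate(guess_a)
estimate_b = ips_estimate(guess_b)
```

A small absolute error in one estimated propensity changes the overall estimate several-fold, which is why an unknown logging policy makes IPS-based methods fragile.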

What is the role of the advantage function in A-SFT versus raw reward weighting?

A-SFT weights the supervised fine-tuning loss by relative advantage, whereas Reward Weighted Behavior Cloning weights it by raw reward. This distinction is critical because the advantage captures directional signals between high- and low-reward events more effectively than raw rewards, especially when reward estimates have high variance but still contain meaningful relative ordering information.
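A small numeric sketch may help show the difference. It assumes the advantage is the reward minus a baseline, and uses the batch mean as that baseline purely for illustration; the article's actual advantage estimator is not described in this summary.

```python
def raw_reward_weights(rewards):
    """Reward-weighted behavior cloning: normalized raw rewards as weights."""
    total = sum(rewards)
    return [r / total for r in rewards]

def advantages(rewards, baseline=None):
    """Advantage as reward minus a baseline (batch mean here, an assumption)."""
    if baseline is None:
        baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Three events scored on the 1-to-5 reward scale.
rewards = [4.0, 4.5, 5.0]
raw = raw_reward_weights(rewards)   # all positive and close to each other
adv = advantages(rewards)           # signed: below, at, and above the baseline
```

With raw rewards every event is up-weighted by a similar amount, so the relative ordering is nearly lost. The advantage is signed and unaffected by adding a constant to all rewards, so it keeps only the directional signal.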

What evaluation metrics are used to benchmark post-training algorithms for generative recommenders?

Four metrics are used: NDCG@k measures ranking quality with logarithmic position discounting normalized against ideal ordering; HR@k measures the proportion of cases where the ground-truth item appears in top-k recommendations; MRR measures the reciprocal rank of the chosen item averaged across test cases; and a Reward Model as Judge metric using an ensemble of reward models to evaluate discounted future rewards with less than 4% standard deviation.
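The three ranking metrics can be sketched as below, assuming each test case has exactly one ground-truth item (a next-item prediction setup, which is an assumption here). With a single relevant item the ideal DCG is 1, so NDCG@k reduces to 1 / log2(rank + 1). Function names are illustrative.

```python
import math

def rank_of(ranked_items, target):
    """1-indexed rank of target in ranked_items, or None if it is absent."""
    for i, item in enumerate(ranked_items, start=1):
        if item == target:
            return i
    return None

def hr_at_k(rank, k):
    """1 if the ground-truth item is in the top k, else 0."""
    return 1.0 if rank is not None and rank <= k else 0.0

def ndcg_at_k(rank, k):
    """Single-relevant-item NDCG: logarithmic discount by position."""
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0

def reciprocal_rank(rank):
    """1 / rank of the ground-truth item, or 0 if it was not ranked."""
    return 1.0 / rank if rank is not None else 0.0

def evaluate(cases, k):
    """cases: list of (ranked_items, ground_truth_item); returns averaged metrics."""
    ranks = [rank_of(items, target) for items, target in cases]
    n = len(ranks)
    return {
        "HR@k": sum(hr_at_k(r, k) for r in ranks) / n,
        "NDCG@k": sum(ndcg_at_k(r, k) for r in ranks) / n,
        "MRR": sum(reciprocal_rank(r) for r in ranks) / n,
    }

cases = [(["a", "b", "c"], "a"), (["a", "b", "c"], "c"), (["a", "b", "c"], "z")]
metrics = evaluate(cases, k=2)
```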

Key Statistics & Figures

Training dataset scale (users)

O(millions)

Training dataset scale (tokens)

O(billions)

Test set size

O(millions)

Reward scale

1 to 5

Proxy reward scoring scale used for the reward model

Reward Model as Judge standard deviation

Less than 4%

Based on ensemble of reward models evaluating discounted reward over a few steps

Technologies & Tools

ML Architecture

HSTU

Open-source generative recommender architecture used as the base model for the study

RL Algorithm

PPO

Standard RLHF algorithm used as a baseline for online reinforcement learning comparison

RL Algorithm

DPO

Direct Preference Optimization used as a baseline, applied with a rejection sampling variant

RL Algorithm

IPO

Identity Preference Optimization used as a baseline alongside DPO with rejection sampling

RL Algorithm

CQL

Conservative Q-Learning used as an offline RL baseline that penalizes Q-value overestimation

ML Architecture

Transformer

Foundation architecture inspiring generative recommenders for sequential transduction tasks

Key Actionable Insights

1
When reward models have high variance but contain directional signals, use advantage-weighted approaches rather than directly optimizing the reward. A-SFT demonstrates that combining supervised fine-tuning with advantage reweighting captures useful relative ordering between high and low reward events without over-exploiting noisy absolute reward estimates.
This is particularly relevant for recommendation systems where reward signals like watch time or click-through rates are inherently noisy and don't always reflect true user satisfaction.

2
Before applying complex RL post-training methods, evaluate your reward model against simple baselines such as average user reward and average item reward. Netflix's ablation study revealed their reward model trained on millions of users didn't significantly outperform these trivial baselines, which fundamentally changes which post-training algorithm is appropriate.
This baseline comparison should be standard practice when the ratio between explored and unexplored items is very small, as is typical in large-scale recommendation systems.

3
Match your post-training algorithm to your reward model's generalization ability rather than defaulting to popular LLM techniques. Online RL (PPO) works well with high-generalization reward models, while A-SFT occupies the sweet spot for moderate-noise scenarios, and pure behavior cloning is appropriate when no reliable reward model exists.
The landscape diagram in the article maps reward model accuracy to appropriate algorithm selection, providing a decision framework for practitioners choosing post-training strategies.

4
Avoid inverse propensity scoring (IPS) in recommendation post-training when the logging policy is unknown or difficult to estimate. IPS introduces high-variance estimates that can degrade model performance. A-SFT provides an alternative that controls policy deviation through a tunable parameter without requiring knowledge of the logging policy.
This is especially important in production recommendation systems where multiple policies may have generated the logged data over time, creating distribution shifts that make logging policy estimation unreliable.

5
Use ensemble reward models for evaluation rather than single reward models to increase confidence in offline evaluation results. Netflix uses an ensemble approach for their 'Reward Model as a Judge' metric, achieving standard deviation below 4%, which provides more reliable benchmarking of post-training methods.
This ensemble evaluation approach is critical because the same reward model noise that motivates A-SFT also makes single-model evaluation unreliable for comparing algorithm performance.

6
Be cautious of reward model overfitting when applying PPO, DPO, or IPO to recommendation systems. These methods achieved good reward scores in Netflix's experiments but caused overfitting, meaning the model exploited reward model weaknesses rather than genuinely improving recommendation quality as measured by NDCG, HR, and MRR.
This overfitting risk is amplified in recommendation domains where reward models have poor generalization compared to language domains, making the gap between reward optimization and true quality improvement larger.
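For the ensemble evaluation in insight 5, a minimal sketch is shown below. The reward-model interface (a callable from a step to a predicted reward), the discount factor `gamma`, and the use of relative standard deviation across ensemble members are all assumptions; the summary does not say how the article computes its reported spread.

```python
import statistics

def discounted_reward(step_rewards, gamma=0.9):
    """Discounted sum of per-step rewards; gamma is a hypothetical discount."""
    return sum((gamma ** t) * r for t, r in enumerate(step_rewards))

def ensemble_judge(reward_models, trajectory, gamma=0.9):
    """Score one recommended trajectory with every reward model in the ensemble.

    reward_models : list of callables mapping a step to a predicted reward
    trajectory    : the few recommended steps to score
    Returns (mean score, relative standard deviation across members).
    """
    scores = [
        discounted_reward([rm(step) for step in trajectory], gamma)
        for rm in reward_models
    ]
    mean = statistics.fmean(scores)
    rel_std = statistics.pstdev(scores) / mean if mean else float("inf")
    return mean, rel_std

# Three stand-in reward models that disagree slightly on every step.
models = [lambda step: 4.0, lambda step: 4.2, lambda step: 3.8]
mean_score, spread = ensemble_judge(models, ["t1", "t2", "t3"])
```

Reporting the spread alongside the mean makes it visible when a claimed improvement is smaller than the disagreement among the judges.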

Common Pitfalls

1

Over-exploiting noisy reward models by applying online RL methods like PPO to recommendation systems where reward model generalization is poor. PPO, DPO, and IPO achieved good reward scores but caused overfitting, meaning the model learned to game the reward model's weaknesses rather than improving genuine recommendation quality.

This happens because recommendation reward models have fundamentally higher uncertainty than language model reward models due to the small ratio of explored to unexplored items and the inherent randomness of user behavior.

2

Assuming watch time or engagement metrics directly reflect user satisfaction when building reward models. A user might stop watching a favorite show due to time constraints, while finishing a lengthy show doesn't necessarily indicate enjoyment. Using these noisy proxies as ground truth can mislead the reward model and downstream post-training.

The article emphasizes that implicit feedback signals in recommendation systems are inherently ambiguous, requiring methods like A-SFT that are robust to this noise rather than methods that amplify it.

3

Applying inverse propensity scoring (IPS) for debiasing when the logging policy is unknown or has changed across multiple versions. In production recommendation systems, the policy that generated logged data is typically unknown and cannot be directly estimated. IPS estimation errors introduce additional biases and suffer from high variance, making offline RL approaches that depend on IPS ill-suited.

This is a systemic issue in recommendation systems where the production policy changes over time through model updates, A/B tests, and business rule changes, making any single logging policy estimate unreliable.

4

Treating recommendation sequences identically to language sequences when applying LLM post-training techniques. Unlike language where grammar rules create strong sequential dependencies, user choices exhibit permutation invariance — swapping the order of events often still produces a valid activity sequence. This fundamental difference makes language-derived reward models less effective for recommendations.

This structural difference between language and recommendation data means that methods proven effective for LLM alignment cannot be directly transferred without accounting for the higher randomness inherent in user behavior.

Related Concepts

Generative Recommender Systems

RLHF (Reinforcement Learning from Human Feedback)

GRPO (Group Relative Policy Optimization)

Behavior Cloning

Contextual Bandits

Inverse Propensity Scoring

Sequential Transduction

Reward Model Generalization

Offline Reinforcement Learning

NDCG (Normalized Discounted Cumulative Gain)

OneRec

Rejection Sampling

Distribution Shift in Recommendations

User Feedback Loops