Breaking Through Reinforcement Learning Training Limits with Scaling Rollouts in BroRL

Jian Hu

When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome…

NVIDIA


Hugging Face, Reinforcement Learning

Overview

The article introduces Broadened Reinforcement Learning (BroRL), a paradigm for training large language models (LLMs) that scales the number of rollouts per prompt rather than only the number of training steps. It targets the performance plateaus seen when ProRL-style step scaling saturates. In the reported experiments, generation throughput roughly doubles (36.5 to 72.4 samples/s) and scores on the Math, Code, and Reasoning Gym benchmarks improve.

What You'll Learn

1. How to implement BroRL for training large language models
2. Why rollout scaling is crucial for overcoming performance plateaus in reinforcement learning
3. How to achieve higher compute efficiency with BroRL compared to ProRL

Key Questions Answered

What is Broadened Reinforcement Learning (BroRL) and how does it improve LLM training?

BroRL is a new approach that enhances LLM training by increasing the number of exploratory rollouts per prompt, allowing for more efficient data usage and breaking through performance plateaus. It contrasts with previous methods that focused solely on increasing training steps, which often led to diminishing returns.
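To make "rollouts per prompt" concrete, here is a minimal sketch of how a group of rollouts for one prompt might be turned into a learning signal. It assumes a group-mean baseline, a common choice in RLVR setups; the summary does not say which estimator BroRL uses, and the function name and reward values are hypothetical.

```python
from statistics import mean


def group_advantages(rewards):
    """Subtract the group's mean reward from each rollout's reward.

    Assumption: a simple group-mean baseline, not necessarily BroRL's estimator.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Eight hypothetical rollouts for one prompt; 1.0 means the verifier accepted the answer.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
print(advantages)
```

Under this assumption, sampling more rollouts per prompt gives more terms in the baseline and more chances of drawing a rare correct answer to learn from.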

How does BroRL compare to ProRL in terms of performance and efficiency?

BroRL reaches higher scores than ProRL in less training time. For example, BroRL surpassed ProRL's final performance across all metrics after just 98.1 hours.

What are the key benefits of using rollout scaling in reinforcement learning?

Rollout scaling provides a stable learning signal, enhances exploration, and allows models to continuously learn beyond initial plateaus. This method leads to higher-quality policy updates and improved overall model performance.
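One way to read "stable learning signal" is through sampling noise. The sketch below assumes each rollout earns an independent 0/1 reward with a fixed success rate, which is a simplification and not a claim about the article's analysis. Under that assumption, the standard error of the per-prompt mean reward shrinks with the square root of the rollout count. The rollout counts and success rate are illustrative values.

```python
import math


def baseline_std_error(p, n):
    """Standard error of the mean of n independent 0/1 rewards with success rate p."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical rollout counts and success rate, chosen only for illustration.
for n in (16, 64, 256):
    print(n, round(baseline_std_error(0.3, n), 4))
```

In this simplified model, quadrupling the number of rollouts halves the noise in the estimated mean reward.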

What results were observed when applying BroRL to a ProRLv2 model?

When BroRL was applied to a ProRLv2 model that had plateaued, it revived the model's performance, enabling continued improvement past the previous performance ceiling, as shown in the comparative graphs in the original article.

Key Statistics & Figures

Math score

63.66

Achieved by the BroRL method on the Math benchmark.

Code score

56.64

Achieved by the BroRL method on the Code benchmark.

Reasoning Gym score

63.40

Achieved by the BroRL method on the Reasoning Gym benchmark.

Dynamic sampling pass rate

62%

Increased from 41% with ProRL to 62% with BroRL.

Generation throughput

72.4 samples/s

Improved from 36.5 samples/s with ProRL to 72.4 samples/s with BroRL.
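The two efficiency figures above can be put side by side with a little arithmetic. The sketch below also includes a toy model of why more rollouts could raise the dynamic sampling pass rate. It assumes that dynamic sampling keeps only prompts whose rollouts are not all correct or all incorrect, and that rollouts are independent with a fixed success rate. The summary states neither point, so treat the function and its inputs as hypothetical.

```python
def mixed_outcome_probability(p, n):
    """Chance that n independent rollouts with success rate p are not all identical."""
    return 1.0 - p**n - (1.0 - p) ** n


# Figures reported above.
prorl_throughput, brorl_throughput = 36.5, 72.4  # samples/s
prorl_pass_rate, brorl_pass_rate = 0.41, 0.62

throughput_ratio = brorl_throughput / prorl_throughput
pass_rate_gain_points = (brorl_pass_rate - prorl_pass_rate) * 100

print(f"throughput ratio: {throughput_ratio:.2f}x")
print(f"pass-rate gain: {pass_rate_gain_points:.0f} percentage points")

# Hypothetical hard prompt with a 5% per-rollout success rate.
for n in (16, 128):
    print(n, round(mixed_outcome_probability(0.05, n), 3))
```

In the toy model, a prompt the policy rarely solves is far more likely to yield at least one success and one failure when many rollouts are drawn. That is one plausible contributor to the higher pass rate, not a confirmed explanation.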

Technologies & Tools

Hardware

NVIDIA H100

Used for training the models in the experiments.

Key Actionable Insights

1. Implement BroRL to enhance the training of your LLMs by focusing on rollout scaling rather than just increasing training steps. This approach can help you overcome performance plateaus and achieve better model performance more efficiently.

2. Utilize the findings from BroRL to optimize your reinforcement learning strategies, particularly in terms of exploration. By understanding the importance of rollout scaling, you can improve the stability and quality of your model updates.

3. Leverage the state-of-the-art performance achieved by BroRL in reasoning tasks to benchmark your own models. BroRL sets new standards in Math, Code, and Reasoning Gym benchmarks, providing a reference point for model evaluation.

Common Pitfalls

1. Relying solely on increasing training steps can lead to performance plateaus and diminishing returns. This happens because the exploration strategy becomes insufficient, causing the model to stagnate. Instead, focusing on rollout scaling can provide a more stable learning signal.

Related Concepts

Reinforcement Learning

Large Language Models

Prolonged Reinforcement Learning

Dynamic Sampling