Building Nemotron-CC, A High-Quality Trillion Token Dataset for LLM Pretraining from Common Crawl

Nirmal Kumar Juluru

Curating high-quality pretraining datasets is critical for enterprise developers aiming to train state-of-the-art large language models (LLMs).

NVIDIA

Overview

The article describes the development of Nemotron-CC, a high-quality trillion-token dataset built from Common Crawl data for pretraining large language models (LLMs). It highlights the data curation pipeline integrated into NVIDIA's NeMo Curator, which balances data quality and quantity through techniques such as classifier ensembling and synthetic data generation.

What You'll Learn

1. How to use the Nemotron-CC pipeline to generate high-quality datasets for LLM pretraining

2. Why synthetic data generation is crucial for enhancing dataset quality

3. How to implement quality labeling using ensemble classifiers in data curation

Prerequisites & Requirements

Understanding of large language models and data curation techniques
Familiarity with NVIDIA NeMo Curator and its APIs (optional)

Key Questions Answered

What is the Nemotron-CC dataset and its significance?

The Nemotron-CC dataset is a high-quality trillion-token dataset derived from Common Crawl data, designed for pretraining large language models. It aims to improve model accuracy by providing a refined corpus that balances data quality and quantity, addressing limitations of traditional data curation methods.

How does the Nemotron-CC pipeline improve data quality?

The Nemotron-CC pipeline enhances data quality through a combination of perplexity scoring, ensemble quality labeling, and synthetic data generation. This approach allows for the retention of valuable information that traditional filtering methods might discard, resulting in a more robust dataset for training LLMs.
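The summary does not spell out how the ensemble combines its classifiers, so the following is only a minimal sketch under stated assumptions: each classifier's raw score is mapped to an integer bucket using its own cut points, and a document receives the highest bucket any classifier assigns. The function names, cut points, and scores are all hypothetical and are not taken from NeMo Curator.

```python
from bisect import bisect_right


def to_bucket(score, cut_points):
    """Map a raw classifier score to an integer bucket via sorted cut points."""
    return bisect_right(cut_points, score)


def ensemble_label(scores, cut_points_per_classifier):
    """Bucket each classifier's score, then keep the most optimistic bucket.

    Taking the maximum is an assumption made for this sketch; it means a
    document rated highly by any one classifier is treated as high quality.
    """
    buckets = [
        to_bucket(score, cuts)
        for score, cuts in zip(scores, cut_points_per_classifier)
    ]
    return max(buckets)


# Hypothetical cut points: the two classifiers use different score scales,
# which is why each score is bucketed before the scores are combined.
CUTS = [
    [0.2, 0.4, 0.6, 0.8],  # classifier A, scores in 0..1
    [1.0, 2.0, 3.0, 4.0],  # classifier B, scores in 0..5
]

# Hypothetical per-document scores: [classifier A, classifier B]
docs = {
    "doc_a": [0.15, 0.5],
    "doc_b": [0.55, 3.5],
    "doc_c": [0.95, 1.5],
}

labels = {name: ensemble_label(scores, CUTS) for name, scores in docs.items()}
print(labels)
```

Bucketing before combining keeps classifiers with different score ranges comparable; other combination rules (mean, voting) would fit the same structure.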

What are the results of training Llama 3.1 with Nemotron-CC data?

Training Llama 3.1 on a 1-trillion-token subset of Nemotron-CC improved the MMLU score by 5.6 points compared with training on the DCLM dataset. Training on 15 trillion tokens, 7.2 trillion of them from Nemotron-CC, raised the MMLU score to 70.3, a 5-point increase over the original Llama 3.1 score.

Key Statistics & Figures

Total tokens in Nemotron-CC dataset

6.3 trillion

This dataset is designed for pretraining large language models.

Improvement in MMLU score

5.6 points

This improvement was observed when training Llama 3.1 on a 1-trillion-token subset of Nemotron-CC rather than on DCLM.

MMLU score after training with Nemotron-CC

70.3

This score was achieved when training Llama 3.1 with 15 trillion tokens, including 7.2 trillion from the Nemotron-CC dataset.
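As a quick sanity check on the figures above, the short calculation below derives the share of the 15-trillion-token run that came from Nemotron-CC and the baseline MMLU score implied by a 5-point gain. The variable names are illustrative, and the implied baseline is only as precise as the rounded "5-point" figure in the summary.

```python
# Figures quoted in the statistics above (token counts in trillions).
total_tokens_t = 15.0
nemotron_cc_tokens_t = 7.2
final_mmlu = 70.3
reported_gain = 5.0  # the summary's rounded "5-point increase"

share = nemotron_cc_tokens_t / total_tokens_t
implied_baseline = final_mmlu - reported_gain

print(f"Nemotron-CC share of training tokens: {share:.0%}")
print(f"Implied baseline MMLU: {implied_baseline:.1f}")
```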

Technologies & Tools

Backend

NVIDIA NeMo Curator

Used for curating and processing datasets for LLM pretraining.

Tools

fastText

Utilized for language identification in the data curation pipeline.

KenLM

Employed for generating perplexity scores to filter documents.

Key Actionable Insights

1. Leverage the Nemotron-CC pipeline to enhance your LLM training datasets by integrating synthetic data generation techniques.
This approach can significantly improve the quality of your datasets, especially when dealing with large-scale pretraining tasks that require diverse and high-quality data.

2. Utilize the ensemble quality labeling method to effectively categorize and prioritize your training data.
By implementing this method, you can ensure that your models are trained on the highest quality data, which is crucial for achieving better performance on complex reasoning tasks.

3. Explore the use of perplexity scoring to filter out low-quality text from your datasets.
This technique helps maintain a high standard for the data used in training, which can lead to improved model accuracy and reliability.
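To make the perplexity-scoring idea concrete, here is a self-contained sketch. It does not use KenLM: a tiny add-one-smoothed unigram model stands in for the real n-gram language model so the example runs without a model file. The reference text, the threshold, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens, alpha=1.0):
    """Return a log-probability function for an add-alpha smoothed unigram model."""
    counts = Counter(corpus_tokens)
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens
    total = sum(counts.values())

    def log_prob(token):
        return math.log((counts.get(token, 0) + alpha) / (total + alpha * vocab_size))

    return log_prob


def perplexity(text, log_prob):
    """Perplexity = exp(average negative log-likelihood per token)."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")  # treat empty documents as maximally surprising
    avg_nll = -sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


# Toy reference text standing in for the corpus a real model would be trained on.
reference = "the model is trained on clean text and the text is filtered".split()
log_prob = train_unigram(reference)

docs = ["the text is clean", "zxqv lorem qqq buy buy"]
MAX_PERPLEXITY = 15.0  # hypothetical threshold; tune on held-out data

kept = [d for d in docs if perplexity(d, log_prob) <= MAX_PERPLEXITY]
print(kept)
```

Lower perplexity means the text looks more like the reference corpus. In practice the threshold is a tuning decision, since an aggressive cutoff can also remove unusual but valuable text.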

Common Pitfalls

1. Relying solely on heuristic filtering methods can lead to the loss of valuable data.
Traditional filtering often discards low-quality text that may still contain useful information, which can hinder model performance on complex tasks.
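One way to avoid this pitfall is to route documents by quality label rather than dropping the low-scoring ones. The sketch below is an assumption-laden illustration, not the pipeline's actual policy: the action names, the threshold, and the idea that labels are small integers are all hypothetical.

```python
from collections import Counter


def route_document(quality_label, high_threshold=3):
    """Pick a follow-up action instead of discarding low-scoring text.

    Both actions are placeholders: "keep_and_augment" stands for keeping the
    original and deriving synthetic variants from it, and "rephrase" stands
    for asking an LLM to rewrite the text so its information is retained.
    """
    if quality_label >= high_threshold:
        return "keep_and_augment"
    return "rephrase"


# Hypothetical integer quality labels for a handful of documents.
quality_labels = {"doc_a": 0, "doc_b": 3, "doc_c": 4, "doc_d": 1}

plan = {name: route_document(label) for name, label in quality_labels.items()}
summary = Counter(plan.values())
print(plan)
print(dict(summary))
```

Nothing is deleted in this scheme; every document ends up with an action, which is the point of contrast with a hard heuristic filter.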

Related Concepts

Data Curation Techniques

Synthetic Data Generation

Quality Assessment in Datasets