NeMo Curator
Automate data curation for large language models using NeMo Curator to improve dataset quality and model performance
NeMo Curator is a community skill for building data curation pipelines using NVIDIA NeMo Curator, covering text filtering, deduplication, quality scoring, language identification, and dataset preparation for training large language models.
What Is This?
Overview
NeMo Curator provides tools for processing and curating large text datasets for language model training. Its capabilities include:
- Text filtering: removes low-quality documents based on configurable heuristic rules and classifier scores.
- Deduplication: identifies and removes duplicate or near-duplicate content using MinHash and exact matching.
- Quality scoring: rates documents by language model perplexity and statistical quality indicators.
- Language identification: classifies document language to enable filtering by target languages.
- Dataset preparation: formats curated text into training-ready formats with proper tokenization.
The skill enables teams to build clean training datasets at scale.
Who Should Use This
This skill serves ML engineers building training data pipelines for language models, data scientists curating web-scale text corpora, and research teams preparing high-quality datasets for model training experiments.
Why Use It?
Problems It Solves
Raw web-scraped text contains significant amounts of spam, boilerplate, and low-quality content that degrades model training. Manual curation does not scale to the billions of documents needed for large language model training. Duplicate content in training data wastes compute and can cause memorization artifacts. Mixed-language datasets require filtering to target specific languages.
Core Highlights
- Text filter: applies rule-based and classifier-driven quality filtering at scale.
- Deduplicator: removes exact and near-duplicate documents using scalable hashing algorithms.
- Quality scorer: rates documents using perplexity and statistical metrics.
- Language detector: classifies and filters documents by language.
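The MinHash idea behind the deduplicator can be illustrated outside NeMo Curator. Below is a minimal pure-Python sketch, not the library's implementation: the shingle size, the MD5-with-seed hashing trick, and the 64-permutation signature length are illustrative assumptions.

```python
import hashlib

def shingles(text, n=3):
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items, num_perm=64):
    """One minimum per seeded hash function approximates a random permutation."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f'{seed}:{s}'.encode()).hexdigest(), 16)
            for s in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = 'the quick brown fox jumps over the lazy dog'
doc_b = 'the quick brown fox jumps over the sleepy dog'
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(f'estimated similarity: {estimated_jaccard(sig_a, sig_b):.2f}')
```

Two near-duplicate documents share most shingles, so their compact signatures agree in most slots; comparing fixed-size signatures instead of full texts is what makes this scale.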
How to Use It?
Basic Usage
from nemo_curator import Sequential
from nemo_curator.filters import WordCountFilter, RepeatedParagraphFilter
from nemo_curator.modules import ExactDuplicates, ScoreFilter

pipeline = Sequential([
    ScoreFilter(
        WordCountFilter(min_words=50, max_words=100000),
        score_field='word_count',
        score_type=int),
    ScoreFilter(
        RepeatedParagraphFilter(max_repeated_ratio=0.3),
        score_field='repeated_para',
        score_type=float),
    ExactDuplicates(
        id_field='doc_id',
        text_field='text',
        hash_method='md5')
])

result = pipeline(dataset)
print(f'Input: {len(dataset)} docs')
print(f'Output: {len(result)} docs')
Real-World Examples
from nemo_curator.modules import FuzzyDuplicates
from nemo_curator.filters import FastTextLangId

class DataCurator:
    def __init__(self, target_lang: str = 'en'):
        self.lang = target_lang
        self.lang_filter = FastTextLangId(model_path='lid.176.bin')

    def filter_language(self, dataset):
        scored = self.lang_filter(dataset)
        return scored.filter(
            lambda row: row['language'] == self.lang
            and row['lang_score'] > 0.8)

    def deduplicate(self, dataset):
        dedup = FuzzyDuplicates(
            id_field='doc_id',
            text_field='text',
            seed=42,
            num_perm=128,
            threshold=0.8)
        return dedup(dataset)

    def curate(self, dataset):
        filtered = self.filter_language(dataset)
        deduped = self.deduplicate(filtered)
        return deduped
Advanced Tips
Run deduplication before quality filtering to reduce the dataset size early in the pipeline and save compute on downstream processing steps. Tune the fuzzy duplicate similarity threshold based on your domain since technical documents tolerate lower thresholds than creative text. Use GPU-accelerated filtering when processing datasets with billions of documents.
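Before committing to a fuzzy-duplicate threshold such as 0.8, it can help to measure similarity on a few labeled pairs from your own corpus. This is a hedged sketch using exact Jaccard similarity over word sets; the sample texts are invented for illustration, and real calibration should use shingles from actual domain documents.

```python
def jaccard(a, b):
    """Exact Jaccard similarity over word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Pairs you consider duplicates vs. pairs you consider distinct.
near_dup = ('install the package with pip', 'install the package using pip')
distinct = ('install the package with pip', 'bake the cake at 180 degrees')

print(f'near-duplicate pair: {jaccard(*near_dup):.2f}')
print(f'distinct pair:       {jaccard(*distinct):.2f}')
```

If your known duplicates score around 0.7 while distinct documents score near 0.1, a 0.8 threshold would miss real duplicates in that domain and should be lowered.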
When to Use It?
Use Cases
Build a curation pipeline that filters and deduplicates web-scraped text for language model pre-training. Remove low-quality and off-language documents from a multilingual corpus to create a clean English training set. Score and rank documents by quality to select the highest-quality subset for model training.
Related Topics
Data curation, text filtering, deduplication, language identification, NeMo, dataset preparation, and language model training data.
Important Notes
Requirements
NVIDIA NeMo Curator Python package with GPU support for accelerated processing. Dask or RAPIDS backend for distributed dataset processing. FastText language identification model for language filtering.
Usage Recommendations
Do:
- Profile your raw dataset characteristics before designing filter rules to understand the quality distribution.
- Run each pipeline stage independently first to measure filtering ratios before combining into a full pipeline.
- Log statistics at each stage to track how many documents are removed and why.
Don't:
- Apply aggressive filtering thresholds without checking whether they remove valid content from specific domains.
- Skip deduplication; even small amounts of duplicate training data can cause memorization problems.
- Process the entire dataset in memory when Dask lazy evaluation can handle out-of-core processing.
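Per-stage logging can be a thin wrapper around each stage. The sketch below is a hypothetical helper, not part of NeMo Curator: stages are modeled as plain callables over lists of dicts so the counting pattern is easy to see.

```python
class LoggedStage:
    """Wrap a pipeline stage and record how many documents it removes."""

    def __init__(self, name, stage, stats):
        self.name, self.stage, self.stats = name, stage, stats

    def __call__(self, docs):
        out = self.stage(docs)
        self.stats.append((self.name, len(docs), len(out)))
        return out

stats = []
pipeline = [
    LoggedStage('min_length',
                lambda ds: [d for d in ds if len(d['text'].split()) >= 3],
                stats),
    LoggedStage('dedup',
                lambda ds: list({d['text']: d for d in ds}.values()),
                stats),
]

docs = [{'text': 'a short doc here'},
        {'text': 'tiny'},
        {'text': 'a short doc here'}]
for stage in pipeline:
    docs = stage(docs)
for name, n_in, n_out in stats:
    print(f'{name}: {n_in} -> {n_out} ({n_in - n_out} removed)')
```

A sudden jump in the removal ratio at one stage usually means a threshold is miscalibrated for the corpus, which is exactly what this per-stage accounting surfaces.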
Limitations
Quality filtering heuristics may remove valid documents that have unusual formatting or domain-specific structure. Fuzzy deduplication accuracy depends on the number of hash permutations, which trades compute cost for precision. Language identification accuracy decreases for short documents and for code-mixed text that contains multiple languages.