
NeMo Curator

Skill · Verified · Active

GPU-accelerated data curation for LLM training. Supports text, image, video, and audio data. Features include fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it to prepare high-quality training datasets, clean web data, or deduplicate large corpora.

Purpose

To enable efficient and high-quality data preparation for LLM training by leveraging GPU acceleration for complex curation tasks.

Features

  • GPU-accelerated data curation
  • Support for text, image, video, and audio
  • Fuzzy deduplication (16× faster)
  • Quality filtering (30+ heuristics)
  • PII redaction and NSFW detection
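To make the fuzzy deduplication feature concrete, here is a minimal CPU-only sketch of the general MinHash idea in plain Python. It is not NeMo Curator's API or its GPU implementation; the function names, the 3-word shingle size, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=64):
    """Build a MinHash signature: one minimum hash value per seeded hash function."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group (naive O(n^2) comparison)."""
    kept = []  # (index, signature) pairs of documents retained so far
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for _, s in kept):
            kept.append((i, sig))
    return [docs[i] for i, _ in kept]
```

This sketch compares every document against every kept document, which only works for small inputs. Tools built for large corpora typically bucket signatures with locality-sensitive hashing so that only likely matches are compared.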

Use Cases

  • Preparing LLM training data from web scrapes
  • Curating multi-modal datasets
  • Filtering low-quality or toxic content
  • Scaling data processing across GPU clusters
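As a rough illustration of what heuristic quality filtering can look like, the sketch below chains three simple checks in plain Python. These are hypothetical examples, not the skill's actual 30+ heuristics and not NeMo Curator's API; every function name and threshold is an assumption.

```python
def word_count_ok(text, min_words=5, max_words=100_000):
    """Reject documents that are too short or too long (assumed bounds)."""
    n = len(text.split())
    return min_words <= n <= max_words


def alpha_ratio_ok(text, min_ratio=0.6):
    """Require that most non-whitespace characters are alphabetic."""
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    return sum(c.isalpha() for c in non_space) / len(non_space) >= min_ratio


def repeated_line_ratio_ok(text, max_ratio=0.3):
    """Reject documents dominated by repeated lines, a common boilerplate signal."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    duplicates = len(lines) - len(set(lines))
    return duplicates / len(lines) <= max_ratio


# Hypothetical filter chain: a document must pass every heuristic to be kept.
HEURISTICS = [word_count_ok, alpha_ratio_ok, repeated_line_ratio_ok]


def passes_quality_filters(text):
    """Return True only if the text passes all heuristics in the chain."""
    return all(check(text) for check in HEURISTICS)
```

For example, `[d for d in docs if passes_quality_filters(d)]` keeps only documents that clear all three checks. In practice the thresholds would be tuned per corpus and language.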

Non-Goals

  • CPU-based data processing
  • Basic data cleaning without advanced curation features
  • Use cases outside of LLM training data preparation

Scope

  • Tool surface size: The skill exposes numerous filters, modules, and classifiers. The surface is extensive, but it is organized within a library structure rather than as a flat list of tools.

Installation

First, add the marketplace, then install the plugin:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed about 22 hours ago

Trust Signals

Last commit: 16 days ago
Stars: 8.3k
License: MIT

Similar Extensions

NeMo Curator

98

GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

Skill
davila7

Create Spatial Visualization

100

Create interactive maps, elevation profiles, and spatial visualizations from GPX tracks, waypoints, or route data using R (sf, leaflet, tmap) or Observable (D3, deck.gl). Covers data import, coordinate system handling, map styling, and export to HTML or image formats. Use when visualizing a planned or completed tour route on an interactive map, creating elevation profiles for hiking or cycling routes, overlaying waypoints and POIs on a basemap, or building a web-based trip dashboard.

Skill
pjt222

PyTDC (Therapeutics Data Commons)

99

Therapeutics Data Commons. AI-ready drug discovery datasets (ADME, toxicity, DTI), benchmarks, scaffold splits, molecular oracles, for therapeutic ML and pharmacological prediction.

Skill
K-Dense-AI

Pysam

99

Genomic file toolkit. Read/write SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences, extract regions, calculate coverage, for NGS data processing pipelines.

Skill
K-Dense-AI

Polars Bio

99

High-performance genomic interval operations and bioinformatics file I/O on Polars DataFrames. Overlap, nearest, merge, coverage, complement, subtract for BED/VCF/BAM/GFF intervals. Streaming, cloud-native, faster bioframe alternative.

Skill
K-Dense-AI

Polars

99

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

Skill
K-Dense-AI

© 2025 SkillRepo · Find the right skill, skip the noise.