NeMo Curator

Skill Verified Active

Part of: Agent Native Research Artifact (ARA) Tooling

GPU-accelerated data curation for LLM training. Supports text, image, video, and audio data. Features include fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it to prepare high-quality training datasets, clean web data, or deduplicate large corpora.

Purpose

To enable efficient and high-quality data preparation for LLM training by leveraging GPU acceleration for complex curation tasks.

Features

GPU-accelerated data curation
Support for text, image, video, and audio
Fuzzy deduplication (16× faster)
Quality filtering (30+ heuristics)
PII redaction and NSFW detection
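
To make the fuzzy deduplication feature concrete, here is a minimal CPU-only sketch of MinHash, one common technique for finding near-duplicate documents. It is an illustration under stated assumptions and does not use or reproduce NeMo Curator's API: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `find_near_duplicates`), the 5-character shingle size, the 128 hash functions, and the 0.8 threshold are all hypothetical choices made for this example.

```python
import hashlib
from itertools import combinations


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the character k-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def find_near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of document IDs whose estimated similarity meets the threshold."""
    signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(signatures), 2)
        if estimated_jaccard(signatures[a], signatures[b]) >= threshold
    ]


if __name__ == "__main__":
    corpus = {
        "a": "GPU-accelerated data curation for LLM training.",
        "b": "  gpu-accelerated   DATA curation for llm training. ",
        "c": "Elevation profiles for hiking routes rendered on an interactive map.",
    }
    print(find_near_duplicates(corpus))
```

This sketch compares every pair of signatures, which grows quadratically with corpus size. Large-scale pipelines typically bucket signatures with locality-sensitive hashing first so that only likely matches are compared.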

Use Cases

Preparing LLM training data from web scrapes
Curating multi-modal datasets
Filtering low-quality or toxic content
Scaling data processing across GPU clusters
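
As an illustration of what heuristic filtering of low-quality content can look like, the sketch below scores each document with three simple heuristics and keeps only documents that pass all of them. It is a plain-Python sketch, not NeMo Curator's API: the `Heuristic` class, the three scoring functions, and every threshold are assumptions chosen for this example, and a real pipeline would tune them per dataset.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Heuristic:
    """A named score function plus a predicate that decides whether to keep the document."""
    name: str
    score: Callable[[str], float]
    keep: Callable[[float], bool]


def word_count(text: str) -> float:
    """Number of whitespace-separated tokens."""
    return float(len(text.split()))


def non_alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 1.0
    return 1.0 - len(set(lines)) / len(lines)


# Hypothetical thresholds, chosen only to make the example concrete.
HEURISTICS = [
    Heuristic("word_count", word_count, lambda v: 5 <= v <= 100_000),
    Heuristic("non_alnum_ratio", non_alnum_ratio, lambda v: v <= 0.25),
    Heuristic("duplicate_line_fraction", duplicate_line_fraction, lambda v: v <= 0.3),
]


def failed_heuristics(text: str, heuristics: list[Heuristic] = HEURISTICS) -> list[str]:
    """Return the names of the heuristics the document fails (empty list means keep)."""
    return [h.name for h in heuristics if not h.keep(h.score(text))]


def filter_documents(docs: list[str]) -> list[str]:
    """Keep only documents that pass every heuristic."""
    return [doc for doc in docs if not failed_heuristics(doc)]
```

Returning the names of the failed heuristics, rather than a bare boolean, makes it easier to audit why a document was dropped before committing to a threshold.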

Non-Goals

CPU-based data processing
Basic data cleaning without advanced curation features
Use cases outside of LLM training data preparation

Scope

Tool surface size: The skill exposes numerous filters, modules, and classifiers. Although extensive, they are organized within a library structure rather than as a flat list of tools.

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

Then, install the plugin:

/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified

95/100

Analyzed about 22 hours ago

Trust Signals

Last commit: 16 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com

