Nemo Curator

Skill · Verified · Active

GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

Purpose

To enable efficient and high-quality data preparation for LLM training by leveraging GPU acceleration for complex curation tasks.

Features

  • GPU-accelerated data curation
  • Support for text, image, video, and audio
  • Fuzzy deduplication (16× faster)
  • Quality filtering (30+ heuristics)
  • PII redaction and NSFW detection
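
As a rough illustration of what fuzzy deduplication means, the sketch below uses MinHash signatures over character shingles in plain Python. It is not the Nemo Curator API and is not GPU-accelerated; the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `fuzzy_dedup`), the shingle size, the number of hash functions, and the threshold are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def minhash_signature(text, num_perm=64, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions; approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs, threshold=0.8, num_perm=64):
    """Keep the first document of each near-duplicate group (illustrative O(n^2) scan)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_perm)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

A production pipeline would typically replace the pairwise scan with locality-sensitive hashing so that only candidate pairs are compared; the quadratic loop here is kept only for readability.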

Use cases

  • Preparing LLM training data from web scrapes
  • Curating multi-modal datasets
  • Filtering low-quality or toxic content
  • Scaling data processing across GPU clusters
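
To make "filtering low-quality content" concrete, here is a minimal sketch of heuristic quality filtering in plain Python. The three heuristics, their names, and their thresholds are assumptions made for illustration; they are not the heuristics shipped with Nemo Curator.

```python
def word_count_ok(text, min_words=5, max_words=100_000):
    """Reject documents that are too short or implausibly long."""
    n = len(text.split())
    return min_words <= n <= max_words


def alpha_ratio_ok(text, min_ratio=0.6):
    """Require that most non-whitespace characters are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(c.isalpha() for c in chars) / len(chars) >= min_ratio


def repeated_lines_ok(text, max_fraction=0.3):
    """Reject documents where many non-empty lines are exact repeats."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    duplicates = len(lines) - len(set(lines))
    return duplicates / len(lines) <= max_fraction


# Hypothetical filter chain: a document must pass every heuristic.
HEURISTICS = [word_count_ok, alpha_ratio_ok, repeated_lines_ok]


def passes_quality(text):
    return all(check(text) for check in HEURISTICS)


def filter_corpus(docs):
    """Return only the documents that pass all heuristics."""
    return [doc for doc in docs if passes_quality(doc)]
```

Keeping each heuristic as a small function with its own threshold makes it easy to add, remove, or tune checks per dataset.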

Non-goals

  • CPU-based data processing
  • Basic data cleaning without advanced curation features
  • Use cases outside of LLM training data preparation

Scope

  • Tool surface size (info): The skill exposes numerous filters, modules, and classifiers. They are organized within a library structure rather than as a flat list of tools.

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified
95/100
Analyzed 1 day ago

Trust signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar extensions

Nemo Curator

98

GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

Skill
davila7

Create Spatial Visualization

100

Create interactive maps, elevation profiles, and spatial visualizations from GPX tracks, waypoints, or route data using R (sf, leaflet, tmap) or Observable (D3, deck.gl). Covers data import, coordinate system handling, map styling, and export to HTML or image formats. Use when visualizing a planned or completed tour route on an interactive map, creating elevation profiles for hiking or cycling routes, overlaying waypoints and POIs on a basemap, or building a web-based trip dashboard.

Skill
pjt222

PyTDC (Therapeutics Data Commons)

99

Therapeutics Data Commons. AI-ready drug discovery datasets (ADME, toxicity, DTI), benchmarks, scaffold splits, molecular oracles, for therapeutic ML and pharmacological prediction.

Skill
K-Dense-AI

Pysam

99

Genomic file toolkit. Read/write SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences, extract regions, calculate coverage, for NGS data processing pipelines.

Skill
K-Dense-AI

Polars Bio

99

High-performance genomic interval operations and bioinformatics file I/O on Polars DataFrames. Overlap, nearest, merge, coverage, complement, subtract for BED/VCF/BAM/GFF intervals. Streaming, cloud-native, faster bioframe alternative.

Skill
K-Dense-AI

Polars

99

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

Skill
K-Dense-AI