HuggingFace Tokenizers
Fast tokenizers optimized for research and production. The Rust-based implementation tokenizes 1 GB of text in under 20 seconds. Supports the BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track token-to-text alignments, and handle padding and truncation. Integrates seamlessly with HuggingFace Transformers. Use when you need high-performance tokenization or custom tokenizer training.
To provide extremely fast and efficient tokenization for research and production NLP tasks, including custom tokenizer training and handling large datasets.
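As a quick orientation, here is a minimal sketch of loading a pretrained tokenizer and enabling the built-in padding and truncation handling; the model name bert-base-uncased and the pad settings are illustrative assumptions, not part of this skill's spec:

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer definition from the Hugging Face Hub
# ("bert-base-uncased" is an illustrative choice).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Built-in padding/truncation handling, as described above.
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)

encoding = tokenizer.encode("Hello, world!")
print(encoding.tokens)  # e.g. ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.ids)
```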
Features
- Rust-based implementation for high performance (<20s per GB)
- Supports BPE, WordPiece, and Unigram algorithms
- Enables training custom vocabularies
- Provides alignment tracking from each token back to its original text position (see the sketch after this list)
- Seamless integration with HuggingFace Transformers
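To illustrate alignment tracking, a minimal sketch using the character offsets carried by every Encoding; the pretrained model name is again just a placeholder:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # illustrative model

text = "Tokenizers are fast."
encoding = tokenizer.encode(text)

# Each token carries its (start, end) character span in the original text;
# special tokens like [CLS]/[SEP] report the empty span (0, 0).
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r} -> text[{start}:{end}] = {text[start:end]!r}")
```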
Use cases
- When you need high-performance tokenization for large corpora
- When you need to train custom tokenizers with specific algorithms
- When building production NLP pipelines requiring speed and efficiency (see the batch-encoding sketch after this list)
- When tracking token alignments back to original text positions is necessary
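For throughput-sensitive pipelines, batch encoding keeps the work inside the Rust core and amortizes per-call overhead, which helps approach the under-20s-per-GB figure quoted above; a minimal sketch, again with a placeholder pretrained model:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # illustrative model

# encode_batch tokenizes all inputs in a single call to the Rust core.
docs = ["First document.", "Second, slightly longer document."]
encodings = tokenizer.encode_batch(docs)

for enc in encodings:
    print(enc.tokens)
```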
Non-goals
- Performing the entire NLP model training pipeline
- Replacing the core functionality of HuggingFace Transformers models
- Providing tokenization for unsupported languages without custom training
Workflow
- Initialize tokenizer with a chosen model (BPE, WordPiece, Unigram).
- Configure trainer with desired vocabulary size, special tokens, and frequency thresholds.
- Prepare training data as files, a list, or an iterator.
- Train the tokenizer on the data using the configured trainer.
- Add post-processing (e.g., special tokens) and save the tokenizer, as in the end-to-end sketch below.
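Putting the five steps together, a minimal end-to-end sketch with a BPE model; the vocabulary size, frequency threshold, special tokens, and the corpus.txt path are illustrative assumptions:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# 1. Initialize tokenizer with a chosen model (BPE here).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Configure trainer: vocabulary size, special tokens, frequency threshold.
trainer = BpeTrainer(
    vocab_size=30000,  # illustrative
    min_frequency=2,   # illustrative
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# 3./4. Prepare training data as files and train (corpus.txt is a placeholder).
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# 5. Post-process (wrap sequences with special tokens) and save.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```

To train from an in-memory list or iterator instead of files (step 3), tokenizer.train_from_iterator(my_iterator, trainer=trainer) covers the same step.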
Installation
Add the marketplace first:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the skill:
/plugin install AI-Research-SKILLs@ai-research-skills
Similar extensions
TimesFM Forecasting
Quality score: 100. Zero-shot time series forecasting with Google's TimesFM foundation model. Use for any univariate time series (sales, sensors, energy, vitals, weather) without training a custom model. Supports CSV/DataFrame/array inputs with point forecasts and prediction intervals. Includes a preflight system checker script to verify RAM/GPU before first use.
Cleanup Cycles
Quality score: 100. Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries: "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".
Gtars
Quality score: 99. High-performance toolkit for genomic interval analysis in Rust with Python bindings. Use when working with genomic regions, BED files, coverage tracks, overlap detection, tokenization for ML models, or fragment analysis in computational genomics and machine learning applications.
Transformers.js
Quality score: 99. Use Transformers.js to run state-of-the-art machine learning models directly in JavaScript/TypeScript. Supports NLP (text classification, translation, summarization), computer vision (image classification, object detection), audio (speech recognition, audio classification), and multimodal tasks. Works in browsers and server-side runtimes (Node.js, Bun, Deno) with WebGPU/WASM using pre-trained models from Hugging Face Hub.
Sentencepiece
Quality score: 99. Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
Transformers
Quality score: 98. This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.