
HuggingFace Tokenizers

Skill · Verified · Active

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Purpose

To provide extremely fast and efficient tokenization for research and production NLP tasks, including custom tokenizer training and handling large datasets.

Features

  • Rust-based implementation for high performance (<20s per GB)
  • Supports BPE, WordPiece, and Unigram algorithms
  • Enables training custom vocabularies
  • Provides alignment tracking (token to text position)
  • Seamless integration with HuggingFace Transformers
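The alignment tracking listed above can be sketched with the `tokenizers` library's offset mapping; the tiny WordLevel vocabulary here is an illustrative stand-in for a real trained vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Minimal hand-built vocabulary, just to make the example self-contained.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

enc = tok.encode("hello world")
print(enc.tokens)   # ['hello', 'world']
print(enc.offsets)  # [(0, 5), (6, 11)] -> (start, end) character spans in the input
```

Each entry in `offsets` maps a token back to the character span it came from, which is what powers tasks like span labeling and answer extraction.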

Use cases

  • When you need high-performance tokenization for large corpora
  • When you need to train custom tokenizers with specific algorithms
  • When building production NLP pipelines requiring speed and efficiency
  • When tracking token alignments back to original text positions is necessary
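For production pipelines, the padding/truncation handling mentioned in the overview can be sketched as follows; the toy vocabulary and batch are illustrative, not part of any real model.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "[PAD]": 1, "a": 2, "b": 3, "c": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Pad every sequence in a batch to the longest one; cap lengths at 4 tokens.
tok.enable_padding(pad_id=1, pad_token="[PAD]")
tok.enable_truncation(max_length=4)

batch = tok.encode_batch(["a b", "a b c a b"])
print([e.tokens for e in batch])
# [['a', 'b', '[PAD]', '[PAD]'], ['a', 'b', 'c', 'a']]
print(batch[0].attention_mask)  # [1, 1, 0, 0]
```

`encode_batch` applies padding and truncation in one pass and also produces the attention masks that downstream transformer models expect.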

Non-goals

  • Performing the entire NLP model training pipeline
  • Replacing the core functionality of HuggingFace Transformers models
  • Providing tokenization for unsupported languages without custom training

Workflow

  1. Initialize tokenizer with a chosen model (BPE, WordPiece, Unigram).
  2. Configure trainer with desired vocabulary size, special tokens, and frequency thresholds.
  3. Prepare training data as files, a list, or an iterator.
  4. Train the tokenizer on the data using the configured trainer.
  5. Add post-processing (e.g., special tokens) and save the tokenizer.
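The five steps above can be sketched end to end with the `tokenizers` library; the corpus, vocabulary size, and file path are illustrative placeholders.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# 1. Initialize the tokenizer with a chosen model (BPE here).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Configure the trainer: vocabulary size, special tokens, frequency threshold.
trainer = BpeTrainer(
    vocab_size=200,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"],
)

# 3.-4. Prepare data and train. An in-memory iterator is used here;
# tokenizer.train([...paths...], trainer) works the same way on files.
corpus = ["hello world", "hello tokenizers", "fast tokenizers in rust"] * 10
tokenizer.train_from_iterator(corpus, trainer)

# 5. Add post-processing (wrap sequences in [CLS] ... [SEP]) and save.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
path = os.path.join(tempfile.gettempdir(), "tokenizer.json")
tokenizer.save(path)

enc = tokenizer.encode("hello world")
print(enc.tokens[0], enc.tokens[-1])  # [CLS] [SEP]
```

The saved `tokenizer.json` is a single self-describing file that can be reloaded with `Tokenizer.from_file(path)` or consumed directly by `transformers` fast tokenizers.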

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified
98/100
Analyzed 1 day ago

Trust signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source

Similar extensions

TimesFM Forecasting

100

Zero-shot time series forecasting with Google's TimesFM foundation model. Use for any univariate time series (sales, sensors, energy, vitals, weather) without training a custom model. Supports CSV/DataFrame/array inputs with point forecasts and prediction intervals. Includes a preflight system checker script to verify RAM/GPU before first use.

Skill
K-Dense-AI

Cleanup Cycles

100

Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".

Skill
raintree-technology

Gtars

99

High-performance toolkit for genomic interval analysis in Rust with Python bindings. Use when working with genomic regions, BED files, coverage tracks, overlap detection, tokenization for ML models, or fragment analysis in computational genomics and machine learning applications.

Skill
K-Dense-AI

Transformers.js

99

Use Transformers.js to run state-of-the-art machine learning models directly in JavaScript/TypeScript. Supports NLP (text classification, translation, summarization), computer vision (image classification, object detection), audio (speech recognition, audio classification), and multimodal tasks. Works in browsers and server-side runtimes (Node.js, Bun, Deno) with WebGPU/WASM using pre-trained models from Hugging Face Hub.

Skill
huggingface

Sentencepiece

99

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
davila7

Transformers

98

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

Skill
K-Dense-AI