
HuggingFace Tokenizers

Skill Verified Active

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Purpose

To provide extremely fast and efficient tokenization for research and production NLP tasks, including custom tokenizer training and handling large datasets.

Features

  • Rust-based implementation for high performance (<20s per GB)
  • Supports BPE, WordPiece, and Unigram algorithms
  • Enables training custom vocabularies
  • Provides alignment tracking (token to text position)
  • Seamless integration with HuggingFace Transformers
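As a brief sketch of the alignment-tracking feature above, the library's pre-tokenizers report the character span of every token, so each token can be mapped back to its exact position in the source text (the example string here is arbitrary):

```python
from tokenizers.pre_tokenizers import Whitespace

# Pre-tokenize a string; each piece comes back with its
# (start, end) character offsets in the original text.
text = "Hello, world!"
pieces = Whitespace().pre_tokenize_str(text)
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]

# The offsets recover the exact source substring for every token.
for token, (start, end) in pieces:
    assert text[start:end] == token
```

The same offsets are available on full `Encoding` objects via `encoding.offsets` after tokenizing with a trained tokenizer.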

Use Cases

  • When you need high-performance tokenization for large corpora
  • When you need to train custom tokenizers with specific algorithms
  • When building production NLP pipelines requiring speed and efficiency
  • When tracking token alignments back to original text positions is necessary

Non-Goals

  • Performing the entire NLP model training pipeline
  • Replacing the core functionality of HuggingFace Transformers models
  • Providing tokenization for unsupported languages without custom training

Workflow

  1. Initialize tokenizer with a chosen model (BPE, WordPiece, Unigram).
  2. Configure trainer with desired vocabulary size, special tokens, and frequency thresholds.
  3. Prepare training data as files, a list, or an iterator.
  4. Train the tokenizer on the data using the configured trainer.
  5. Add post-processing (e.g., special tokens) and save the tokenizer.
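The five steps above can be sketched with the `tokenizers` Python API; the in-memory corpus and the output filename are illustrative placeholders, not part of the skill itself:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# 1. Initialize the tokenizer with a chosen model (here: BPE).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Configure the trainer: vocabulary size, special tokens, frequency threshold.
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]"],
)

# 3./4. Prepare data and train; an iterator is used here,
# but tokenizer.train() accepts file paths as well.
corpus = ["hello world", "hello tokenizers", "fast tokenization in rust"]
tokenizer.train_from_iterator(corpus, trainer)

# 5. Add post-processing that wraps sequences in special tokens, then save.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("my-tokenizer.json")  # illustrative output path

encoding = tokenizer.encode("hello world")
print(encoding.tokens)  # begins with [CLS] and ends with [SEP]
```

`BpeTrainer` can be swapped for `WordPieceTrainer` or `UnigramTrainer` (with the matching model class) without changing the rest of the workflow.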

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Related Extensions

TimesFM Forecasting

100

Zero-shot time series forecasting with Google's TimesFM foundation model. Use for any univariate time series (sales, sensors, energy, vitals, weather) without training a custom model. Supports CSV/DataFrame/array inputs with point forecasts and prediction intervals. Includes a preflight system checker script to verify RAM/GPU before first use.

Skill
K-Dense-AI

Cleanup Cycles

100

Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".

Skill
raintree-technology

Gtars

99

High-performance toolkit for genomic interval analysis in Rust with Python bindings. Use when working with genomic regions, BED files, coverage tracks, overlap detection, tokenization for ML models, or fragment analysis in computational genomics and machine learning applications.

Skill
K-Dense-AI

Transformers.js

99

Use Transformers.js to run state-of-the-art machine learning models directly in JavaScript/TypeScript. Supports NLP (text classification, translation, summarization), computer vision (image classification, object detection), audio (speech recognition, audio classification), and multimodal tasks. Works in browsers and server-side runtimes (Node.js, Bun, Deno) with WebGPU/WASM using pre-trained models from Hugging Face Hub.

Skill
huggingface

Sentencepiece

99

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
davila7

Transformers

98

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

Skill
K-Dense-AI