
HuggingFace Tokenizers

Skill Verified Active

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Purpose

To provide extremely fast and efficient tokenization for research and production NLP tasks, including custom tokenizer training and handling large datasets.

Features

  • Rust-based implementation for high performance (<20s per GB)
  • Supports BPE, WordPiece, and Unigram algorithms
  • Enables training custom vocabularies
  • Provides alignment tracking (token to text position)
  • Seamless integration with HuggingFace Transformers
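As a brief sketch of the alignment-tracking feature above, the library's pre-tokenizers report the character span of every token, so each token can be mapped back to its exact position in the source text (the example string here is arbitrary):

```python
from tokenizers.pre_tokenizers import Whitespace

# Pre-tokenize a string; each piece comes back with its
# (start, end) character offsets in the original text.
text = "Hello, world!"
pieces = Whitespace().pre_tokenize_str(text)
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]

# The offsets recover the exact source substring for every token.
for token, (start, end) in pieces:
    assert text[start:end] == token
```

The same offsets are available on full `Encoding` objects via `encoding.offsets` after tokenizing with a trained tokenizer.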

Use Cases

  • When you need high-performance tokenization for large corpora
  • When you need to train custom tokenizers with specific algorithms
  • When building production NLP pipelines requiring speed and efficiency
  • When tracking token alignments back to original text positions is necessary

Non-Goals

  • Performing the entire NLP model training pipeline
  • Replacing the core functionality of HuggingFace Transformers models
  • Providing tokenization for unsupported languages without custom training

Workflow

  1. Initialize tokenizer with a chosen model (BPE, WordPiece, Unigram).
  2. Configure trainer with desired vocabulary size, special tokens, and frequency thresholds.
  3. Prepare training data as files, a list, or an iterator.
  4. Train the tokenizer on the data using the configured trainer.
  5. Add post-processing (e.g., special tokens) and save the tokenizer.
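The five steps above can be sketched with the `tokenizers` Python API; the in-memory corpus and the output filename are illustrative placeholders, not part of the skill itself:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# 1. Initialize the tokenizer with a chosen model (here: BPE).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Configure the trainer: vocabulary size, special tokens, frequency threshold.
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]"],
)

# 3./4. Prepare data and train; an iterator is used here,
# but tokenizer.train() accepts file paths as well.
corpus = ["hello world", "hello tokenizers", "fast tokenization in rust"]
tokenizer.train_from_iterator(corpus, trainer)

# 5. Add post-processing that wraps sequences in special tokens, then save.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("my-tokenizer.json")  # illustrative output path

encoding = tokenizer.encode("hello world")
print(encoding.tokens)  # begins with [CLS] and ends with [SEP]
```

`BpeTrainer` can be swapped for `WordPieceTrainer` or `UnigramTrainer` (with the matching model class) without changing the rest of the workflow.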

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Related Extensions

TimesFM Forecasting

100

Zero-shot time series forecasting with Google's TimesFM foundation model. Use for any univariate time series (sales, sensors, energy, vitals, weather) without training a custom model. Supports CSV/DataFrame/array inputs with point forecasts and prediction intervals. Includes a preflight system checker script to verify RAM/GPU before first use.

Skill
K-Dense-AI

Cleanup Cycles

100

Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".

Skill
raintree-technology

Gtars

99

High-performance toolkit for genomic interval analysis in Rust with Python bindings. Use when working with genomic regions, BED files, coverage tracks, overlap detection, tokenization for ML models, or fragment analysis in computational genomics and machine learning applications.

Skill
K-Dense-AI

Transformers.js

99

Use Transformers.js to run state-of-the-art machine learning models directly in JavaScript/TypeScript. Supports NLP (text classification, translation, summarization), computer vision (image classification, object detection), audio (speech recognition, audio classification), and multimodal tasks. Works in browsers and server-side runtimes (Node.js, Bun, Deno) with WebGPU/WASM using pre-trained models from Hugging Face Hub.

Skill
huggingface

Sentencepiece

99

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
davila7

Transformers

98

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

Skill
K-Dense-AI