HuggingFace Tokenizers
Fast tokenizers optimized for research and production. The Rust-based implementation tokenizes 1 GB of text in under 20 seconds. Supports the BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track token-to-text alignments, and handle padding and truncation. Integrates seamlessly with HuggingFace Transformers. Use when you need high-performance tokenization or custom tokenizer training.
To provide extremely fast and efficient tokenization for research and production NLP tasks, including custom tokenizer training and handling large datasets.
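As a quick orientation, here is a minimal sketch of loading a pretrained tokenizer and enabling the built-in padding and truncation handling; the model name bert-base-uncased and the pad settings are illustrative assumptions, not part of this skill's spec:

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer definition from the Hugging Face Hub
# ("bert-base-uncased" is an illustrative choice).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Built-in padding/truncation handling, as described above.
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)

encoding = tokenizer.encode("Hello, world!")
print(encoding.tokens)  # e.g. ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.ids)
```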
Features
- Rust-based implementation for high performance (<20s per GB)
- Supports BPE, WordPiece, and Unigram algorithms
- Enables training custom vocabularies
- Provides alignment tracking from each token back to its original text position (see the sketch after this list)
- Seamless integration with HuggingFace Transformers
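To illustrate alignment tracking, a minimal sketch using the character offsets carried by every Encoding; the pretrained model name is again just a placeholder:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # illustrative model

text = "Tokenizers are fast."
encoding = tokenizer.encode(text)

# Each token carries its (start, end) character span in the original text;
# special tokens like [CLS]/[SEP] report the empty span (0, 0).
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r} -> text[{start}:{end}] = {text[start:end]!r}")
```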
Use cases
- When you need high-performance tokenization for large corpora
- When you need to train custom tokenizers with specific algorithms
- When building production NLP pipelines requiring speed and efficiency (see the batch-encoding sketch after this list)
- When tracking token alignments back to original text positions is necessary
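For throughput-sensitive pipelines, batch encoding keeps the work inside the Rust core and amortizes per-call overhead, which helps approach the under-20s-per-GB figure quoted above; a minimal sketch, again with a placeholder pretrained model:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # illustrative model

# encode_batch tokenizes all inputs in a single call to the Rust core.
docs = ["First document.", "Second, slightly longer document."]
encodings = tokenizer.encode_batch(docs)

for enc in encodings:
    print(enc.tokens)
```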
Non-goals
- Performing the entire NLP model training pipeline
- Replacing the core functionality of HuggingFace Transformers models
- Providing tokenization for unsupported languages without custom training
Workflow
- Initialize tokenizer with a chosen model (BPE, WordPiece, Unigram).
- Configure trainer with desired vocabulary size, special tokens, and frequency thresholds.
- Prepare training data as files, a list, or an iterator.
- Train the tokenizer on the data using the configured trainer.
- Add post-processing (e.g., special tokens) and save the tokenizer, as in the end-to-end sketch below.
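Putting the five steps together, a minimal end-to-end sketch with a BPE model; the vocabulary size, frequency threshold, special tokens, and the corpus.txt path are illustrative assumptions:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# 1. Initialize tokenizer with a chosen model (BPE here).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Configure trainer: vocabulary size, special tokens, frequency threshold.
trainer = BpeTrainer(
    vocab_size=30000,  # illustrative
    min_frequency=2,   # illustrative
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# 3./4. Prepare training data as files and train (corpus.txt is a placeholder).
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# 5. Post-process (wrap sequences with special tokens) and save.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```

To train from an in-memory list or iterator instead of files (step 3), tokenizer.train_from_iterator(my_iterator, trainer=trainer) covers the same step.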
Installation
Add the marketplace first:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the skill:
/plugin install AI-Research-SKILLs@ai-research-skills
Similar extensions
TimesFM Forecasting
Quality score: 100. Zero-shot time series forecasting with Google's TimesFM foundation model. Use for any univariate time series (sales, sensors, energy, vitals, weather) without training a custom model. Supports CSV/DataFrame/array inputs with point forecasts and prediction intervals. Includes a preflight system checker script to verify RAM/GPU before first use.
Cleanup Cycles
Quality score: 100. Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries: "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".
Gtars
Quality score: 99. High-performance toolkit for genomic interval analysis in Rust with Python bindings. Use when working with genomic regions, BED files, coverage tracks, overlap detection, tokenization for ML models, or fragment analysis in computational genomics and machine learning applications.
Transformers.js
Quality score: 99. Use Transformers.js to run state-of-the-art machine learning models directly in JavaScript/TypeScript. Supports NLP (text classification, translation, summarization), computer vision (image classification, object detection), audio (speech recognition, audio classification), and multimodal tasks. Works in browsers and server-side runtimes (Node.js, Bun, Deno) with WebGPU/WASM using pre-trained models from Hugging Face Hub.
Sentencepiece
Quality score: 99. Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
Transformers
Quality score: 98. This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.