HuggingFace Tokenizers
Skill · Verified · Active
Fast tokenizers optimized for research and production. The Rust-based implementation tokenizes 1GB of text in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track token-to-text alignments, and handle padding/truncation. Integrates seamlessly with the transformers library. Use when you need high-performance tokenization or custom tokenizer training.
Offers exceptionally fast and versatile tokenization for natural language processing tasks, enabling efficient processing of large text corpora and custom tokenizer development.
Features
- High-performance tokenization (<20s per GB)
- Support for BPE, WordPiece, Unigram algorithms
- Custom vocabulary training
- Token alignment tracking (offsets back into the original text; see the sketch after this list)
- Seamless integration with Transformers library
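A minimal sketch of alignment tracking. It assumes network access to the Hugging Face Hub; the model name "bert-base-uncased" is used purely for illustration, and any tokenizer with a tokenizer.json on the Hub would do.

```python
from tokenizers import Tokenizer

# Load a fast tokenizer from the Hub ("bert-base-uncased" is illustrative).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers map tokens back to character spans."
encoding = tokenizer.encode(text)

# Each token carries a (start, end) character offset into the original text,
# so every token can be aligned with the exact source span it came from.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r} -> text[{start}:{end}] = {text[start:end]!r}")
```

Special tokens added by post-processing (e.g. [CLS], [SEP]) report the empty offset (0, 0), which makes them easy to filter out when mapping back to the source.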
Use Cases
- When you need extremely fast tokenization for large datasets.
- When training custom tokenizers from scratch for specific domains.
- When precise alignment tracking between tokens and original text is required.
- When building production NLP pipelines requiring efficient text preprocessing.
Non-Goals
- Replacing core NLP model architectures.
- Performing tasks beyond tokenization and vocabulary management.
- Providing a graphical user interface for training.
Workflow
- Prepare training data (files, list, or iterator).
- Initialize tokenizer with a chosen algorithm (BPE, WordPiece, Unigram).
- Configure a trainer with vocabulary size, minimum token frequency, and special tokens.
- Train the tokenizer on the prepared data.
- Add post-processing steps (e.g., special tokens).
- Save the tokenizer and optionally convert it for Transformers integration (the sketch below walks through all of these steps).
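The steps above map onto the library's API roughly as follows. This is a sketch, not a definitive recipe: the corpus path (corpus.txt), the vocabulary size, and the special-token choices are placeholder assumptions to adapt to your data.

```python
# pip install tokenizers transformers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import TemplateProcessing

# Steps 1-2: initialize a tokenizer with the BPE algorithm.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Step 3: configure the trainer (vocab size and min_frequency are placeholders).
trainer = BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# Step 4: train on the prepared data ("corpus.txt" is a placeholder path).
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Step 5: post-processing so encodings carry [CLS]/[SEP] automatically.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Step 6: save, and optionally wrap for use with the transformers library.
tokenizer.save("tokenizer.json")

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

If the corpus comes as a Python iterator rather than files, `tokenizer.train_from_iterator(iterator, trainer=trainer)` covers that case.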
Practices
- Algorithm selection
- Pipeline configuration (normalization, padding, truncation; see the sketch after this list)
- Custom tokenizer training
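As one example of pipeline configuration, a saved tokenizer can be given a normalizer plus padding and truncation behavior. The max length and pad token below are illustrative assumptions; note that normalization is normally configured before training, and is shown here alongside padding/truncation only to illustrate the pipeline attributes.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Assumes a tokenizer previously saved as in the workflow sketch above.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Normalization: Unicode-decompose, lowercase, strip accents.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Pad batches to the longest sequence ("[PAD]" must exist in the vocabulary)
# and cap sequences at 512 tokens (an illustrative limit).
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)

batch = tokenizer.encode_batch(["short text", "a somewhat longer piece of text"])
print([enc.tokens for enc in batch])
```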
Prerequisites
- Python 3.7+
- pip package manager
- tokenizers library
- transformers library (for integration)
Trust
- Issues (attention): In the last 90 days, 17 issues were opened and 4 were closed. This is a closure rate below 50% with a moderate number of open issues, suggesting slower response times.
Execution
- Pinned dependencies (info): Dependencies are listed but not explicitly pinned with lockfiles in the provided context, though standard Python installation methods are implied.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Similar Extensions
Sentencepiece (score 99)
Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
Cleanup Cycles (score 100)
Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".
Transformers (score 98)
This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.
Hf Cli (score 100)
Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing models, datasets, spaces, buckets, repos, papers, jobs, and more on the Hugging Face Hub. Use when: handling authentication; managing local cache; managing Hugging Face Buckets; running or scheduling jobs on Hugging Face infrastructure; managing Hugging Face repos; discussions and pull requests; browsing models, datasets and spaces; reading, searching, or browsing academic papers; managing collections; querying datasets; configuring spaces; setting up webhooks; or deploying and managing HF Inference Endpoints. Make sure to use this skill whenever the user mentions 'hf', 'huggingface', 'Hugging Face', 'huggingface-cli', or 'hugging face cli', or wants to do anything related to the Hugging Face ecosystem and to AI and ML in general. Also use for cloud storage needs like training checkpoints, data pipelines, or agent traces. Use even if the user doesn't explicitly ask for a CLI command. Replaces the deprecated `huggingface-cli`.