
HuggingFace Tokenizers


Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Purpose

To offer exceptionally fast and versatile tokenization solutions for natural language processing tasks, enabling efficient processing of large text corpora and custom tokenizer development.

Features

  • High-performance tokenization (<20s per GB)
  • Support for BPE, WordPiece, Unigram algorithms
  • Custom vocabulary training
  • Token alignment tracking
  • Seamless integration with Transformers library
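To make the BPE algorithm listed above concrete, here is a dependency-free, pure-Python sketch of a single BPE merge step on a toy corpus. This is only an illustration of the pair-count-and-merge idea, not the library's Rust implementation; the corpus and function names are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences
    and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)   # "low" becomes ('lo', 'w'), etc.
```

A real BPE trainer simply repeats this merge step until the target vocabulary size is reached, recording the merge order.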

Use Cases

  • When you need extremely fast tokenization for large datasets.
  • When training custom tokenizers from scratch for specific domains.
  • When precise alignment tracking between tokens and original text is required.
  • When building production NLP pipelines requiring efficient text preprocessing.

Non-Goals

  • Replacing core NLP model architectures.
  • Performing tasks beyond tokenization and vocabulary management.
  • Providing a graphical user interface for training.

Workflow

  1. Prepare training data (files, list, or iterator).
  2. Initialize tokenizer with a chosen algorithm (BPE, WordPiece, Unigram).
  3. Configure a trainer with vocabulary size, minimum token frequency, and special tokens.
  4. Train the tokenizer on the prepared data.
  5. Add post-processing steps (e.g., special tokens).
  6. Save the tokenizer and optionally convert for Transformers integration.
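The steps above can be sketched with the `tokenizers` library. The corpus, vocabulary size, and file name below are illustrative placeholders, and loading the result into Transformers via `PreTrainedTokenizerFast` assumes the `transformers` package is installed.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# 2. Initialize the tokenizer with a chosen algorithm (BPE here)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 3. Configure the trainer
trainer = BpeTrainer(
    vocab_size=1000,      # illustrative; pick per domain
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# 1 & 4. Train from an in-memory iterator (tokenizer.train(files, trainer)
# works the same way for a list of file paths)
corpus = ["hello world", "hello tokenizers", "fast tokenization"] * 100
tokenizer.train_from_iterator(corpus, trainer)

# 6. Save; for Transformers integration, wrap with
# transformers.PreTrainedTokenizerFast(tokenizer_object=tokenizer)
tokenizer.save("tokenizer.json")
```

Step 5 (post-processing, e.g. adding `[CLS]`/`[SEP]` around sequences) is typically configured via `tokenizers.processors.TemplateProcessing` before saving.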

Practices

  • Algorithm selection
  • Pipeline configuration
  • Custom tokenizer training

Prerequisites

  • Python 3.7+
  • pip package manager
  • tokenizers library
  • transformers library (for integration)
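Assuming a working Python 3.7+ environment, the library prerequisites above can be installed with pip:

```shell
# Install tokenizers, plus transformers for the optional integration
python3 -m pip install tokenizers transformers
```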

Trust

  • Issues attention: In the last 90 days, 17 issues were opened and 4 were closed. This indicates a closure rate below 50% with a moderate number of open issues, suggesting slower response times.

Execution

  • Pinned dependencies: Dependencies are listed but not explicitly pinned with lockfiles in the provided context, though standard Python installation methods are implied.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx — needs Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar Extensions

Sentencepiece

99

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
davila7

Sentencepiece

98

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
Orchestra-Research

HuggingFace Tokenizers

98

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
Orchestra-Research

Cleanup Cycles

100

Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".

Skill
raintree-technology

Transformers

98

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

Skill
K-Dense-AI

Hf Cli

100

Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing models, datasets, spaces, buckets, repos, papers, jobs, and more on the Hugging Face Hub. Use when: handling authentication; managing local cache; managing Hugging Face Buckets; running or scheduling jobs on Hugging Face infrastructure; managing Hugging Face repos; discussions and pull requests; browsing models, datasets and spaces; reading, searching, or browsing academic papers; managing collections; querying datasets; configuring spaces; setting up webhooks; or deploying and managing HF Inference Endpoints. Make sure to use this skill whenever the user mentions 'hf', 'huggingface', 'Hugging Face', 'huggingface-cli', or 'hugging face cli', or wants to do anything related to the Hugging Face ecosystem and to AI and ML in general. Also use for cloud storage needs like training checkpoints, data pipelines, or agent traces. Use even if the user doesn't explicitly ask for a CLI command. Replaces the deprecated `huggingface-cli`.

Skill
huggingface

© 2025 SkillRepo · Find the right skill, skip the noise.