
HuggingFace Tokenizers


Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Purpose

To offer exceptionally fast and versatile tokenization solutions for natural language processing tasks, enabling efficient processing of large text corpora and custom tokenizer development.

Features

  • High-performance tokenization (<20s per GB)
  • Support for BPE, WordPiece, Unigram algorithms
  • Custom vocabulary training
  • Token alignment tracking
  • Seamless integration with Transformers library
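To make the BPE algorithm listed above concrete, here is a dependency-free, pure-Python sketch of a single BPE merge step on a toy corpus. This is only an illustration of the pair-count-and-merge idea, not the library's Rust implementation; the corpus and function names are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences
    and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)   # "low" becomes ('lo', 'w'), etc.
```

A real BPE trainer simply repeats this merge step until the target vocabulary size is reached, recording the merge order.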

Use Cases

  • When you need extremely fast tokenization for large datasets.
  • When training custom tokenizers from scratch for specific domains.
  • When precise alignment tracking between tokens and original text is required.
  • When building production NLP pipelines requiring efficient text preprocessing.

Non-Goals

  • Replacing core NLP model architectures.
  • Performing tasks beyond tokenization and vocabulary management.
  • Providing a graphical user interface for training.

Workflow

  1. Prepare training data (files, list, or iterator).
  2. Initialize tokenizer with a chosen algorithm (BPE, WordPiece, Unigram).
  3. Configure a trainer with vocabulary size, minimum token frequency, and special tokens.
  4. Train the tokenizer on the prepared data.
  5. Add post-processing steps (e.g., special tokens).
  6. Save the tokenizer and optionally convert for Transformers integration.
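The steps above can be sketched with the `tokenizers` library. The corpus, vocabulary size, and file name below are illustrative placeholders, and loading the result into Transformers via `PreTrainedTokenizerFast` assumes the `transformers` package is installed.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# 2. Initialize the tokenizer with a chosen algorithm (BPE here)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 3. Configure the trainer
trainer = BpeTrainer(
    vocab_size=1000,      # illustrative; pick per domain
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# 1 & 4. Train from an in-memory iterator (tokenizer.train(files, trainer)
# works the same way for a list of file paths)
corpus = ["hello world", "hello tokenizers", "fast tokenization"] * 100
tokenizer.train_from_iterator(corpus, trainer)

# 6. Save; for Transformers integration, wrap with
# transformers.PreTrainedTokenizerFast(tokenizer_object=tokenizer)
tokenizer.save("tokenizer.json")
```

Step 5 (post-processing, e.g. adding `[CLS]`/`[SEP]` around sequences) is typically configured via `tokenizers.processors.TemplateProcessing` before saving.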

Practices

  • Algorithm selection
  • Pipeline configuration
  • Custom tokenizer training

Prerequisites

  • Python 3.7+
  • pip package manager
  • tokenizers library
  • transformers library (for integration)
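Assuming a working Python 3.7+ environment, the library prerequisites above can be installed with pip:

```shell
# Install tokenizers, plus transformers for the optional integration
python3 -m pip install tokenizers transformers
```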

Trust

  • Issues attention: In the last 90 days, 17 issues were opened and 4 were closed. This indicates a closure rate below 50% with a moderate number of open issues, suggesting slower response times.

Execution

  • Pinned dependencies: Dependencies are listed but not explicitly pinned with lockfiles in the provided context, though standard Python installation methods are implied.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx — needs Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar Extensions

Sentencepiece

99

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
davila7

Sentencepiece

98

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
Orchestra-Research

HuggingFace Tokenizers

98

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
Orchestra-Research

Cleanup Cycles

100

Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".

Skill
raintree-technology

Transformers

98

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

Skill
K-Dense-AI

Hf Cli

100

Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing models, datasets, spaces, buckets, repos, papers, jobs, and more on the Hugging Face Hub. Use when: handling authentication; managing local cache; managing Hugging Face Buckets; running or scheduling jobs on Hugging Face infrastructure; managing Hugging Face repos; discussions and pull requests; browsing models, datasets and spaces; reading, searching, or browsing academic papers; managing collections; querying datasets; configuring spaces; setting up webhooks; or deploying and managing HF Inference Endpoints. Make sure to use this skill whenever the user mentions 'hf', 'huggingface', 'Hugging Face', 'huggingface-cli', or 'hugging face cli', or wants to do anything related to the Hugging Face ecosystem and to AI and ML in general. Also use for cloud storage needs like training checkpoints, data pipelines, or agent traces. Use even if the user doesn't explicitly ask for a CLI command. Replaces the deprecated `huggingface-cli`.

Skill
huggingface

© 2025 SkillRepo · Find the right skill, skip the noise.