HuggingFace Tokenizers
Skill · Verified · Active
Fast tokenizers optimized for research and production. The Rust-based implementation tokenizes 1GB of text in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track token-to-text alignments, and handle padding/truncation. Integrates seamlessly with the transformers library. Use when you need high-performance tokenization or custom tokenizer training.
Offers exceptionally fast and versatile tokenization for natural language processing tasks, enabling efficient processing of large text corpora and custom tokenizer development.
Features
- High-performance tokenization (<20s per GB)
- Support for BPE, WordPiece, Unigram algorithms
- Custom vocabulary training
- Token alignment tracking (offsets back into the original text; see the sketch after this list)
- Seamless integration with Transformers library
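A minimal sketch of alignment tracking. It assumes network access to the Hugging Face Hub; the model name "bert-base-uncased" is used purely for illustration, and any tokenizer with a tokenizer.json on the Hub would do.

```python
from tokenizers import Tokenizer

# Load a fast tokenizer from the Hub ("bert-base-uncased" is illustrative).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers map tokens back to character spans."
encoding = tokenizer.encode(text)

# Each token carries a (start, end) character offset into the original text,
# so every token can be aligned with the exact source span it came from.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r} -> text[{start}:{end}] = {text[start:end]!r}")
```

Special tokens added by post-processing (e.g. [CLS], [SEP]) report the empty offset (0, 0), which makes them easy to filter out when mapping back to the source.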
Use Cases
- When you need extremely fast tokenization for large datasets.
- When training custom tokenizers from scratch for specific domains.
- When precise alignment tracking between tokens and original text is required.
- When building production NLP pipelines requiring efficient text preprocessing.
Non-Goals
- Replacing core NLP model architectures.
- Performing tasks beyond tokenization and vocabulary management.
- Providing a graphical user interface for training.
Workflow
- Prepare training data (files, list, or iterator).
- Initialize tokenizer with a chosen algorithm (BPE, WordPiece, Unigram).
- Configure a trainer with vocabulary size, minimum token frequency, and special tokens.
- Train the tokenizer on the prepared data.
- Add post-processing steps (e.g., special tokens).
- Save the tokenizer and optionally convert it for Transformers integration (the sketch below walks through all of these steps).
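The steps above map onto the library's API roughly as follows. This is a sketch, not a definitive recipe: the corpus path (corpus.txt), the vocabulary size, and the special-token choices are placeholder assumptions to adapt to your data.

```python
# pip install tokenizers transformers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import TemplateProcessing

# Steps 1-2: initialize a tokenizer with the BPE algorithm.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Step 3: configure the trainer (vocab size and min_frequency are placeholders).
trainer = BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# Step 4: train on the prepared data ("corpus.txt" is a placeholder path).
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Step 5: post-processing so encodings carry [CLS]/[SEP] automatically.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Step 6: save, and optionally wrap for use with the transformers library.
tokenizer.save("tokenizer.json")

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

If the corpus comes as a Python iterator rather than files, `tokenizer.train_from_iterator(iterator, trainer=trainer)` covers that case.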
Practices
- Algorithm selection
- Pipeline configuration (normalization, padding, truncation; see the sketch after this list)
- Custom tokenizer training
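As one example of pipeline configuration, a saved tokenizer can be given a normalizer plus padding and truncation behavior. The max length and pad token below are illustrative assumptions; note that normalization is normally configured before training, and is shown here alongside padding/truncation only to illustrate the pipeline attributes.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Assumes a tokenizer previously saved as in the workflow sketch above.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Normalization: Unicode-decompose, lowercase, strip accents.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Pad batches to the longest sequence ("[PAD]" must exist in the vocabulary)
# and cap sequences at 512 tokens (an illustrative limit).
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)

batch = tokenizer.encode_batch(["short text", "a somewhat longer piece of text"])
print([enc.tokens for enc in batch])
```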
Prerequisites
- Python 3.7+
- pip package manager
- tokenizers library
- transformers library (for integration)
Trust
- Issues (attention): In the last 90 days, 17 issues were opened and 4 were closed. This is a closure rate below 50% with a moderate number of open issues, suggesting slower response times.
Execution
- Pinned dependencies (info): Dependencies are listed but not explicitly pinned with lockfiles in the provided context, though standard Python installation methods are implied.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Similar Extensions
Sentencepiece (score 99)
Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
Cleanup Cycles (score 100)
Detect and untangle circular dependencies. Runs madge/skott (TS), pycycle (Py), or compiler-only checks (Go/Rust). Auto-fixes leaf-extractable cycles; reports core cycles for human review. Use when the user asks to find circular imports, fix dependency cycles, or untangle module graph. Example queries — "find circular imports", "fix dependency cycles", "untangle our module graph", "why is madge complaining".
Transformers (score 98)
This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.
Hf Cli (score 100)
Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing models, datasets, spaces, buckets, repos, papers, jobs, and more on the Hugging Face Hub. Use when: handling authentication; managing local cache; managing Hugging Face Buckets; running or scheduling jobs on Hugging Face infrastructure; managing Hugging Face repos; discussions and pull requests; browsing models, datasets and spaces; reading, searching, or browsing academic papers; managing collections; querying datasets; configuring spaces; setting up webhooks; or deploying and managing HF Inference Endpoints. Make sure to use this skill whenever the user mentions 'hf', 'huggingface', 'Hugging Face', 'huggingface-cli', or 'hugging face cli', or wants to do anything related to the Hugging Face ecosystem and to AI and ML in general. Also use for cloud storage needs like training checkpoints, data pipelines, or agent traces. Use even if the user doesn't explicitly ask for a CLI command. Replaces the deprecated `huggingface-cli`.