SentencePiece
Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
To provide a robust and efficient language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for various multilingual and specialized language tasks.
Features
- Language-independent tokenization of raw Unicode text
- Support for BPE and Unigram tokenization algorithms
- Fast processing (50k sentences/sec) and lightweight memory usage (~6MB)
- Deterministic vocabulary and reproducible tokenization
- Training and usage examples for multilingual, CJK, and other language needs
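As a toy illustration of the BPE algorithm listed above (a plain-Python sketch of the merge loop, not SentencePiece's actual implementation), byte-pair encoding repeatedly merges the most frequent adjacent symbol pair:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: learn merge rules by repeatedly fusing the most
    frequent adjacent symbol pair across the corpus."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        rewritten = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten[tuple(out)] += freq
        corpus = rewritten
    return merges

merges = bpe_merges(["low", "low", "lower", "newest", "newest"], 3)
```

SentencePiece applies the same idea directly to raw Unicode (treating spaces as an ordinary symbol, `▁`), which is what makes it language-independent.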
Use cases
- Building multilingual NLP models
- Processing CJK languages (Chinese, Japanese, Korean)
- Ensuring reproducible tokenization for research and deployment
- Training models directly on raw text without pre-tokenization
- Lightweight deployment scenarios requiring minimal resources
Non-goals
- Providing a faster tokenizer than SentencePiece itself
- Replacing domain-specific tokenizers for highly specialized tasks
- Handling tasks beyond text tokenization and de-tokenization
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repository follows the agentskills.io format.
Quality score
Verified
Similar extensions
HuggingFace Tokenizers
Score: 95. Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
LinkedIn Humanizer
Score: 100. Scrubs AI tells from any text draft OR audits a finished post against the 2026 heuristic-algorithm checklist. Multi-level rewriter (forensic / strict / aesthetic / all) plus `--mode audit` for a detection-only check with pass/fail scoring covering length, hook, call to action, formatting penalties, and AI vocabulary. Subtools: emoji pattern detection, multi-detector distribution tester (GPTZero, Originality.ai, ZeroGPT, Sapling, Copyleaks), rule explainer. Triggers on "humanize", "de-AI", "check this draft", "review before posting", "is this ready".
Convert Resume to Markdown
Score: 100. Convert a resume PDF to clean markdown for LLM parsing or candidate pipelines.
Sentiment Analyzer
Score: 100. Analyze sentiment in text using ML models. Use when: analyzing customer reviews; processing NPS feedback; monitoring brand mentions; evaluating campaign responses; categorizing support tickets