Sentencepiece

Skill Verified Active

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Purpose

To provide a robust and efficient language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for various multilingual and specialized language tasks.

Features

  • Language-independent tokenization of raw Unicode text
  • Support for BPE and Unigram tokenization algorithms
  • Fast processing (50k sentences/sec) and lightweight memory usage (~6MB)
  • Deterministic vocabulary and reproducible tokenization
  • Training and usage examples for multilingual, CJK, and other language needs
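
To make the BPE and determinism points above concrete, here is a toy sketch of how byte-pair-encoding merges can be learned. It is not SentencePiece's implementation: the function names `merge_pair` and `learn_bpe_merges` are hypothetical, the input is a pre-split word list for brevity, and the tie-break rule (highest count, then lexicographically smallest pair) is an assumption chosen so that repeated runs give the same merge list.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(corpus, num_merges):
    """Return up to `num_merges` merges learned from a list of words (toy BPE)."""
    # Start from single Unicode characters; identical words share one counter entry.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Assumed tie-break: highest count first, then smallest pair, so the
        # result does not depend on dictionary iteration order.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], 2)
print(merges)  # → [('l', 'o'), ('lo', 'w')]
```

Because the learned merge list is a pure function of the corpus and the merge budget, the same training text yields the same vocabulary, which is the property the "deterministic vocabulary" bullet refers to.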

Use cases

  • Building multilingual NLP models
  • Processing CJK languages (Chinese, Japanese, Korean)
  • Ensuring reproducible tokenization for research and deployment
  • Training models directly on raw text without pre-tokenization
  • Lightweight deployment scenarios requiring minimal resources
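
Training on raw text without pre-tokenization works because whitespace is treated as an ordinary symbol. The sketch below illustrates that idea, assuming the meta symbol "▁" (U+2581) that SentencePiece documents for escaping spaces. The helper names `escape_whitespace`, `to_char_pieces`, and `detokenize` are hypothetical, and splitting into single characters stands in for a trained BPE or Unigram model.

```python
from __future__ import annotations

META = "\u2581"  # "▁", the meta symbol used to mark whitespace

def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol so it survives segmentation."""
    return text.replace(" ", META)

def to_char_pieces(text: str) -> list[str]:
    """Stand-in segmenter: one piece per Unicode code point (no trained model)."""
    return list(escape_whitespace(text))

def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces and map the meta symbol back to a space."""
    return "".join(pieces).replace(META, " ")

# The same code path handles space-delimited and unsegmented scripts alike.
samples = ["Hello world.", "こんにちは世界", "안녕하세요 세계"]
for sample in samples:
    pieces = to_char_pieces(sample)
    assert detokenize(pieces) == sample  # round trip restores the input exactly
```

The round trip holds for any input that does not already contain the literal "▁" character. No language-specific word splitter is involved, which is why the same approach applies to Chinese, Japanese, and Korean text.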

Non-goals

  • Providing a faster tokenizer than SentencePiece itself
  • Replacing domain-specific tokenizers for highly specialized tasks
  • Handling tasks beyond text tokenization and de-tokenization

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality score

Verified
99/100
Analyzed 1 day ago

Trust signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar extensions

Sentencepiece

98

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
Orchestra-Research

HuggingFace Tokenizers

95

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
davila7

HuggingFace Tokenizers

98

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
Orchestra-Research

LinkedIn Humanizer

100

Scrubs AI tells from any text draft OR audits a finished post against the 2026 heuristic-algorithm checklist. Multi-tier rewriter (forensic / strict / aesthetic / all) plus `--mode audit` for a detection-only check with pass/fail scoring that covers length, hook, call to action, format penalties, and AI vocabulary. Sub-tools: emoji pattern detection, multi-detector spread tester (GPTZero, Originality.ai, ZeroGPT, Sapling, Copyleaks), rule explainer. Triggers on "humanize", "de-AI", "audit this draft", "check before posting", "is this ready".

Skill
sergebulaev

Convert Resume to Markdown

100

Convert a resume PDF to clean markdown for LLM parsing or candidate pipelines.

Skill
iterationlayer

Sentiment Analyzer

100

Analyze sentiment in text using ML models. Use when: analyzing customer reviews; processing NPS feedback; monitoring brand mentions; evaluating campaign responses; categorizing support tickets

Skill
guia-matthieu