SentencePiece

Language-independent tokenizer that treats text as a raw Unicode sequence. Supports the BPE and Unigram algorithms. Fast (about 50k sentences/sec) and lightweight (about 6 MB memory footprint), with a deterministic vocabulary. Used by T5, ALBERT, XLNet, and mBART. Trains on raw text without pre-tokenization. Use it when you need multilingual support, CJK language handling, or reproducible tokenization.

Purpose

To provide a fast, lightweight, and language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for multilingual and CJK language processing.

Features

  • Language-independent tokenization of raw Unicode text
  • Support for BPE and Unigram tokenization algorithms
  • Fast (about 50k sentences/sec) and lightweight (about 6 MB memory footprint)
  • Deterministic vocabulary for reproducible tokenization
  • Examples for training, encoding, and decoding
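
To make the BPE feature above concrete, here is a minimal, standard-library-only sketch of byte-pair-encoding merges learned directly from raw text. It is a toy illustration, not SentencePiece's implementation: the function names (`learn_bpe`, `apply_merge`, `encode`, `decode`) are hypothetical, and unlike SentencePiece's default behavior this sketch lets merges cross word boundaries. The one convention borrowed from SentencePiece is that whitespace is kept as an ordinary symbol, written as `▁` (U+2581), so decoding is lossless.

```python
from collections import Counter

SPACE = "\u2581"  # "▁", the visible whitespace marker used by SentencePiece


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from raw text (toy version)."""
    # Whitespace is treated as a normal symbol; no pre-tokenization step.
    symbols = list(text.replace(" ", SPACE))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties break on the pair itself so the
        # result is the same on every run.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        if pairs[best] < 2:
            break  # nothing repeats any more, so stop early
        merges.append(best)
        symbols = apply_merge(symbols, best)
    return merges


def encode(text, merges):
    """Split text into characters, then replay the merges in learned order."""
    symbols = list(text.replace(" ", SPACE))
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def decode(pieces):
    """Concatenate pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(SPACE, " ")


corpus = "low lower lowest low low"
merges = learn_bpe(corpus, num_merges=5)
pieces = encode("lower low", merges)
print(merges)
print(pieces)
print(decode(pieces))
```

Because ties are broken explicitly and merges are replayed in a fixed order, the same corpus always yields the same merge list and the same segmentation, which is the property the "deterministic vocabulary" bullet refers to.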

Use Cases

  • Building multilingual NLP models
  • Working with CJK languages
  • Ensuring reproducible tokenization across different environments
  • Training models directly on raw text without pre-tokenization
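
For the CJK and raw-text use cases above, the following sketch shows how a Unigram-style tokenizer can segment text that contains no spaces at all. It runs a Viterbi search for the highest-scoring split under a table of piece log-probabilities. Everything here is an assumption made for illustration: the `VOCAB` scores are invented, `UNK_LOGP` and `segment` are hypothetical names, and a real model would learn its scores from a corpus.

```python
import math

# Hypothetical piece log-probabilities; a trained model would supply these.
VOCAB = {
    "東京": -2.0, "東": -4.0, "京": -4.0, "都": -3.5, "京都": -2.5,
    "に": -1.5, "住": -4.0, "む": -3.0, "住む": -2.5,
}
UNK_LOGP = -20.0  # assumed penalty for a single character not in VOCAB


def segment(text, vocab=VOCAB, max_len=4):
    """Return the highest-scoring split of `text` into vocabulary pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                score = vocab[piece]
            elif end - start == 1:
                score = UNK_LOGP  # fall back to one unknown character
            else:
                continue
            # Strict comparison keeps the first candidate on ties,
            # so the output is deterministic.
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


sentence = "東京都に住む"
pieces = segment(sentence)
print(pieces)
print("".join(pieces) == sentence)
```

No word boundaries are given to the search; the segmentation comes entirely from the piece scores, which is why this style of tokenizer does not need a language-specific pre-tokenizer.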

Non-Goals

  • Providing a tokenizer tuned specifically for English-centric tasks (though it can be used for them)
  • Acting as a wrapper for other tokenization libraries like HuggingFace Tokenizers or tiktoken
  • Offering complex pre-processing steps beyond basic Unicode normalization
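
As a rough picture of what "basic Unicode normalization" means in the last item, the sketch below applies NFKC normalization with Python's standard `unicodedata` module. This is an approximation for illustration only: SentencePiece's default rule set (`nmt_nfkc`) is based on NFKC but is applied by the library itself and is not identical to the standard-library call. The `normalize` helper and the sample strings are assumptions.

```python
import unicodedata


def normalize(text):
    """Apply standard NFKC normalization (compatibility fold, then compose)."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "ＡＢＣ１２３",  # full-width Latin letters and digits
    "ﬁle",          # starts with the single-character "fi" ligature (U+FB01)
    "ｶﾀｶﾅ",         # half-width katakana
]

for raw in samples:
    print(repr(raw), "->", repr(normalize(raw)))
```

Folding visually equivalent characters to one form before tokenizing keeps variants of the same text from producing different pieces; anything beyond that, such as lowercasing or stripping markup, is left to the caller.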

Installation

First add the marketplace, then install the plugin:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified: 98/100 (analyzed 1 day ago)

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

© 2025 SkillRepo · Find the right skill, skip the noise.