Sentencepiece


Part of: Agent Native Research Artifact (ARA) Tooling

Language-independent tokenizer that treats text as a raw Unicode sequence. Supports the BPE and Unigram algorithms. Fast (about 50k sentences/sec) and lightweight (about 6 MB memory footprint), with a deterministic vocabulary. Used by T5, ALBERT, XLNet, and mBART. Trains directly on raw text, with no pre-tokenization step. Use it when you need multilingual support, CJK language handling, or reproducible tokenization.

Purpose

To provide a fast, lightweight, and language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for multilingual and CJK language processing.

Features

Language-independent tokenization of raw Unicode text
Support for BPE and Unigram tokenization algorithms
Fast (about 50k sentences/sec) and lightweight (about 6 MB memory footprint)
Deterministic vocabulary for reproducible tokenization
Examples for training, encoding, and decoding

Use Cases

Building multilingual NLP models
Working with CJK languages
Ensuring reproducible tokenization across different environments
Training models directly on raw text without pre-tokenization
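
To make "raw text without pre-tokenization" more concrete, here is a toy byte-pair-encoding sketch in plain Python. It is not SentencePiece's implementation and does not reproduce the library's training procedure or options; it only illustrates the idea of replacing spaces with a visible marker (`▁`, U+2581, the marker SentencePiece uses) so that whitespace is an ordinary symbol and decoding is lossless. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

SPACE = "\u2581"  # visible whitespace marker


def apply_merge(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(text: str, num_merges: int) -> List[Tuple[str, str]]:
    """Greedily learn up to `num_merges` merges from raw text."""
    symbols = list(text.replace(" ", SPACE))  # no word splitting beforehand
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Deterministic choice: highest count, ties broken lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        if pairs[best] < 2:
            break
        merges.append(best)
        symbols = apply_merge(symbols, best)
    return merges


def encode(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply the learned merges, in order, to new text."""
    symbols = list(text.replace(" ", SPACE))
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def decode(pieces: List[str]) -> str:
    """Concatenate pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(SPACE, " ")


text = "low lower lowest 東京 東京都"
merges = learn_merges(text, num_merges=10)
pieces = encode(text, merges)
print(pieces)
print(decode(pieces) == text)
```

Because ties are broken by a fixed rule, the same input always yields the same merges, which is the property the reproducibility use case depends on. The same code handles the CJK fragment without any language-specific word segmentation.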

Non-Goals

Providing a tokenizer tuned specifically for English-centric tasks (though it can be used for them)
Acting as a wrapper for other tokenization libraries like HuggingFace Tokenizers or tiktoken
Offering complex pre-processing steps beyond basic Unicode normalization

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

Then install the plugin:

/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified: 98/100 (analyzed 1 day ago)

Trust Signals

Last commit: 17 days ago
GitHub owner: Orchestra-Research
Stars: 8.3k
Downloads: 0
License: MIT
Website: orchestra-research.com
