
SentencePiece

Skill · Verified · Active

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Purpose

To provide a fast, lightweight, and language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for multilingual and CJK language processing.

Features

  • Language-independent tokenization of raw Unicode text
  • Support for BPE and Unigram tokenization algorithms
  • Fast (50k sentences/sec) and lightweight (6MB memory) performance
  • Deterministic vocabulary for reproducible tokenization
  • Examples for training, encoding, and decoding (see the sketch after this list)
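Below is a minimal sketch of that train/encode/decode loop using the sentencepiece Python package; corpus.txt is a hypothetical one-sentence-per-line UTF-8 file, and the vocabulary size is an arbitrary choice for illustration.

import sentencepiece as spm

# Train a Unigram model directly on raw text; no pre-tokenization step.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical path: one sentence per line, UTF-8
    model_prefix="spm_unigram",  # writes spm_unigram.model and spm_unigram.vocab
    vocab_size=8000,             # arbitrary size for this sketch
    model_type="unigram",        # or "bpe"
)

# Encode and decode with the trained model; the round trip is lossless.
sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
pieces = sp.encode("Hello world", out_type=str)  # subword pieces such as ['▁Hello', '▁world']
ids = sp.encode("Hello world", out_type=int)     # the same tokens as integer ids
assert sp.decode(ids) == "Hello world"

The ▁ (U+2581) marker encodes whitespace as part of each piece, which is what keeps decoding fully reversible.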

Use cases

  • Building multilingual NLP models
  • Working with CJK languages (see the sketch after this list)
  • Ensuring reproducible tokenization across different environments
  • Training models directly on raw text without pre-tokenization
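As a rough illustration of the CJK and reproducibility points above, reusing the hypothetical spm_unigram.model from the earlier sketch: because the model treats input as raw Unicode, Japanese text needs no external word segmenter, and the same model file produces the same pieces everywhere.

import sentencepiece as spm

# Reuses the hypothetical model trained in the earlier sketch.
sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")

# CJK text tokenizes directly from raw Unicode; no word segmenter needed.
print(sp.encode("東京へ行きます", out_type=str))

# Determinism: the same model file yields the same ids for the same input,
# so tokenization is reproducible across machines and runs.
assert sp.encode("東京へ行きます") == sp.encode("東京へ行きます")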

Non-goals

  • Providing a tokenizer aimed specifically at English-centric tasks (though it handles them)
  • Acting as a wrapper for other tokenization libraries like HuggingFace Tokenizers or tiktoken
  • Offering complex pre-processing beyond basic Unicode normalization (see the sketch after this list)
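For context on the normalization non-goal, a hedged sketch: the only built-in pre-processing is a Unicode normalization rule selected at training time, with rule names as documented by SentencePiece; corpus.txt is again a hypothetical path.

import sentencepiece as spm

# Normalization is fixed at training time via a built-in rule name;
# "nmt_nfkc" is the library default and "identity" disables normalization.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # hypothetical corpus path
    model_prefix="spm_nfkc",
    vocab_size=8000,
    normalization_rule_name="nmt_nfkc",
)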

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified
98/100
Analyzed 1 day ago

Trust signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status: Active

Similar extensions

SentencePiece

99

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
davila7

HuggingFace Tokenizers

95

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
davila7

HuggingFace Tokenizers

98

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
Orchestra-Research

LinkedIn Humanizer

100

Scrub AI tells from any text draft OR audit a finished post against the 2026 algorithm heuristic checklist. Tier-based rewriter (forensic / strict / aesthetic / all) plus `--mode audit` for detection-only pass-fail review covering length, hook, CTA, format penalties, AI vocab. Sub-tools: emoji-pattern detector, multi-detector spread tester (GPTZero, Originality.ai, ZeroGPT, Sapling, Copyleaks), rule explainer. Triggers on "humanize", "de-AI", "review this draft", "audit before posting", "is this ready".

Skill
sergebulaev

Convert Resume to Markdown

100

Convert a resume PDF to clean markdown for LLM parsing or candidate pipelines.

Skill
iterationlayer

Sentiment Analyzer

100

Analyze sentiment in text using ML models. Use when: analyzing customer reviews; processing NPS feedback; monitoring brand mentions; evaluating campaign responses; categorizing support tickets

Skill
guia-matthieu