SentencePiece

Skill · Verified · Active

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Purpose

To provide a robust and efficient language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for various multilingual and specialized language tasks.

Features

  • Language-independent tokenization of raw Unicode text
  • Support for BPE and Unigram tokenization algorithms
  • Fast processing (50k sentences/sec) and lightweight memory usage (~6MB)
  • Deterministic vocabulary and reproducible tokenization
  • Training and usage examples for multilingual, CJK, and other language needs
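The BPE idea behind the features above can be sketched in a few lines of standard-library Python. This is a toy illustration, not SentencePiece's actual implementation: the function names (`learn_merges`, `encode`, `decode`) and the tiny corpus are hypothetical, and the real library adds details such as a dummy prefix marker and byte fallback. The sketch shows two properties the list mentions: whitespace is escaped to the visible marker "▁" (U+2581) so raw text needs no pre-tokenization and decodes back exactly, and ties are broken in a fixed order so training is deterministic.

```python
from collections import Counter

SPACE = "\u2581"  # "▁", the visible marker SentencePiece uses for whitespace


def to_symbols(text):
    """Escape spaces so whitespace is an ordinary symbol, then split into characters."""
    return list(text.replace(" ", SPACE))


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair`, left to right, with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    sequences = [to_symbols(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Break frequency ties lexicographically so training is deterministic.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Apply the learned merges in training order."""
    symbols = to_symbols(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: concatenate pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(SPACE, " ")


# Hypothetical mixed-script corpus; no language-specific pre-tokenizer is involved.
corpus = ["low lower lowest", "东京 东京都"]
merges = learn_merges(corpus, num_merges=6)
pieces = encode("lowest 东京", merges)
print(pieces)
print(decode(pieces))
```

Because spaces survive as ordinary symbols, `decode(encode(text, merges))` returns the input unchanged for any text that does not already contain the marker character.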

Use cases

  • Building multilingual NLP models
  • Processing CJK languages (Chinese, Japanese, Korean)
  • Ensuring reproducible tokenization for research and deployment
  • Training models directly on raw text without pre-tokenization
  • Lightweight deployment scenarios requiring minimal resources
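For text without spaces, such as the CJK case above, a Unigram model picks the most probable segmentation over a vocabulary of scored pieces. The following is a minimal sketch of that search using Viterbi decoding, assuming a hand-written toy vocabulary; `viterbi_segment` and `toy_vocab` are hypothetical names for illustration, and a real model learns its piece scores from raw text rather than having them typed in.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary piece to its log-probability.
    """
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best_score = [-math.inf] * (n + 1)  # best_score[i]: best score for text[:i]
    back = [0] * (n + 1)                # back[i]: start index of the last piece
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be covered by the vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical toy vocabulary with made-up probabilities (assumption, not real model data).
toy_vocab = {
    "東京": math.log(0.20),
    "東京都": math.log(0.10),
    "京都": math.log(0.15),
    "東": math.log(0.05),
    "京": math.log(0.05),
    "都": math.log(0.05),
    "に": math.log(0.20),
    "住む": math.log(0.10),
    "住": math.log(0.05),
    "む": math.log(0.05),
}

print(viterbi_segment("東京都に住む", toy_vocab))  # ['東京都', 'に', '住む']
```

The search works on characters alone, so the same code handles any script; only the vocabulary changes.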

Non-goals

  • Providing a faster tokenizer than SentencePiece itself
  • Replacing domain-specific tokenizers for highly specialized tasks
  • Handling tasks beyond text tokenization and de-tokenization

Installation

npx skills add davila7/claude-code-templates

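This runs the Vercel skills CLI (skills.sh) through npx. It requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and it assumes the repository follows the agentskills.io format.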

Quality score

Verified
99/100
Analyzed 1 day ago

Trust signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar extensions

Sentencepiece

98

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Skill
Orchestra-Research

HuggingFace Tokenizers

95

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
davila7

HuggingFace Tokenizers

98

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill
Orchestra-Research

LinkedIn Humanizer

100

Scrub AI tells from any text draft OR audit a finished post against the 2026 algorithm heuristic checklist. Tier-based rewriter (forensic / strict / aesthetic / all) plus `--mode audit` for detection-only pass-fail review covering length, hook, CTA, format penalties, AI vocab. Sub-tools: emoji-pattern detector, multi-detector spread tester (GPTZero, Originality.ai, ZeroGPT, Sapling, Copyleaks), rule explainer. Triggers on "humanize", "de-AI", "review this draft", "audit before posting", "is this ready".

Skill
sergebulaev

Convert Resume to Markdown

100

Convert a resume PDF to clean markdown for LLM parsing or candidate pipelines.

Skill
iterationlayer

Sentiment Analyzer

100

Analyze sentiment in text using ML models. Use when analyzing customer reviews, processing NPS feedback, monitoring brand mentions, evaluating campaign responses, or categorizing support tickets.

Skill
guia-matthieu