
SentencePiece


A language-independent tokenizer that treats text as a raw Unicode stream. It supports the BPE and Unigram algorithms, is fast (about 50k sentences/sec) and lightweight (about 6 MB of memory), and produces a deterministic vocabulary. Used by T5, ALBERT, XLNet, and mBART. It trains on raw text without pre-tokenization. Use it when you need multilingual support, CJK language handling, or reproducible tokenization.
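Treating text as raw Unicode means whitespace is handled as an ordinary symbol: SentencePiece escapes spaces with the meta symbol "▁" (U+2581), so detokenization is a plain join and replace. The sketch below illustrates only that idea. The `toy_split` function is a hypothetical stand-in for a trained model, and the dummy prefix that the real library adds at the start of a sentence by default is left out.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def escape(text):
    """Replace spaces with the visible meta symbol so they become normal characters."""
    return text.replace(" ", WS)


def toy_split(escaped, size=3):
    """Hypothetical stand-in for a trained model: cut the string into fixed-size chunks."""
    return [escaped[i:i + size] for i in range(0, len(escaped), size)]


def detokenize(pieces):
    """Join the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")


text = "Hello world. こんにちは世界"
pieces = toy_split(escape(text))
print(pieces)

# The round trip is lossless for any input that does not itself contain "▁".
assert detokenize(pieces) == text
```

Because no language-specific word splitter is involved, the same three functions handle the English and Japanese parts of the sample identically.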

Purpose

To provide a robust and efficient language-independent tokenizer for raw Unicode text, supporting BPE and Unigram algorithms for various multilingual and specialized language tasks.
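As a rough illustration of the Unigram side, segmentation can be viewed as choosing, among all ways to split a string into vocabulary pieces, the split with the highest total log-probability. The sketch below does this with a small Viterbi search. The vocabulary and its probabilities are invented for the example and do not come from any trained model; the function name `viterbi_segment` is likewise hypothetical.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces found in `log_probs`."""
    n = len(text)
    max_len = max(map(len, log_probs))
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Made-up piece probabilities, for illustration only.
probs = {"uni": 0.08, "un": 0.1, "gram": 0.1, "ram": 0.05, "i": 0.05, "g": 0.02,
         "u": 0.01, "n": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("unigram", log_probs))
```

Keeping every single character in the vocabulary guarantees that some segmentation always exists, which is why the toy table includes the individual letters.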

Features

Language-independent tokenization of raw Unicode text
Support for BPE and Unigram tokenization algorithms
Fast processing (about 50k sentences/sec) with a small memory footprint (about 6 MB)
Deterministic vocabulary and reproducible tokenization
Training and usage examples for multilingual, CJK, and other language needs
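To make the BPE item above concrete, here is a minimal sketch of how merge rules can be learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplification that works on a hypothetical word-frequency table rather than on raw sentences, and it breaks ties lexicographically so the result is reproducible. It is not the library's implementation.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word -> frequency table."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical word counts, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(apply_bpe("lowest", merges))
```

The fixed tie-breaking rule is what makes repeated runs on the same counts yield the same merge list, which mirrors the reproducibility point in the feature list.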

Use cases

Building multilingual NLP models
Processing CJK languages (Chinese, Japanese, Korean)
Ensuring reproducible tokenization for research and deployment
Training models directly on raw text without pre-tokenization
Lightweight deployment scenarios requiring minimal resources

Non-goals

Providing a faster tokenizer than SentencePiece itself
Replacing domain-specific tokenizers for highly specialized tasks
Handling tasks beyond text tokenization and de-tokenization

Installation

npx skills add davila7/claude-code-templates

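This runs the Vercel skills CLI (skills.sh) through npx. It requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.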

Quality score

Verified

99/100

Analyzed 1 day ago

Trust signals

Last commit: 1 day ago

GitHub owner: davila7

Stars: 27.2k

Downloads: 23k

License: MIT

Website: aitmpl.com


Similar extensions

Sentencepiece

Skill

Orchestra-Research

HuggingFace Tokenizers

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Skill

davila7

HuggingFace Tokenizers

Skill

Orchestra-Research

LinkedIn Humanizer

100

Scrub AI tells from any text draft OR audit a finished post against the 2026 algorithm heuristic checklist. Tier-based rewriter (forensic / strict / aesthetic / all) plus `--mode audit` for detection-only pass-fail review covering length, hook, CTA, format penalties, AI vocab. Sub-tools: emoji-pattern detector, multi-detector spread tester (GPTZero, Originality.ai, ZeroGPT, Sapling, Copyleaks), rule explainer. Triggers on "humanize", "de-AI", "review this draft", "audit before posting", "is this ready".

Skill

sergebulaev

Convert Resume to Markdown

100

Convert a resume PDF to clean markdown for LLM parsing or candidate pipelines.

Skill

iterationlayer

Sentiment Analyzer

100

Analyze sentiment in text using ML models. Use when: analyzing customer reviews; processing NPS feedback; monitoring brand mentions; evaluating campaign responses; categorizing support tickets

Skill

guia-matthieu