
Speculative Decoding

Skill · Verified · Active

Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

Purpose

To significantly speed up LLM inference and reduce latency by employing cutting-edge techniques like speculative decoding, Medusa, and lookahead decoding.
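The core accept/reject loop of speculative decoding can be sketched with toy stand-in models. Everything below (`draft_probs`, `target_probs`, the vocabulary, and the fixed distributions) is an illustrative assumption, not a real model:

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c", "d"]

def draft_probs(prefix):
    # Hypothetical small "draft" model: a fixed, slightly-off distribution.
    return {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

def target_probs(prefix):
    # Hypothetical large "target" model: the distribution we must match exactly.
    return {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.1}

def sample(dist):
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    Acceptance rule: keep draft token x with probability min(1, p(x)/q(x));
    on the first rejection, resample from the normalized residual
    max(0, p - q), which preserves the target distribution exactly.
    """
    # 1) Draft phase: the cheap model proposes k tokens autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        x = sample(q)
        drafted.append((x, q))
        ctx.append(x)

    # 2) Verify phase: the target model scores all k positions "in parallel".
    accepted, ctx = [], list(prefix)
    for x, q in drafted:
        p = target_probs(ctx)
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)                     # draft token accepted
            ctx.append(x)
        else:
            residual = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(residual.values())
            accepted.append(sample({t: v / z for t, v in residual.items()}))
            break                                  # corrected token; stop here
    return accepted

print(speculative_step(["<s>"]))
```

Resampling rejected positions from the normalized residual max(0, p − q) is what makes the output distribution identical to sampling from the target model alone; the speedup comes from accepting several draft tokens per target-model pass.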

Capabilities

  • Accelerate LLM inference using speculative decoding
  • Implement Medusa for multi-head parallel prediction
  • Utilize Lookahead Decoding for Jacobi iteration-based speedups
  • Provide installation instructions for key libraries
  • Offer runnable code examples for each technique
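The Medusa-style multi-head prediction listed above can be illustrated with a toy deterministic model. `base_next`, `medusa_heads`, and the lookup table are hypothetical stand-ins; in real Medusa the "heads" are extra decoding heads trained on top of the frozen base model, and verification happens inside a single forward pass with tree attention:

```python
def base_next(prefix):
    # Hypothetical base model: a deterministic next-token lookup (toy rule).
    table = {"": "the", "the": "cat", "cat": "sat", "sat": "down"}
    return table.get(prefix[-1] if prefix else "", "<eos>")

def medusa_heads(prefix):
    # Hypothetical Medusa heads: guess the next 3 tokens at once.
    # The third guess is deliberately wrong to show partial acceptance.
    return ["cat", "sat", "up"]

def medusa_step(prefix):
    """Verify head guesses left-to-right against the base model.

    Each guess is kept only if it matches what the base model itself would
    emit at that position; the first mismatch is replaced by the base
    model's token and verification stops. Output therefore equals greedy
    decoding, just produced several tokens per step instead of one.
    """
    accepted, ctx = [], list(prefix)
    for g in medusa_heads(prefix):
        t = base_next(ctx)
        accepted.append(t)
        ctx.append(t)
        if t != g:          # mismatch: the remaining guesses are invalid
            break
    return accepted

print(medusa_step(["the"]))  # ['cat', 'sat', 'down']
```

Two of the three guesses match, so this step emits three tokens for the cost of one verification pass; the wrong third guess is silently replaced by the base model's own token, so quality is unchanged.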

Use Cases

  • Optimizing inference speed for LLMs (1.5-3.6× speedup)
  • Reducing latency for real-time applications like chatbots
  • Deploying LLM models efficiently on hardware with limited compute
  • Generating tokens faster without sacrificing model quality
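The "faster without sacrificing quality" point holds for lookahead decoding too: its Jacobi iteration converges to exactly the autoregressive greedy output. A sketch on a toy greedy model (`next_token` and the +1 rule are illustrative assumptions):

```python
def next_token(prefix):
    # Hypothetical greedy model: next token = last token + 1 (toy rule).
    return prefix[-1] + 1

def jacobi_decode(prompt, n, guess=0):
    """Jacobi fixed-point iteration over an n-token window.

    Every position is refreshed in parallel from the previous iterate
    instead of waiting for its left neighbors. The fixed point is exactly
    the autoregressive greedy output; each sweep fixes at least one more
    position, so at most n+1 sweeps are needed, and correct initial
    guesses let it finish in far fewer.
    """
    y = [guess] * n
    sweeps = 0
    while True:
        sweeps += 1
        # One parallel sweep: position i conditions on the *old* y[:i].
        new = [next_token(prompt + y[:i]) for i in range(n)]
        if new == y:        # reached the fixed point
            break
        y = new
    return y, sweeps

seq, sweeps = jacobi_decode([10], 5)
print(seq, sweeps)  # [11, 12, 13, 14, 15] 6
```

With this worst-case toy model each sweep only fixes one position; real lookahead decoding gains speed by caching n-gram trajectories from earlier sweeps and verifying several of them per step.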

Non-Goals

  • Training large language models from scratch
  • Fine-tuning models for specific downstream tasks beyond inference optimization
  • Providing a generic LLM serving framework without focus on acceleration techniques

Execution

  • Pinned dependencies: dependencies are listed, but specific pinned versions or lockfiles are not shown in the provided context.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT
Status

Similar Extensions

Speculative Decoding

98

Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

Skill
Orchestra-Research

Agent Resource Allocator

98

Agent skill for resource-allocator - invoke with $agent-resource-allocator

Skill
ruvnet

Game Developer

98

Use when building game systems, implementing Unity/Unreal Engine features, or optimizing game performance. Invoke to implement ECS architecture, configure physics systems and colliders, set up multiplayer networking with lag compensation, optimize frame rates to 60+ FPS targets, develop shaders, or apply game design patterns such as object pooling and state machines. Trigger keywords: Unity, Unreal Engine, game development, ECS architecture, game physics, multiplayer networking, game optimization, shader programming, game AI.

Skill
jeffallan

Game Technical Director

98

Invoke when the user asks about game architecture, engine selection, performance budgets, technical debt, build pipeline, cross-platform, rendering pipeline, or CI/CD for games. Triggers on: "architecture", "engine selection", "performance budget", "tech debt", "build pipeline", "cross-platform", "rendering", "CI/CD". Do NOT invoke for creative vision (use game-creative-director) or engine-specific code (use engine specialists). Part of the AlterLab GameForge collection.

Skill
AlterLab-IEU

Openrlhf Training

97

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
davila7

V3 Integration Deep

95

Deep agentic-flow@alpha integration implementing ADR-001. Eliminates 10,000+ duplicate lines by building claude-flow as specialized extension rather than parallel implementation.

Skill
ruvnet