
Model Pruning

Skill · Verified · Active

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Purpose

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT, enabling efficient deployment on constrained hardware and faster serving.
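The simplest one-shot baseline behind these techniques is magnitude pruning: zero the smallest-magnitude weights until a target sparsity is reached. A minimal NumPy sketch (the function name and example matrix are illustrative, not taken from the skill itself):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot magnitude pruning: zero the smallest-|w| entries."""
    k = int(weights.size * sparsity)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # keep only weights above it
    return weights * mask

W = np.array([[0.1, -2.0, 0.03, 1.5],
              [-0.4, 0.02, 3.0, -0.05]])
W_pruned = magnitude_prune(W, sparsity=0.5)  # half the entries zeroed
```

Ties at the threshold can push effective sparsity slightly above the target; Wanda and SparseGPT refine this baseline by using activation statistics instead of raw magnitudes.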

Capabilities

  • Reduces model size and accelerates inference via pruning
  • Implements Wanda (weights × activations) pruning
  • Supports SparseGPT (second-order pruning)
  • Covers structured, unstructured, and N:M sparsity
  • Enables compression without retraining (one-shot methods)
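The Wanda scoring rule combines the two signals named above: each weight's importance is its magnitude times the L2 norm of the corresponding input activation channel, and the lowest-scoring weights are dropped per output row. A hedged NumPy sketch of that scoring step, using a hypothetical `wanda_prune` helper and toy calibration data:

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Wanda score = |weight| * L2 norm of the input activation channel.

    W: (out_features, in_features) weight matrix
    X: (n_samples, in_features) calibration activations
    """
    act_norm = np.linalg.norm(X, axis=0)   # per-input-channel L2 norm
    score = np.abs(W) * act_norm           # (out, in) importance scores
    k = int(W.shape[1] * sparsity)         # weights to drop per output row
    pruned = W.copy()
    for i in range(W.shape[0]):
        drop = np.argsort(score[i])[:k]    # lowest-scoring weights in this row
        pruned[i, drop] = 0.0
    return pruned

W = np.array([[0.5, -1.0, 0.2, 2.0],
              [1.5,  0.1, -0.3, 0.4]])
X = np.array([[1.0, 0.1, 2.0, 0.5],
              [1.0, 0.1, 2.0, 0.5]])
W_sparse = wanda_prune(W, X, sparsity=0.5)  # 50% sparsity per row
```

Because the score only needs a forward pass over a small calibration set, Wanda is a one-shot method: no gradients, no retraining.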

Use Cases

  • Compressing models without retraining
  • Achieving 50% sparsity with minimal accuracy loss
  • Enabling faster inference on hardware accelerators
  • Deploying LLMs on constrained hardware (mobile, edge)
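The hardware-accelerator use case typically relies on N:M sparsity, where every group of M consecutive weights keeps only its N largest-magnitude entries (e.g., the 2:4 pattern that NVIDIA sparse tensor cores accelerate). A minimal sketch of producing such a mask, assuming NumPy (helper name is illustrative):

```python
import numpy as np

def nm_prune(W: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """N:M sparsity: keep the n largest-|w| weights in every group of m."""
    out, cols = W.shape
    assert cols % m == 0, "columns must divide evenly into groups of m"
    groups = np.abs(W).reshape(out, cols // m, m)
    # indices of the (m - n) smallest entries in each group of m
    drop = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)  # zero the smallest per group
    return W * mask.reshape(out, cols)

W = np.array([[1.0, 2.0, 3.0, 4.0, -5.0, 0.1, 0.2, 6.0]])
W_24 = nm_prune(W, n=2, m=4)  # exactly 2 nonzeros per group of 4
```

Unlike unstructured 50% sparsity, this regular pattern is what lets sparse hardware skip the zeroed multiplications and deliver real wall-clock speedups.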

Non-Goals

  • Full retraining of pruned models
  • Achieving speedups from unstructured pruning on hardware without sparsity support
  • Pruning methods beyond those listed (e.g., iterative pruning with fine-tuning is outside this skill's primary examples)

Execution

  • Info — Pinned dependencies: dependencies are listed in SKILL.md but not pinned to explicit versions or lockfiles, which could lead to compatibility issues.

Installation

Add the Marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Similar Extensions

Model Pruning

95

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Skill
davila7

Model Merging

98

Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.

Skill
Orchestra-Research

Speculative Decoding

98

Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

Skill
Orchestra-Research

Outlines

98

Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library

Skill
Orchestra-Research

TensorRT-LLM

98

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
Orchestra-Research

RWKV Architecture

96

RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.

Skill
davila7