
Model Pruning

Skill · Verified · Active

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Purpose

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT, enabling efficient deployment on constrained hardware and faster serving.
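The simplest one-shot baseline behind these techniques is magnitude pruning: zero the smallest-magnitude weights until a target sparsity is reached. A minimal NumPy sketch (the function name and example matrix are illustrative, not taken from the skill itself):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot magnitude pruning: zero the smallest-|w| entries."""
    k = int(weights.size * sparsity)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # keep only weights above it
    return weights * mask

W = np.array([[0.1, -2.0, 0.03, 1.5],
              [-0.4, 0.02, 3.0, -0.05]])
W_pruned = magnitude_prune(W, sparsity=0.5)  # half the entries zeroed
```

Ties at the threshold can push effective sparsity slightly above the target; Wanda and SparseGPT refine this baseline by using activation statistics instead of raw magnitudes.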

Capabilities

  • Reduces model size and accelerates inference via pruning
  • Implements Wanda (weights × activations) pruning
  • Supports SparseGPT (second-order pruning)
  • Covers structured, unstructured, and N:M sparsity
  • Enables compression without retraining (one-shot methods)
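The Wanda scoring rule combines the two signals named above: each weight's importance is its magnitude times the L2 norm of the corresponding input activation channel, and the lowest-scoring weights are dropped per output row. A hedged NumPy sketch of that scoring step, using a hypothetical `wanda_prune` helper and toy calibration data:

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Wanda score = |weight| * L2 norm of the input activation channel.

    W: (out_features, in_features) weight matrix
    X: (n_samples, in_features) calibration activations
    """
    act_norm = np.linalg.norm(X, axis=0)   # per-input-channel L2 norm
    score = np.abs(W) * act_norm           # (out, in) importance scores
    k = int(W.shape[1] * sparsity)         # weights to drop per output row
    pruned = W.copy()
    for i in range(W.shape[0]):
        drop = np.argsort(score[i])[:k]    # lowest-scoring weights in this row
        pruned[i, drop] = 0.0
    return pruned

W = np.array([[0.5, -1.0, 0.2, 2.0],
              [1.5,  0.1, -0.3, 0.4]])
X = np.array([[1.0, 0.1, 2.0, 0.5],
              [1.0, 0.1, 2.0, 0.5]])
W_sparse = wanda_prune(W, X, sparsity=0.5)  # 50% sparsity per row
```

Because the score only needs a forward pass over a small calibration set, Wanda is a one-shot method: no gradients, no retraining.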

Use Cases

  • Compressing models without retraining
  • Achieving 50% sparsity with minimal accuracy loss
  • Enabling faster inference on hardware accelerators
  • Deploying LLMs on constrained hardware (mobile, edge)
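The hardware-accelerator use case typically relies on N:M sparsity, where every group of M consecutive weights keeps only its N largest-magnitude entries (e.g., the 2:4 pattern that NVIDIA sparse tensor cores accelerate). A minimal sketch of producing such a mask, assuming NumPy (helper name is illustrative):

```python
import numpy as np

def nm_prune(W: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """N:M sparsity: keep the n largest-|w| weights in every group of m."""
    out, cols = W.shape
    assert cols % m == 0, "columns must divide evenly into groups of m"
    groups = np.abs(W).reshape(out, cols // m, m)
    # indices of the (m - n) smallest entries in each group of m
    drop = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)  # zero the smallest per group
    return W * mask.reshape(out, cols)

W = np.array([[1.0, 2.0, 3.0, 4.0, -5.0, 0.1, 0.2, 6.0]])
W_24 = nm_prune(W, n=2, m=4)  # exactly 2 nonzeros per group of 4
```

Unlike unstructured 50% sparsity, this regular pattern is what lets sparse hardware skip the zeroed multiplications and deliver real wall-clock speedups.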

Non-Goals

  • Full retraining of pruned models
  • Achieving speedups from unstructured pruning on hardware without sparsity support
  • Pruning methods beyond those listed (e.g., iterative pruning with fine-tuning is outside this skill's primary examples)

Execution

  • Info — Pinned dependencies: dependencies are listed in SKILL.md but not pinned to explicit versions or lockfiles, which could lead to compatibility issues.

Installation

Add the Marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Similar Extensions

Model Pruning

95

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Skill
davila7

Model Merging

98

Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.

Skill
Orchestra-Research

Speculative Decoding

98

Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

Skill
Orchestra-Research

Outlines

98

Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library

Skill
Orchestra-Research

TensorRT-LLM

98

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
Orchestra-Research

RWKV Architecture

96

RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.

Skill
davila7