Model Pruning
Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.
Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT, enabling efficient deployment on constrained hardware and faster serving.
Features
- Reduces model size and accelerates inference via pruning
- Implements Wanda (weights × activations) pruning (see the sketch after this list)
- Supports SparseGPT (second-order pruning)
- Covers structured, unstructured, and N:M sparsity
- Enables compression without retraining (one-shot methods)
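For orientation, below is a minimal sketch of the Wanda scoring rule: each weight is scored by |W_ij| · ||X_j||_2 (weight magnitude times the L2 norm of the corresponding input activation over a calibration set), and the lowest-scoring weights within each output row are zeroed. The function and variable names (`wanda_prune`, `act_norms`) are illustrative assumptions, not part of the skill's code.

```python
# Minimal sketch of Wanda-style one-shot pruning for one linear layer.
# Assumptions: `weight` is [out_features, in_features]; `act_norms` holds
# per-input-channel L2 norms from a small calibration set.
import torch

def wanda_prune(weight: torch.Tensor, act_norms: torch.Tensor,
                sparsity: float = 0.5) -> torch.Tensor:
    # Wanda score per weight: |W_ij| * ||X_j||_2.
    scores = weight.abs() * act_norms.unsqueeze(0)
    # Within each output row, zero the lowest-scoring fraction of weights.
    k = int(weight.shape[1] * sparsity)
    _, prune_idx = torch.topk(scores, k, dim=1, largest=False)
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask

w = torch.randn(8, 16)
norms = torch.rand(16)
print((wanda_prune(w, norms) == 0).float().mean())  # ~0.50 sparsity
```

Because the score needs only weight magnitudes and calibration activation norms, the method is one-shot: no gradients, no weight updates, no retraining.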
Use Cases
- Compressing models without retraining
- Achieving 50% sparsity with minimal accuracy loss
- Enabling faster inference on hardware accelerators (e.g., 2:4 sparse tensor cores; see the sketch after this list)
- Deploying LLMs on constrained hardware (mobile, edge)
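To make the hardware-acceleration use case concrete, here is a hedged sketch of 2:4 (N:M) magnitude pruning, the pattern that NVIDIA Ampere-and-later sparse tensor cores can exploit. `prune_2_4` is a hypothetical helper, not part of the skill:

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    # In every contiguous group of 4 input weights, keep the 2 with the
    # largest magnitude and zero the rest -> exactly 50% sparsity.
    out_f, in_f = weight.shape
    assert in_f % 4 == 0, "input dim must be divisible by the group size"
    groups = weight.reshape(out_f, in_f // 4, 4)
    _, keep_idx = torch.topk(groups.abs(), 2, dim=-1)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(8, 16)
print((prune_2_4(w) == 0).float().mean())  # 0.50
```

The regular layout is what buys the speedup: unlike an unstructured mask, a fixed N:M pattern can be packed into the sparse formats these tensor cores consume directly.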
Non-Goals
- Full retraining of pruned models
- Achieving inference speedups from unstructured pruning on hardware that lacks sparsity support
- Pruning methods beyond those listed (e.g., iterative pruning with fine-tuning is outside this skill's primary examples)
Execution
- Info (pinned dependencies): dependencies are listed but not explicitly pinned with versions or lockfiles in the SKILL.md, which could lead to compatibility issues; a hypothetical pinned-requirements sketch follows below.
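One easy mitigation is to pin versions yourself before running the skill. The file below is a hypothetical example; the package set and versions are assumptions, not taken from SKILL.md:

```text
# requirements.txt — hypothetical pins, not from SKILL.md
torch==2.4.0
transformers==4.44.2
datasets==2.21.0
```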
Installation
First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

Then install the skill:

/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
Model Merging (98)
Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.
Speculative Decoding (98)
Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.
Outlines (98)
Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines, dottxt.ai's structured generation library.
TensorRT-LLM (98)
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100× faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
RWKV Architecture (96)
An RNN+Transformer hybrid with O(n) inference: linear time, effectively unbounded context, and no KV cache. Train like GPT (parallel), infer like an RNN (sequential). A Linux Foundation AI project, in production in Windows, Office, and NeMo. RWKV-7 released March 2025; models up to 14B parameters.