
Model Pruning


Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Purpose

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT, enabling efficient deployment on constrained hardware and faster serving.
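The simplest of the one-shot methods covered is magnitude pruning: zero the weights with the smallest absolute values, with no retraining. A minimal NumPy sketch (the function name and shapes are illustrative, not this skill's actual API):

```python
import numpy as np

def magnitude_prune(weight, sparsity=0.5):
    # Global magnitude pruning: zero the `sparsity` fraction of weights
    # with the smallest absolute values (one-shot, no retraining).
    k = int(weight.size * sparsity)
    # Value at sorted position k of |W| serves as the keep threshold.
    threshold = np.partition(np.abs(weight).ravel(), k)[k]
    return np.where(np.abs(weight) >= threshold, weight, 0.0)
```

At 50% sparsity this halves the nonzero parameter count, though actual speedups depend on hardware support for sparse formats (see Non-Goals below).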

Features

  • Reduces model size and accelerates inference via pruning
  • Implements Wanda (weights × activations) pruning
  • Supports SparseGPT (second-order pruning)
  • Covers structured, unstructured, and N:M sparsity
  • Enables compression without retraining (one-shot methods)
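The Wanda criterion listed above scores each weight by its magnitude times the L2 norm of its input activation channel, then prunes the lowest-scoring weights within each output row. A hedged sketch of that scoring rule (names and comparison grouping are illustrative; the skill's actual implementation may differ):

```python
import numpy as np

def wanda_prune(weight, act_norm, sparsity=0.5):
    # Wanda-style one-shot scoring: importance of W[i, j] is
    # |W[i, j]| * ||X_j||_2, compared within each output row.
    # weight: (out_features, in_features); act_norm: (in_features,)
    score = np.abs(weight) * act_norm[None, :]
    k = int(weight.shape[1] * sparsity)              # weights to zero per row
    idx = np.argpartition(score, k, axis=1)[:, :k]   # k lowest scores per row
    mask = np.ones_like(weight, dtype=bool)
    np.put_along_axis(mask, idx, False, axis=1)      # False = pruned
    return weight * mask
```

The per-row comparison group and the activation-norm term are what distinguish Wanda from plain magnitude pruning; activation norms are typically gathered from a small calibration set.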

Use Cases

  • Compressing models without retraining
  • Achieving 50% sparsity with minimal accuracy loss
  • Enabling faster inference on hardware accelerators
  • Deploying LLMs on constrained hardware (mobile, edge)
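For the hardware-accelerator use case, N:M sparsity keeps N of every M consecutive weights, a pattern that sparse tensor cores (e.g. 2:4 on NVIDIA Ampere and later) can accelerate. An illustrative sketch of enforcing 2:4 by magnitude (assumed helper, not this skill's API):

```python
import numpy as np

def prune_2_4(weight):
    # Enforce 2:4 structured sparsity: in every group of 4 consecutive
    # input weights, zero the 2 with the smallest magnitude.
    out_f, in_f = weight.shape
    assert in_f % 4 == 0, "input dimension must be divisible by the group size"
    groups = weight.reshape(out_f, in_f // 4, 4)
    order = np.argsort(np.abs(groups), axis=2)        # rank within each group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[:, :, :2], False, axis=2)  # drop 2 smallest
    return (groups * mask).reshape(out_f, in_f)
```

Unlike unstructured pruning, this pattern trades some flexibility in which weights are removed for a layout the hardware can actually exploit.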

Non-Goals

  • Full retraining of pruned models
  • Achieving inference speedups from unstructured pruning on hardware without sparsity support
  • Pruning methods beyond those listed (e.g., iterative pruning with fine-tuning)

Execution

  • Info (pinned dependencies): dependencies are listed in SKILL.md but not pinned to explicit versions or lockfiles, which could cause compatibility issues.

Installation

First add the marketplace, then install the skill:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
View Source

Similar Extensions

Model Pruning

95

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Skill
davila7

Model Merging

98

Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.

Skill
Orchestra-Research

Speculative Decoding

98

Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

Skill
Orchestra-Research

Outlines

98

Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library

Skill
Orchestra-Research

TensorRT-LLM

98

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
Orchestra-Research

RWKV Architecture

96

RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.

Skill
davila7

© 2025 SkillRepo · Find the right skill, skip the noise.