Model Pruning
Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.
Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT, enabling efficient deployment on constrained hardware and faster serving.
Features
- Reduces model size and accelerates inference via pruning
- Implements Wanda (weights × activations) pruning (see the sketch after this list)
- Supports SparseGPT (second-order pruning)
- Covers structured, unstructured, and N:M sparsity
- Enables compression without retraining (one-shot methods)
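For orientation, below is a minimal sketch of the Wanda scoring rule: each weight is scored by |W_ij| · ||X_j||_2 (weight magnitude times the L2 norm of the corresponding input activation over a calibration set), and the lowest-scoring weights within each output row are zeroed. The function and variable names (`wanda_prune`, `act_norms`) are illustrative assumptions, not part of the skill's code.

```python
# Minimal sketch of Wanda-style one-shot pruning for one linear layer.
# Assumptions: `weight` is [out_features, in_features]; `act_norms` holds
# per-input-channel L2 norms from a small calibration set.
import torch

def wanda_prune(weight: torch.Tensor, act_norms: torch.Tensor,
                sparsity: float = 0.5) -> torch.Tensor:
    # Wanda score per weight: |W_ij| * ||X_j||_2.
    scores = weight.abs() * act_norms.unsqueeze(0)
    # Within each output row, zero the lowest-scoring fraction of weights.
    k = int(weight.shape[1] * sparsity)
    _, prune_idx = torch.topk(scores, k, dim=1, largest=False)
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask

w = torch.randn(8, 16)
norms = torch.rand(16)
print((wanda_prune(w, norms) == 0).float().mean())  # ~0.50 sparsity
```

Because the score needs only weight magnitudes and calibration activation norms, the method is one-shot: no gradients, no weight updates, no retraining.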
Use Cases
- Compressing models without retraining
- Achieving 50% sparsity with minimal accuracy loss
- Enabling faster inference on hardware accelerators (e.g., 2:4 sparse tensor cores; see the sketch after this list)
- Deploying LLMs on constrained hardware (mobile, edge)
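To make the hardware-acceleration use case concrete, here is a hedged sketch of 2:4 (N:M) magnitude pruning, the pattern that NVIDIA Ampere-and-later sparse tensor cores can exploit. `prune_2_4` is a hypothetical helper, not part of the skill:

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    # In every contiguous group of 4 input weights, keep the 2 with the
    # largest magnitude and zero the rest -> exactly 50% sparsity.
    out_f, in_f = weight.shape
    assert in_f % 4 == 0, "input dim must be divisible by the group size"
    groups = weight.reshape(out_f, in_f // 4, 4)
    _, keep_idx = torch.topk(groups.abs(), 2, dim=-1)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(8, 16)
print((prune_2_4(w) == 0).float().mean())  # 0.50
```

The regular layout is what buys the speedup: unlike an unstructured mask, a fixed N:M pattern can be packed into the sparse formats these tensor cores consume directly.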
Non-Goals
- Full retraining of pruned models
- Achieving inference speedups from unstructured pruning on hardware that lacks sparsity support
- Pruning methods beyond those listed (e.g., iterative pruning with fine-tuning is outside this skill's primary examples)
Execution
- Info (pinned dependencies): dependencies are listed but not explicitly pinned with versions or lockfiles in the SKILL.md, which could lead to compatibility issues; a hypothetical pinned-requirements sketch follows below.
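One easy mitigation is to pin versions yourself before running the skill. The file below is a hypothetical example; the package set and versions are assumptions, not taken from SKILL.md:

```text
# requirements.txt — hypothetical pins, not from SKILL.md
torch==2.4.0
transformers==4.44.2
datasets==2.21.0
```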
Installation
First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

Then install the skill:

/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
Model Merging (98)
Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.
Speculative Decoding (98)
Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.
Outlines (98)
Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines, dottxt.ai's structured generation library.
TensorRT-LLM (98)
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100× faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
RWKV Architecture (96)
An RNN+Transformer hybrid with O(n) inference: linear time, effectively unbounded context, and no KV cache. Train like GPT (parallel), infer like an RNN (sequential). A Linux Foundation AI project, in production in Windows, Office, and NeMo. RWKV-7 released March 2025; models up to 14B parameters.