GPTQ
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
This skill enables users to deploy large LLMs on consumer GPUs, or to achieve faster inference, by quantizing models to 4-bit with the GPTQ method at minimal accuracy degradation.
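As a minimal sketch of the workflow (assuming the optimum and auto-gptq backends are installed; `facebook/opt-125m` is only a stand-in for the model you actually want to quantize), quantization through the transformers API looks roughly like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # stand-in; substitute the model you want to quantize
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization using the built-in "c4" calibration dataset
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs layer by layer as the model is loaded onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
)

# Save the quantized weights for later 4-bit inference
model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```

GPTQ is calibration-based rather than training-based, so this step needs a GPU and a small calibration dataset but no retraining; the saved checkpoint can then be loaded directly in 4-bit.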
Features
- 4-bit post-training quantization for LLMs
- Minimal accuracy loss (<2% perplexity degradation)
- 4x memory reduction for large models
- 3-4x faster inference compared to FP16
- Integration with Transformers, PEFT, and vLLM
- Support for various kernel backends (ExLlamaV2, Marlin, Triton)
- Guidance on calibration data selection and quantization configuration (see the configuration sketch after this list)
- Instructions for quantizing custom models
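A hedged sketch of the calibration and kernel-backend options listed above (parameter names follow the transformers `GPTQConfig` API in recent versions; the calibration sentences and the `your-username/opt-125m-gptq` repo are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data can be a named dataset ("c4", "wikitext2") or your own
# list of strings drawn from the target domain; a few hundred samples is typical.
calibration_samples = [
    "Example sentence representative of the deployment domain.",
    "Another sample used to calibrate per-layer quantization error.",
]

quant_config = GPTQConfig(
    bits=4,
    group_size=128,          # smaller groups trade memory for accuracy
    desc_act=True,           # activation-order quantization, often better accuracy
    dataset=calibration_samples,
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=quant_config
)

# Loading an already-quantized checkpoint with the ExLlamaV2 kernel backend:
inference_config = GPTQConfig(bits=4, exllama_config={"version": 2})
quantized = AutoModelForCausalLM.from_pretrained(
    "your-username/opt-125m-gptq",  # placeholder repo containing GPTQ weights
    device_map="auto",
    quantization_config=inference_config,
)
```

Kernel choice mostly affects inference speed rather than accuracy, so it is usually selected at load time rather than at quantization time.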
Use Cases
- Deploying large LLMs (70B, 405B) on memory-constrained consumer GPUs.
- Reducing memory usage of LLMs for faster loading and inference.
- Achieving significant inference speedups for real-time applications.
- Fine-tuning quantized models using QLoRA for memory efficiency (sketched below).
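The QLoRA-style use case can be sketched as follows, assuming a 4-bit GPTQ checkpoint already exists (`your-username/llama-7b-gptq` is a placeholder) and recent versions of transformers and peft; ExLlama kernels are generally disabled for training:

```python
from transformers import AutoModelForCausalLM, GPTQConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quantized_repo = "your-username/llama-7b-gptq"  # placeholder: a 4-bit GPTQ checkpoint

# ExLlama kernels do not support training, so fall back to the standard CUDA kernel
gptq_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained(
    quantized_repo, device_map="auto", quantization_config=gptq_config
)

# Freeze the 4-bit base weights and prepare the model for adapter training
model = prepare_model_for_kbit_training(model)

# Train only small LoRA adapters (well under 1% of the parameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adjust to the model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train with transformers.Trainer or a similar training loop as usual.
```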
Non-Goals
- Providing pre-quantized models directly (focus is on the method).
- Supporting quantization methods other than GPTQ.
- Quantization during training (focus is on post-training quantization).
- Optimizations for CPU-only inference (focus is on GPU acceleration).
Installation
First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

Then install the plugin:

/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
Peft Fine Tuning (score: 99)
Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. Hugging Face's official library, integrated with the transformers ecosystem.
Vector Index Tuning (score: 99)
Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.
Performance Analysis (score: 100)
Comprehensive performance analysis, bottleneck detection, and optimization recommendations for Claude Flow swarms.
Oraclaw Solver (score: 100)
Industrial-grade scheduling and resource optimization for AI agents. Solve task scheduling with energy matching, budget allocation, and any LP/MIP constraint problem in milliseconds.
Oraclaw Decide (score: 100)
Decision intelligence for AI agents. Analyze options, map decision dependencies with PageRank, detect when information sources conflict, and find the choices that matter most.
MongoDB Connection Optimizer (score: 100)
Optimize MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill when working on, updating, or reviewing functions that instantiate or configure a MongoDB client (e.g., when calling `connect()`), when configuring connection pools, when troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), or when optimizing performance issues related to connections. This includes scenarios like building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, creating long-running tasks with concurrency, or debugging connection-related failures.