GPTQ
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for 3-4× faster inference vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
Enables users to deploy large LLMs on consumer GPUs, or to achieve faster inference, by quantizing models to 4-bit with the GPTQ method at minimal accuracy degradation.
Features
- 4-bit post-training quantization for LLMs
- Minimal accuracy loss (<2% perplexity degradation)
- 4x memory reduction for large models
- 3-4x faster inference compared to FP16
- Integration with Transformers, PEFT, and vLLM
- Support for various kernel backends (ExLlamaV2, Marlin, Triton)
- Guidance on calibration data selection and quantization configuration
- Instructions for quantizing custom models
Use cases
- Deploying large LLMs (70B, 405B) on memory-constrained consumer GPUs.
- Reducing memory usage of LLMs for faster loading and inference.
- Achieving significant inference speedups for real-time applications.
- Fine-tuning quantized models using QLoRA for memory efficiency.
Non-goals
- Providing pre-quantized models directly (focus is on the method).
- Supporting quantization methods other than GPTQ.
- Quantization during training (focus is on post-training quantization).
- Optimizations for CPU-only inference (focus is on GPU acceleration).
Installation
Add the Marketplace first:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the skill:
/plugin install AI-Research-SKILLs@ai-research-skills