GPTQ
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for 3-4× faster inference vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
Enables users to deploy large LLMs on consumer GPUs, or to achieve faster inference, by quantizing models to 4-bit with the GPTQ method at minimal accuracy degradation.
Features
- 4-bit post-training quantization for LLMs
- Minimal accuracy loss (<2% perplexity degradation)
- 4x memory reduction for large models
- 3-4x faster inference compared to FP16
- Integration with Transformers, PEFT, and vLLM
- Support for various kernel backends (ExLlamaV2, Marlin, Triton)
- Guidance on calibration data selection and quantization configuration
- Instructions for quantizing custom models
Use cases
- Deploying large LLMs (70B, 405B) on memory-constrained consumer GPUs.
- Reducing memory usage of LLMs for faster loading and inference.
- Achieving significant inference speedups for real-time applications.
- Fine-tuning quantized models using QLoRA for memory efficiency.
Non-goals
- Providing pre-quantized models directly (focus is on the method).
- Supporting quantization methods other than GPTQ.
- Quantization during training (focus is on post-training quantization).
- Optimizations for CPU-only inference (focus is on GPU acceleration).
Installation
Add the Marketplace first:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the skill:
/plugin install AI-Research-SKILLs@ai-research-skills