
AWQ Quantization

Skill · Verified · Active

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

Purpose

To enable efficient deployment of large language models on resource-constrained hardware by compressing model weights with minimal performance degradation.

Features

  • Activation-aware weight quantization for 4-bit LLMs
  • Minimal accuracy loss (<5%)
  • Significant inference speedup (~2.5-3x)
  • Support for various kernel backends (GEMM, GEMV, Marlin, ExLlama, IPEX)
  • Integration with HuggingFace Transformers and vLLM (see the loading sketch after this list)
  • Custom calibration data for domain-specific models
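
For illustration, here is a minimal loading sketch for an already-quantized AWQ checkpoint, assuming the standard HuggingFace Transformers and vLLM APIs; the checkpoint path is a placeholder:

# Load an AWQ checkpoint with Transformers (the quantization method is
# read from the checkpoint's config; requires the autoawq package).
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "path/to/model-awq"  # placeholder
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Alternatively, serve the same checkpoint with vLLM's AWQ kernels.
from vllm import LLM, SamplingParams

llm = LLM(model=quant_path, quantization="awq")
outputs = llm.generate(["Explain AWQ in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)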

Use Cases

  • Deploying large models (7B-70B) on limited GPU memory
  • Achieving faster inference than GPTQ with better accuracy preservation
  • Quantizing instruction-tuned and multimodal models
  • Optimizing LLM serving for production environments

Non-Goals

  • Providing a general-purpose LLM training framework
  • Replacing fine-tuning or other model adaptation techniques
  • Supporting quantization methods other than 4-bit AWQ

Workflow

  1. Load model and tokenizer
  2. Define quantization configuration (bits, group size, kernel version)
  3. Quantize the model using calibration data
  4. Save the quantized model and tokenizer
  5. Load and use the quantized model for inference (a code sketch of steps 1-4 follows this list)
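
As a concrete illustration of steps 1-4, here is a minimal sketch assuming the skill builds on the AutoAWQ package; the base model, output path, and calibration texts are illustrative:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative base model
quant_path = "mistral-7b-instruct-awq"             # output directory

# Step 2: 4-bit weights, group size 128, GEMM kernel (common AutoAWQ settings).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Step 1: load the full-precision model and tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Step 3: quantize; calib_data accepts domain-specific text samples
# (AutoAWQ uses a default calibration set when it is omitted).
calib_data = ["Illustrative domain-specific calibration text."] * 32
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)

# Step 4: save the quantized weights and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Step 5 then follows the loading sketch shown under Features above.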

Practices

  • Model Optimization
  • Quantization Techniques
  • LLM Deployment

Prerequisites

  • Python 3.8+
  • CUDA 11.8+ (for NVIDIA GPUs)
  • Compute Capability 7.5+ GPU (NVIDIA Turing or newer)
  • transformers>=4.45.0
  • torch>=2.0.0

Installation

Add the Marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
View source code

Similar Extensions

Wrap Up Ritual

100

End-of-session ritual that audits changes, runs quality checks, captures learnings, and produces a session summary. Use when saying "wrap up", "done for the day", "finish coding", or ending a coding session.

Skill
rohitg00

TradeMemory Protocol

100

Domain knowledge for the Evolution Engine: enables LLMs to autonomously discover strategies from raw OHLCV data. Covers the generate-backtest-select-evolve loop, vectorized backtesting, out-of-sample validation, and policy gradients. Use when discovering trading patterns, running backtests, evolving strategies, or reviewing evolution logs. Triggered by "evolve", "discover patterns", "backtest", "evolution", "strategy generation", "candidate strategy".

Skill
mnemox-ai

Arize Prompt Optimization

100

Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.

Skill
github

Unsloth

100

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

Skill
davila7

Prompt Optimization

100

Applies prompt repetition to improve the accuracy of non-reasoning LLMs.

Skill
asklokesh

Vector Index Tuning

99

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

Skill
wshobson