AWQ Quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

Purpose

To enable efficient deployment of large language models on resource-constrained hardware by compressing model weights with minimal performance degradation.

Features

  • Activation-aware weight quantization for 4-bit LLMs
  • Minimal accuracy loss (<5%)
  • Significant inference speedup (~2.5-3x)
  • Support for various kernel backends (GEMM, GEMV, Marlin, ExLlama, IPEX)
  • Integration with HuggingFace Transformers and vLLM (see the loading sketch after this list)
  • Custom calibration data for domain-specific models
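
As a rough illustration of the Transformers integration noted above, the sketch below loads an already-quantized AWQ checkpoint. It is a minimal example under stated assumptions, not the skill's prescribed usage: the model ID is a placeholder, and autoawq is assumed to be installed alongside transformers.

# Minimal sketch (assumptions: autoawq installed, placeholder model ID)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # placeholder AWQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))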

Use Cases

  • Deploying large models (7B-70B) on limited GPU memory
  • Achieving faster inference than GPTQ with better accuracy preservation
  • Quantizing instruction-tuned and multimodal models
  • Optimizing LLM serving for production environments
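
For the production-serving use case, a minimal vLLM sketch follows. It assumes vLLM is installed and that the model path points at a directory containing an AWQ-quantized checkpoint; the path here is a placeholder.

# Minimal sketch (assumptions: vLLM installed, placeholder model path)
from vllm import LLM, SamplingParams

llm = LLM(model="mistral-7b-instruct-awq", quantization="awq")  # use AWQ kernels
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize the benefits of 4-bit quantization."], sampling)
print(outputs[0].outputs[0].text)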

Non-Goals

  • Providing a general-purpose LLM training framework
  • Replacing fine-tuning or other model adaptation techniques
  • Supporting quantization methods other than 4-bit AWQ

Workflow

  1. Load model and tokenizer
  2. Define quantization configuration (bits, group size, kernel version)
  3. Quantize the model using calibration data
  4. Save the quantized model and tokenizer
  5. Load and use the quantized model for inference
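
A minimal sketch of this workflow using the AutoAWQ library is shown below. The model and output paths are placeholders, and the comment on step 3 notes where custom calibration data could be supplied.

# Minimal sketch (assumptions: autoawq installed, placeholder paths)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder base model
quant_path = "mistral-7b-instruct-awq"             # placeholder output directory

# 1. Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 2. Define quantization configuration (bits, group size, kernel version)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# 3. Quantize with calibration data (AutoAWQ uses a default calibration set;
#    a list of domain-specific strings can be passed via calib_data instead)
model.quantize(tokenizer, quant_config=quant_config)

# 4. Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# 5. Reload the quantized model for inference
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)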

Practices

  • Model Optimization
  • Quantization Techniques
  • LLM Deployment

Prerequisites

  • Python 3.8+
  • CUDA 11.8+ (for NVIDIA GPUs)
  • Compute Capability 7.5+ GPU (NVIDIA Turing or newer)
  • transformers>=4.45.0
  • torch>=2.0.0

Installation

First add the marketplace, then install the skill:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified · 95/100 (analyzed about 24 hours ago)

Trust Signals

  • Last commit: 17 days ago
  • Stars: 8.3k
  • License: MIT

Similar Extensions

Wrap Up Ritual (Skill by rohitg00 · score 100)

End-of-session ritual that audits changes, runs quality checks, captures learnings, and produces a session summary. Use when saying "wrap up", "done for the day", "finish coding", or ending a coding session.

TradeMemory Protocol (Skill by mnemox-ai · score 100)

Domain knowledge for the Evolution Engine — LLM-powered autonomous strategy discovery from raw OHLCV data. Covers the generate-backtest-select-evolve loop, vectorized backtesting, out-of-sample validation, and strategy graduation. Use when discovering trading patterns, running backtests, evolving strategies, or reviewing evolution logs. Triggers on "evolve", "discover patterns", "backtest", "evolution", "strategy generation", "candidate strategy".

Arize Prompt Optimization (Skill by github · score 100)

Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.

Unsloth (Skill by davila7 · score 100)

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

Prompt Optimization (Skill by asklokesh · score 100)

Applies prompt repetition to improve accuracy for non-reasoning LLMs

Vector Index Tuning (Skill by wshobson · score 99)

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.
