AWQ Quantization
Activation-aware weight quantization for 4-bit LLM compression with ~3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
To enable efficient deployment of large language models on resource-constrained hardware by compressing model weights with minimal performance degradation.
Features
- Activation-aware weight quantization for 4-bit LLMs
- Minimal accuracy loss (<5%)
- Significant inference speedup (~2.5-3x)
- Support for various kernel backends (GEMM, GEMV, Marlin, ExLlama, IPEX)
- Integration with HuggingFace Transformers and vLLM (see the serving sketch after this list)
- Custom calibration data for domain-specific models
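The Transformers and vLLM integrations consume the quantized checkpoint directly. Below is a minimal serving sketch, not the skill's own code: it assumes an AWQ-quantized model saved at a hypothetical local path ./mistral-7b-awq (produced by the workflow further down) and the autoawq package installed alongside vLLM. Both loading paths are shown side by side for illustration only.

```python
# Minimal serving sketch for an AWQ-quantized checkpoint (path is hypothetical).

# Option 1: HuggingFace Transformers loads AWQ weights directly when autoawq is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "./mistral-7b-awq"  # assumed output of the quantization workflow below
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("What is activation-aware quantization?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Option 2: vLLM for production serving with its AWQ kernel path.
# (In practice you would use one of the two options, not both in the same process.)
from vllm import LLM, SamplingParams

llm = LLM(model=quant_path, quantization="awq")
outputs = llm.generate(
    ["What is activation-aware quantization?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```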
Use Cases
- Deploying large models (7B-70B) on limited GPU memory
- Achieving faster inference than GPTQ with better accuracy preservation
- Quantizing instruction-tuned and multimodal models
- Optimizing LLM serving for production environments
Non-Goals
- Providing a general-purpose LLM training framework
- Replacing fine-tuning or other model adaptation techniques
- Supporting quantization methods other than 4-bit AWQ
Workflow
- Load model and tokenizer
- Define quantization configuration (bits, group size, kernel version)
- Quantize the model using calibration data
- Save the quantized model and tokenizer
- Load and use the quantized model for inference (a sketch of the full workflow follows)
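A minimal sketch of these steps, assuming the skill wraps the AutoAWQ library (the awq Python package); the base model name, output directory, and quant_config values are illustrative placeholders, and the calib_data keyword mentioned in the comment is assumed to be AutoAWQ's hook for custom calibration sets.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative base model
quant_path = "./mistral-7b-awq"                    # illustrative output directory

# 1. Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 2. Define quantization configuration (bits, group size, kernel version)
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

# 3. Quantize the model using calibration data
#    (a built-in calibration set is used by default; a calib_data argument
#     can be passed for domain-specific calibration)
model.quantize(tokenizer, quant_config=quant_config)

# 4. Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# 5. The saved checkpoint can then be loaded for inference
#    (see the serving sketch under Features above).
```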
Practices
- Model Optimization
- Quantization Techniques
- LLM Deployment
Prerequisites
- Python 3.8+
- CUDA 11.8+ (for NVIDIA GPUs)
- Compute Capability 7.5+ GPU (NVIDIA Turing or newer)
- transformers>=4.45.0
- torch>=2.0.0
Installation
First, add the marketplace:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the plugin:
/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
Wrap Up Ritual (score: 100)
End-of-session ritual that audits changes, runs quality checks, captures learnings, and produces a session summary. Use when saying "wrap up", "done for the day", "finish coding", or ending a coding session.
TradeMemory Protocol (score: 100)
Domain knowledge for the Evolution Engine: LLM-powered autonomous strategy discovery from raw OHLCV data. Covers the generate-backtest-select-evolve loop, vectorized backtesting, out-of-sample validation, and strategy graduation. Use when discovering trading patterns, running backtests, evolving strategies, or reviewing evolution logs. Triggers on "evolve", "discover patterns", "backtest", "evolution", "strategy generation", "candidate strategy".
Arize Prompt Optimization (score: 100)
Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
Unsloth (score: 100)
Expert guidance for fast fine-tuning with Unsloth: 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization.
Prompt Optimization (score: 100)
Applies prompt repetition to improve accuracy for LLMs without reasoning capability.
Vector Index Tuning (score: 99)
Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.