Zum Hauptinhalt springen
Dieser Inhalt ist noch nicht in Ihrer Sprache verfügbar und wird auf Englisch angezeigt.

Hqq Quantization

Skill Verifiziert Aktiv

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Zweck

To enable users to quantize large language models efficiently and without calibration data, significantly reducing model size and memory footprint for faster inference and deployment.

Funktionen

  • Calibration-free quantization for LLMs
  • Supports 8/4/3/2/1-bit precision
  • Multiple optimized inference backends (Marlin, TorchAO, etc.)
  • Seamless integration with HuggingFace Transformers and vLLM
  • Compatibility with PEFT/LoRA for fine-tuning quantized models

Anwendungsfälle

  • Quantizing LLMs to 4-bit precision without needing calibration datasets
  • Performing fast quantization workflows for model compression
  • Deploying quantized LLMs with vLLM or HuggingFace Transformers
  • Fine-tuning quantized LLMs using PEFT and LoRA

Nicht-Ziele

  • Providing calibration-based quantization methods like AWQ or GPTQ
  • Performing model training from scratch
  • Serving models directly (relies on integration with frameworks like vLLM)

Installation

Zuerst Marketplace hinzufügen

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Qualitätspunktzahl

Verifiziert
98 /100
Analysiert 1 day ago

Vertrauenssignale

Letzter Commit17 days ago
Sterne8.3k
LizenzMIT
Status
Quellcode ansehen

Ähnliche Erweiterungen

Arize Prompt Optimization

100

Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.

Skill
github

Unsloth

100

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

Skill
davila7

Prompt Optimization

100

Wendet Prompt-Wiederholung an, um die Genauigkeit für LLMs ohne Schlussfolgerungsfähigkeit zu verbessern

Skill
asklokesh

Vector Index Tuning

99

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

Skill
wshobson

VLLM High Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research

Quantizing Models Bitsandbytes

97

Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.

Skill
Orchestra-Research