
HQQ Quantization

Skill: verified, active

Part of: Agent Native Research Artifact (ARA) Tooling

Half-Quadratic Quantization (HQQ) for LLMs, with no calibration data required. Use it to quantize models to 4-, 3-, or 2-bit precision without a calibration dataset, to run fast quantization workflows, or to deploy with vLLM or HuggingFace Transformers.

Purpose

Lets users quantize large language models quickly and without calibration data, reducing model size and memory footprint for faster inference and deployment.
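As a rough illustration of the size reduction, the sketch below estimates weight storage at different bit widths. It assumes group-wise quantization in which every group of weights carries one scale and one zero-point stored at 16 bits each; the function names, the default group size of 64, and the 7B parameter count are illustrative assumptions, not figures from this listing. Under those assumptions the metadata adds 32/64 = 0.5 bits per weight.

```python
def fp16_weight_gib(n_params: int) -> float:
    """Size in GiB of unquantized weights stored at 2 bytes per parameter."""
    return n_params * 2 / 2**30


def quantized_weight_gib(n_params: int, nbits: int, group_size: int = 64,
                         meta_bits: int = 16) -> float:
    """Approximate size in GiB of quantized weights plus per-group metadata.

    Assumption: each group of `group_size` weights stores one scale and one
    zero-point, each at `meta_bits` bits. Real formats may differ.
    """
    if nbits <= 0 or group_size <= 0:
        raise ValueError("nbits and group_size must be positive")
    n_groups = -(-n_params // group_size)  # ceiling division
    total_bits = n_params * nbits + n_groups * 2 * meta_bits
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model
    print(f"fp16 : {fp16_weight_gib(n):.2f} GiB")
    for bits in (8, 4, 3, 2, 1):
        print(f"{bits}-bit: {quantized_weight_gib(n, bits):.2f} GiB")
```

The estimate covers weights only; activations, the KV cache, and any layers left unquantized are not included.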

Features

Calibration-free quantization for LLMs
Supports 8/4/3/2/1-bit precision
Multiple optimized inference backends (Marlin, TorchAO, etc.)
Seamless integration with HuggingFace Transformers and vLLM
Compatibility with PEFT/LoRA for fine-tuning quantized models
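To make "calibration-free" concrete, the sketch below quantizes weights group by group using nothing but the weights themselves: each group's minimum and maximum determine a scale and zero-point, and no input data is consulted. This is plain round-to-nearest affine quantization, shown as a baseline; HQQ additionally refines the quantization parameters with a half-quadratic solver, which is not reproduced here. All function names are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, zero-point)


def quantize_group(weights: List[float], nbits: int) -> Group:
    """Quantize one group to unsigned nbits-wide codes from its min and max."""
    qmax = 2**nbits - 1
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    scale = span / qmax if span > 0 else 1.0  # avoid dividing by zero
    zero = -w_min / scale
    # Round to the nearest code and clamp into the representable range.
    codes = [min(qmax, max(0, round(w / scale + zero))) for w in weights]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero) * scale for q in codes]


def quantize(weights: List[float], nbits: int = 4,
             group_size: int = 64) -> List[Group]:
    """Split a flat weight list into groups and quantize each independently."""
    return [quantize_group(weights[i:i + group_size], nbits)
            for i in range(0, len(weights), group_size)]


if __name__ == "__main__":
    w = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25, 0.75, -0.75]
    for codes, scale, zero in quantize(w, nbits=4, group_size=4):
        print(codes, dequantize_group(codes, scale, zero))
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero-points, which is the usual trade-off behind a group-size setting.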

Use Cases

Quantizing LLMs to 4-bit precision without needing calibration datasets
Performing fast quantization workflows for model compression
Deploying quantized LLMs with vLLM or HuggingFace Transformers
Fine-tuning quantized LLMs using PEFT and LoRA
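For the HuggingFace Transformers path, a minimal sketch might look like the following. It assumes a Transformers release that ships `HqqConfig`, an installed `hqq` package, and a CUDA GPU. The `load_hqq_model` helper and `SUPPORTED_BITS` constant are hypothetical names introduced for illustration, the bit-width check simply mirrors the precisions listed for this skill, and the group size of 64 is an assumed default rather than a recommendation from this listing.

```python
SUPPORTED_BITS = (8, 4, 3, 2, 1)  # bit widths listed for this skill


def load_hqq_model(model_id: str, nbits: int = 4, group_size: int = 64):
    """Load a causal LM and quantize its weights with HQQ during loading.

    Hypothetical helper: no calibration dataset is passed anywhere, only
    the bit width and group size.
    """
    if nbits not in SUPPORTED_BITS:
        raise ValueError(f"nbits must be one of {SUPPORTED_BITS}, got {nbits}")

    # Imported lazily so this module can be imported without the heavy
    # dependencies installed.
    import torch
    from transformers import AutoModelForCausalLM, HqqConfig

    quant_config = HqqConfig(nbits=nbits, group_size=group_size)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,
    )
```

Calling `load_hqq_model` with the ID of a model you have access to should return a model whose linear layers are quantized on load; the exact memory savings and accuracy depend on the model and settings.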

Non-Goals

Providing calibration-based quantization methods like AWQ or GPTQ
Performing model training from scratch
Serving models directly (relies on integration with frameworks like vLLM)

Installation

Add the marketplace first

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified

98/100

Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com


Similar Extensions

Arize Prompt Optimization

100

Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.

Skill

github

Unsloth

100

Expert guidance for fast fine-tuning with Unsloth: 2-5x faster training, 50-80% less memory, and LoRA/QLoRA optimization.

Skill

davila7

Prompt Optimization

100

Applies prompt repetition to improve the accuracy of non-reasoning LLMs.

Skill

asklokesh

Vector Index Tuning

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

Skill

wshobson

vLLM High Performance LLM Serving

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill

Orchestra-Research

Quantizing Models Bitsandbytes

Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, when you need to fit larger models, or when you want faster inference. Supports INT8, NF4, and FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.

Skill

Orchestra-Research