
HQQ Quantization

Skill · Verified · Active

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Purpose

To enable efficient LLM deployment by quantizing models to lower bit precision without calibration data, yielding faster inference and a reduced memory footprint.
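
As a concrete illustration of the calibration-free workflow: the HuggingFace Transformers integration exposes an HqqConfig that quantizes weights on the fly as the model loads. A minimal sketch, assuming a Llama-style model; the model ID and the nbits/group_size settings are illustrative assumptions, not recommendations:

    # Minimal sketch: HQQ 4-bit quantization on load via Transformers.
    # No calibration dataset is required; weights are quantized while loading.
    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example model

    quant_config = HqqConfig(nbits=4, group_size=64)  # illustrative settings

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer(
        "HQQ quantizes weights in minutes because", return_tensors="pt"
    ).to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))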

Features

  • Calibration-free LLM quantization (4/3/2-bit)
  • Multiple optimized inference backends (Marlin, TorchAO, ATen, etc.)
  • Seamless integration with HuggingFace Transformers and vLLM
  • Support for fine-tuning quantized models with PEFT/LoRA (see the sketch after this list)
  • Fast quantization workflows (minutes vs. hours)
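
Since the list above mentions PEFT/LoRA, here is a hedged sketch of attaching LoRA adapters to an HQQ-quantized model. It assumes the peft package alongside the HqqConfig path shown earlier; the rank, alpha, and Llama-style target module names are assumptions:

    # Sketch: LoRA fine-tuning on top of an HQQ-quantized base model.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, HqqConfig

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",           # hypothetical example model
        device_map="auto",
        quantization_config=HqqConfig(nbits=4, group_size=64),
    )

    lora_config = LoraConfig(
        r=16,                                 # adapter rank (assumption)
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # Llama-style names (assumption)
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # only the adapter weights train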

Use Cases

  • Quantizing LLMs for faster inference without needing calibration datasets.
  • Reducing memory footprint of LLMs for deployment on resource-constrained environments.
  • Integrating quantized models into vLLM or HuggingFace Transformers pipelines.
  • Experimenting with extreme quantization levels (2-bit, 1-bit) for LLMs; a 2-bit sketch follows this list.
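
For the extreme low-bit case, a common pattern is to pair fewer bits with a smaller group size to limit accuracy loss. A minimal sketch with assumed values; 2-bit quality is model-dependent and usually needs evaluation before deployment:

    # Sketch: extreme 2-bit HQQ configuration (values are assumptions).
    from transformers import AutoModelForCausalLM, HqqConfig

    two_bit = HqqConfig(nbits=2, group_size=16)  # smaller groups offset low-bit error

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",  # hypothetical example model
        device_map="auto",
        quantization_config=two_bit,
    )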

Non-Goals

  • Performing calibration-based quantization (e.g., AWQ, GPTQ).
  • Providing CPU-focused quantization (refer to llama.cpp/GGUF).
  • Replacing simple 8-bit/4-bit quantization tools like bitsandbytes for basic use cases.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified
96/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT
View source

Similar Extensions

Implementing LLMs LitGPT

100

Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when you need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.

Skill
davila7

Ray Train

99

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill
Orchestra-Research

Huggingface Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

HQQ Quantization

98

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
Orchestra-Research

vLLM High-Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research