GGUF Quantization

Skill Verified Active

Part of: Agent Native Research Artifact (ARA) Tooling

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware or Apple Silicon, or when you need flexible 2- to 8-bit quantization without a GPU requirement.

Purpose

Guides users through preparing and running AI models with the GGUF format and llama.cpp for efficient inference on a range of hardware.

Features

GGUF format conversion and quantization
llama.cpp build and usage instructions
Detailed quantization type explanations
Python bindings and server mode examples
Hardware-specific optimization guides (CPU, Metal, CUDA)

Use cases

Deploying LLMs on consumer hardware with limited VRAM
Running models efficiently on Apple Silicon with Metal acceleration
Achieving flexible quantization from 2 to 8 bits without GPU requirements
Integrating llama.cpp into custom applications or workflows
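
To make the memory trade-off behind these use cases concrete, here is a rough back-of-the-envelope sketch of how weight size scales with bit width. The helper name `estimate_weight_size_gib` and the 7-billion-parameter example are illustrative assumptions, not part of the skill. Real GGUF quantization types also store per-block scales, so their effective bits per weight sit somewhat above the nominal figure, and the KV cache and activations need additional memory on top.

```python
def estimate_weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Real quantized files are somewhat larger because of per-block metadata.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 7B-parameter model at several nominal bit widths.
    for bits in (16, 8, 4, 2):
        size = estimate_weight_size_gib(7e9, bits)
        print(f"{bits:>2}-bit: ~{size:.1f} GiB")
```

Halving the bit width halves the estimate, which is why a 4-bit quantization of a model can fit on hardware where the 16-bit original cannot.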

Non-goals

Providing pre-quantized models directly
Covering other quantization formats like AWQ or GPTQ
Detailed LLM architecture explanations beyond inference

Workflow

Install llama.cpp and its dependencies.
Convert a HuggingFace model to GGUF format.
Quantize the GGUF model to a desired bit precision.
Run inference using the quantized model via CLI, Python, or server.
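
The convert, quantize, and run steps above can be sketched as a small Python helper that only assembles the command lines, so they can be inspected before anything is executed. This is a minimal sketch under stated assumptions: the function name, the file names, and the `llama.cpp/build/bin` layout are hypothetical, and the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`, `llama-cli`) reflect recent llama.cpp checkouts; older versions used different names, so check your own build.

```python
import shlex
from pathlib import Path


def build_workflow_commands(hf_model_dir, out_dir, quant_type="Q4_K_M",
                            llama_cpp_dir="llama.cpp"):
    """Return argv lists for convert -> quantize -> run, without executing them.

    Assumptions: llama.cpp is checked out at `llama_cpp_dir` and its binaries
    were built into `build/bin`. Paths and output file names are illustrative.
    """
    root = Path(llama_cpp_dir)
    out = Path(out_dir)
    f16_path = out / "model-f16.gguf"
    quant_path = out / f"model-{quant_type}.gguf"

    # Step 2: convert the HuggingFace checkpoint to an unquantized GGUF file.
    convert = ["python", str(root / "convert_hf_to_gguf.py"), str(hf_model_dir),
               "--outfile", str(f16_path), "--outtype", "f16"]
    # Step 3: quantize the GGUF file to the requested type.
    quantize = [str(root / "build" / "bin" / "llama-quantize"),
                str(f16_path), str(quant_path), quant_type]
    # Step 4: run a short generation from the CLI with the quantized model.
    run = [str(root / "build" / "bin" / "llama-cli"),
           "-m", str(quant_path), "-p", "Hello", "-n", "64"]
    return [convert, quantize, run]


if __name__ == "__main__":
    for cmd in build_workflow_commands("./my-hf-model", "./gguf"):
        print(shlex.join(cmd))
```

Once the printed commands look right for your setup, each list can be passed to `subprocess.run(cmd, check=True)` in order.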

Prerequisites

llama.cpp build environment (compiler, make)
Python 3.8+
HuggingFace models (for conversion)

Installation

Add the marketplace first

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified

98/100

Analyzed 1 day ago

Trust signals

Last commit: 17 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com

Similar extensions

GGUF Quantization

Skill

davila7

Llama Cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill

Orchestra-Research

Llama Cpp

Skill

davila7

Hugging Face Local Models

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

Skill

huggingface

Huggingface Llm Trainer

Train or fine-tune language and vision models using TRL (Transformer Reinforcement Learning) or Unsloth with Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, model selection/leaderboards and model persistence. Use for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.

Skill

huggingface

Vector Index Tuning

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

Skill

wshobson