跳转到主要内容
此内容尚未提供您的语言版本,正在以英文显示。

GGUF Quantization

技能 已验证 活跃

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

目的

To guide users through the process of preparing and running AI models using GGUF format and llama.cpp for efficient inference on various hardware.

功能

  • GGUF format conversion and quantization
  • llama.cpp build and usage instructions
  • Detailed quantization type explanations
  • Python bindings and server mode examples
  • Hardware-specific optimization guides (CPU, Metal, CUDA)

使用场景

  • Deploying LLMs on consumer hardware with limited VRAM
  • Running models efficiently on Apple Silicon with Metal acceleration
  • Achieving flexible quantization from 2-8 bit without GPU requirements
  • Integrating llama.cpp into custom applications or workflows

非目标

  • Providing pre-quantized models directly
  • Covering other quantization formats like AWQ or GPTQ
  • Detailed LLM architecture explanations beyond inference

工作流

  1. Install llama.cpp and its dependencies.
  2. Convert a HuggingFace model to GGUF format.
  3. Quantize the GGUF model to a desired bit precision.
  4. Run inference using the quantized model via CLI, Python, or server.

先决条件

  • llama.cpp build environment (compiler, make)
  • Python 3.8+
  • HuggingFace models (for conversion)

安装

请先添加 Marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

质量评分

已验证
98 /100
1 day ago 分析

信任信号

最近提交17 days ago
星标8.3k
许可证MIT
状态
查看源代码

类似扩展

GGUF Quantization

95

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

技能
davila7

Llama Cpp

95

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

技能
Orchestra-Research

Llama Cpp

85

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

技能
davila7

Hugging Face Local Models

95

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

技能
huggingface

Huggingface Llm Trainer

99

Train or fine-tune language and vision models using TRL (Transformer Reinforcement Learning) or Unsloth with Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, model selection/leaderboards and model persistence. Use for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.

技能
huggingface

Vector Index Tuning

99

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

技能
wshobson