GGUF Quantization

Skill Active

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware or Apple Silicon, or when you need flexible 2- to 8-bit quantization that does not require a GPU.

Purpose

Enables efficient deployment of large language models on consumer hardware by converting them to GGUF format and quantizing them flexibly with llama.cpp.

Features

  • GGUF model conversion from Hugging Face
  • Flexible quantization from 2-8 bit (K-quants, iMatrix)
  • CPU, Apple Silicon (Metal), and NVIDIA (CUDA) inference optimization
  • Python bindings via llama-cpp-python
  • OpenAI-compatible server mode
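
As a rough illustration of what the 2-8 bit range means for memory, the sketch below estimates weight storage from a parameter count and an average number of bits per weight. The function name and the 7B example are illustrative assumptions; real GGUF files mix tensor types and carry metadata, so actual file sizes will differ from these figures.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    # bits -> bytes (divide by 8), bytes -> GB (divide by 1e9)
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for bits in (16, 8, 4, 2):
        size = estimate_size_gb(n_params, bits)
        print(f"{bits:>2} bits/weight: ~{size:.1f} GB")
```

Under this simplification, going from FP16 to an average of 4 bits per weight cuts storage to about a quarter, which is usually what makes a model fit in laptop RAM.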

Use Cases

  • Deploying LLMs on consumer hardware (laptops, desktops)
  • Running models efficiently on Apple Silicon with Metal
  • Quantizing models flexibly without requiring a GPU
  • Integrating local LLM inference into Python applications
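
For the Python integration case, a minimal sketch using llama-cpp-python might look like the following. It assumes the package is installed (`pip install llama-cpp-python`) and that a quantized GGUF file already exists; the helper names `build_llama_kwargs` and `generate`, the default context size, and any model path passed in are hypothetical.

```python
from typing import Any, Dict


def build_llama_kwargs(model_path: str, use_gpu: bool = False,
                       n_ctx: int = 2048) -> Dict[str, Any]:
    """Collect constructor arguments for llama_cpp.Llama."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,
        # -1 asks llama-cpp-python to offload all layers to the GPU
        # (Metal or CUDA builds); 0 keeps inference on the CPU.
        "n_gpu_layers": -1 if use_gpu else 0,
        "verbose": False,
    }


def generate(model_path: str, prompt: str, max_tokens: int = 64,
             use_gpu: bool = False) -> str:
    """Load a GGUF model and return the text of one completion."""
    # Imported lazily so the helper above works without the package installed.
    from llama_cpp import Llama

    llm = Llama(**build_llama_kwargs(model_path, use_gpu))
    result = llm(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]
```

A call such as `generate("./model-Q4_K_M.gguf", "Q: What is GGUF? A:")` would then return the generated text, where the file name is a placeholder for whatever the quantization step produced.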

Non-Goals

  • Providing pre-quantized models
  • Training or fine-tuning LLMs
  • Offering a GUI for model selection or management

Workflow

  1. Clone llama.cpp repository
  2. Build llama.cpp with desired hardware acceleration (CPU, CUDA, Metal)
  3. Convert the Hugging Face model to GGUF format (FP16)
  4. Quantize GGUF model using various methods (K-quants, iMatrix)
  5. Run inference via CLI, Python bindings, or server mode
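
The steps above can be sketched as a list of commands. This is an illustration under assumptions, not the skill's own script: it assumes a recent llama.cpp checkout that builds with CMake and ships `convert_hf_to_gguf.py`, `llama-quantize`, and `llama-cli` (script names, binary names, and build flags have changed between releases, so check them against your checkout). The function name, output file names, and prompt are hypothetical, and the importance-matrix step is omitted.

```python
import shlex
from typing import List


def workflow_commands(hf_model_dir: str, quant_type: str = "Q4_K_M",
                      backend: str = "cpu") -> List[List[str]]:
    """Return the workflow as argument lists suitable for subprocess.run."""
    # CMake options per backend (assumed flag names for recent llama.cpp).
    backend_flags = {
        "cpu": [],
        "cuda": ["-DGGML_CUDA=ON"],
        "metal": ["-DGGML_METAL=ON"],
    }
    if backend not in backend_flags:
        raise ValueError(f"unknown backend: {backend!r}")

    f16 = "model-f16.gguf"                  # hypothetical output names
    quantized = f"model-{quant_type}.gguf"
    return [
        # 1. Clone
        ["git", "clone", "https://github.com/ggml-org/llama.cpp"],
        # 2. Configure and build
        ["cmake", "-S", "llama.cpp", "-B", "llama.cpp/build",
         *backend_flags[backend]],
        ["cmake", "--build", "llama.cpp/build", "--config", "Release"],
        # 3. Convert the Hugging Face model directory to FP16 GGUF
        ["python", "llama.cpp/convert_hf_to_gguf.py", hf_model_dir,
         "--outfile", f16, "--outtype", "f16"],
        # 4. Quantize
        ["llama.cpp/build/bin/llama-quantize", f16, quantized, quant_type],
        # 5. Run a short CLI generation
        ["llama.cpp/build/bin/llama-cli", "-m", quantized,
         "-p", "Hello", "-n", "32"],
    ]


if __name__ == "__main__":
    for cmd in workflow_commands("./my-hf-model"):
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings keeps paths with spaces safe and lets each step be passed to `subprocess.run(cmd, check=True)` individually.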

Practices

  • Model Quantization
  • Performance Optimization
  • Inference Engineering

Prerequisites

  • Git
  • Make
  • C++ compiler (GCC/Clang)
  • Python 3.8+
  • CUDA Toolkit (for NVIDIA GPU builds)
  • Hugging Face models
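
A quick way to check most of this list before building is to look for the tools on `PATH`. The sketch below is a convenience check under assumptions: the function name and dictionary keys are hypothetical, and it uses `nvcc` as a stand-in for the CUDA Toolkit, which only matters for NVIDIA GPU builds.

```python
import shutil
import sys
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Report which prerequisite tools are visible on PATH."""
    return {
        "git": shutil.which("git") is not None,
        "make": shutil.which("make") is not None,
        # Either GCC's or Clang's C++ driver satisfies the compiler requirement.
        "c++ compiler": any(shutil.which(c) is not None
                            for c in ("g++", "clang++")),
        "python>=3.8": sys.version_info >= (3, 8),
        # Optional: only needed for NVIDIA GPU builds.
        "nvcc (optional)": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name:<16} {'ok' if found else 'missing'}")
```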

Trust

  • Warning (issues need attention): In the last 90 days, 17 issues were opened and 4 were closed, indicating a slow response rate to open issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx — needs Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

95/100
Analyzed about 20 hours ago

Trust Signals

Last commit: about 22 hours ago
Stars: 27.2k
License: MIT

Similar Extensions

GGUF Quantization

98

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

Skill
Orchestra-Research

Llama Cpp

95

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
Orchestra-Research

Llama Cpp

85

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
davila7

Hugging Face Local Models

95

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

Skill
huggingface

Cli Anything Quietshrink

99

Compress macOS screen recordings with zero CPU stress using Apple Silicon's hardware HEVC encoder. Typically reduces file size 70-90% while staying visually lossless. Computer stays silent during encoding.

Skill
hkuds

Huggingface Llm Trainer

99

Train or fine-tune language and vision models using TRL (Transformer Reinforcement Learning) or Unsloth with Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, model selection/leaderboards and model persistence. Use for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.

Skill
huggingface

© 2025 SkillRepo · Find the right skill, skip the noise.