Hugging Face Local Models
Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.
This skill helps users select and run local language models with llama.cpp and the GGUF format, covering model discovery, quantization, and serving.
Features
- Find GGUF models on Hugging Face Hub
- Select optimal quantization levels
- Run models with llama-cli and llama-server
- Convert models from Transformers to GGUF
- Support for CPU, Metal, CUDA, and ROCm
Use Cases
- Selecting the best GGUF model for your hardware
- Running LLMs locally for privacy or cost savings
- Experimenting with different model quantizations
- Setting up an OpenAI-compatible local inference server (see the sketch below)
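Once `llama-server` is running (see Workflow below), it exposes an OpenAI-compatible HTTP API. A minimal sketch, assuming a server already listening on port 8080; the `model` field is a placeholder, since the server hosts whichever model it was launched with:

```bash
# Query a running llama-server through its OpenAI-compatible
# /v1/chat/completions endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hi in one word."}]
      }'
```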
Non-Goals
- Training or fine-tuning models
- Managing Hugging Face Hub repositories directly (beyond downloading)
- Providing a full GUI for model management
Workflow
- Search the Hugging Face Hub for llama.cpp-compatible GGUF models (first sketch below).
- Identify the recommended quant and exact file from the model's page or the Hub tree API (second sketch below).
- Install `llama.cpp`, or confirm it is already available (third sketch below).
- Launch the model with `llama-cli` or `llama-server` and the appropriate flags (fourth sketch below).
- If no pre-quantized GGUF is available, convert the model from Transformers format to GGUF (fifth sketch below).
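A minimal sketch of the first step, using the public Hub models API (it accepts `search`, `filter`, `sort`, and `limit` query parameters); the search term and `jq` filter here are illustrative:

```bash
# List popular GGUF repos matching a search term, sorted by downloads.
curl -s "https://huggingface.co/api/models?search=qwen&filter=gguf&sort=downloads&limit=5" \
  | jq -r '.[].id'
```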
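For the second step, the Hub tree API lists every file in a repo with its size, which is enough to pick a quant; the repo id below is illustrative, and Q4_K_M is a common default trade-off between quality and size:

```bash
# Show the .gguf files in a repo together with their sizes in bytes.
REPO="bartowski/Llama-3.2-1B-Instruct-GGUF"   # illustrative repo id
curl -s "https://huggingface.co/api/models/$REPO/tree/main" \
  | jq -r '.[] | select(.path | endswith(".gguf")) | "\(.path)\t\(.size)"'
```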
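For the third step, a sketch of the two usual install paths; build flag names have changed across llama.cpp versions, so check the repo's build docs for your checkout:

```bash
# macOS: Homebrew ships llama-cli and llama-server.
brew install llama.cpp

# Elsewhere (or for CUDA/ROCm builds): build from source.
# Metal is enabled by default on Apple Silicon; recent checkouts
# take -DGGML_CUDA=ON for NVIDIA GPUs.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```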
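For the fourth step, a sketch of both entry points; the model ids and file name are illustrative, and the `-hf` flag (which pulls a GGUF straight from the Hub) requires a recent llama.cpp build:

```bash
# One-off prompt, fetching the GGUF directly from the Hub.
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF -p "Hello"

# Serve a local file on port 8080; -c sets the context size and
# -ngl 99 offloads all layers to the GPU (Metal/CUDA/ROCm).
llama-server -m ./Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --port 8080 -c 4096 -ngl 99
```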
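And for the fifth step, a sketch of the Transformers-to-GGUF path using the conversion script shipped in the llama.cpp repo; the model id is illustrative, and gated models additionally require `hf auth login`:

```bash
# 1. Download the original Transformers weights.
hf download Qwen/Qwen2.5-0.5B-Instruct --local-dir ./src-model

# 2. Convert to an f16 GGUF with llama.cpp's conversion script.
python convert_hf_to_gguf.py ./src-model --outfile model-f16.gguf --outtype f16

# 3. Quantize to a smaller variant (Q4_K_M is a common default).
llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```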
Prerequisites
- llama.cpp installed
- Python 3
- Hugging Face Hub CLI (optional, for authentication; see the sketch below)
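A sketch of the optional authentication step, using the `hf` CLI that replaces the deprecated `huggingface-cli`; this is only needed for gated or private repos:

```bash
# Opens a token prompt; the token is stored locally for later downloads.
hf auth login
```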
Documentation
- Configuration & parameter reference: the skill details model selection and launch commands, but does not document `llama-cli` or `llama-server` configuration parameters beyond basic flags.
Versioning
- Release management: the skill has no explicit versioning (e.g., semver in frontmatter, a CHANGELOG, or release tags); installation instructions reference the `main` branch.
Practical Utility
- Edge cases: general guidance on quant choice and troubleshooting is provided, but specific failure modes with symptoms and recovery steps are not covered in detail.
Installation
`/plugin install skills@huggingface-skills`
Similar Extensions
GGUF Quantization
GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
Hugging Science
Use when the user is doing AI/ML work in a scientific domain — biology, chemistry, physics, astronomy, climate, genomics, materials science, medicine, ecology, energy, conservation, engineering, mathematics, scientific reasoning, drug discovery, protein design, weather modeling, theorem proving, single-cell, PDE solving, or anything similar. Hugging Science (huggingscience.co) is a curated catalog of scientific datasets, models, blog posts, and interactive Spaces; the `hugging-science` org on Hugging Face hosts community datasets, models, and demo Spaces. This skill helps you discover the right resource AND actually use it — loading datasets via `datasets`, running models via `transformers` or the HF Inference API, calling Spaces like BoltzGen via `gradio_client`, and citing blog posts for methodology. Trigger this skill whenever a user mentions a scientific ML task, asks for "a dataset/model for X" where X is a scientific topic, wants to fine-tune on scientific data, asks about protein / molecule / genome / climate / materials / astronomy / pathology / weather ML, or needs AI tools for research — even if they never say "Hugging Science" explicitly. The catalog is purpose-built for LLM agents (it ships an `llms-full.txt`); prefer it over generic web search for these tasks.
Llama Cpp
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
Hf Cli
Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing models, datasets, spaces, buckets, repos, papers, jobs, and more on the Hugging Face Hub. Use when: handling authentication; managing local cache; managing Hugging Face Buckets; running or scheduling jobs on Hugging Face infrastructure; managing Hugging Face repos; discussions and pull requests; browsing models, datasets and spaces; reading, searching, or browsing academic papers; managing collections; querying datasets; configuring spaces; setting up webhooks; or deploying and managing HF Inference Endpoints. Make sure to use this skill whenever the user mentions 'hf', 'huggingface', 'Hugging Face', 'huggingface-cli', or 'hugging face cli', or wants to do anything related to the Hugging Face ecosystem and to AI and ML in general. Also use for cloud storage needs like training checkpoints, data pipelines, or agent traces. Use even if the user doesn't explicitly ask for a CLI command. Replaces the deprecated `huggingface-cli`.