Skip to main content

Llama Cpp

Skill Verified Active

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Purpose

To enable cost-effective and accessible LLM inference on diverse consumer hardware, including edge devices and Macs, where high-end GPUs are unavailable or undesirable.

Features

  • LLM inference on CPU, Apple Silicon, and consumer GPUs
  • Support for GGUF quantization (1.5-8 bit)
  • 4-10x speedup vs PyTorch on CPU
  • OpenAI-compatible server mode
  • Hardware acceleration (Metal, CUDA, ROCm)

Use Cases

  • Edge device LLM deployment
  • Running LLMs on M1/M2/M3 Macs
  • Inference on AMD or Intel GPUs
  • Development environments where CUDA is unavailable

Non-Goals

  • Training LLMs
  • Utilizing NVIDIA GPUs with CUDA (use TensorRT-LLM instead)
  • Providing a Python-first API for NVIDIA GPUs (use vLLM instead)

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95 /100
Analyzed about 18 hours ago

Trust Signals

Last commit16 days ago
Stars8.3k
LicenseMIT
Status
View Source

Similar Extensions

Llama Cpp

85

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
davila7

GGUF Quantization

95

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

Skill
davila7

GGUF Quantization

98

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

Skill
Orchestra-Research

VLLM High Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research

Hugging Face Local Models

95

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

Skill
huggingface

Cli Anything Quietshrink

99

Compress macOS screen recordings with zero CPU stress using Apple Silicon's hardware HEVC encoder. Typically reduces file size 70-90% while staying visually lossless. Computer stays silent during encoding.

Skill
hkuds

© 2025 SkillRepo · Find the right skill, skip the noise.