Zum Hauptinhalt springen
Dieser Inhalt ist noch nicht in Ihrer Sprache verfügbar und wird auf Englisch angezeigt.

Llama Cpp

Skill Aktiv

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Zweck

To enable efficient and accessible LLM inference on hardware lacking NVIDIA GPUs, making local and edge LLM deployments feasible.

Funktionen

  • CPU-only inference
  • Apple Silicon (M1/M2/M3) optimization
  • AMD/Intel GPU support (non-CUDA)
  • GGUF quantization (1.5-8 bit)
  • OpenAI-compatible API server mode

Anwendungsfälle

  • Running LLMs on personal Macs or Linux machines
  • Edge deployments on resource-constrained devices
  • Local LLM development and testing without GPU hardware
  • Utilizing models when CUDA is unavailable

Nicht-Ziele

  • Maximizing throughput on high-end NVIDIA GPUs
  • Providing a Python-first API like vLLM or TensorRT-LLM
  • Managing cloud infrastructure for LLM serving

Trust

  • warning:Issues AttentionIn the last 90 days, 17 issues were opened and 4 were closed, indicating a low closure rate (approx. 23.5%) and potentially slow maintainer response.

Installation

npx skills add davila7/claude-code-templates

Führt das Vercel skills CLI (skills.sh) via npx aus — benötigt Node.js lokal und mindestens einen installierten skills-kompatiblen Agent (Claude Code, Cursor, Codex, …). Setzt voraus, dass das Repo dem agentskills.io-Format folgt.

Qualitätspunktzahl

85 /100
Analysiert about 23 hours ago

Vertrauenssignale

Letzter Commit1 day ago
Sterne27.2k
LizenzMIT
Status
Quellcode ansehen

Ähnliche Erweiterungen

Llama Cpp

95

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
Orchestra-Research

GGUF Quantization

95

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

Skill
davila7

GGUF Quantization

98

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

Skill
Orchestra-Research

VLLM High Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research

Hugging Face Local Models

95

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

Skill
huggingface

Cli Anything Quietshrink

99

Compress macOS screen recordings with zero CPU stress using Apple Silicon's hardware HEVC encoder. Typically reduces file size 70-90% while staying visually lossless. Computer stays silent during encoding.

Skill
hkuds