此内容尚未提供您的语言版本,正在以英文显示。

Llama Cpp

技能活跃

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

目的

To enable efficient and accessible LLM inference on hardware lacking NVIDIA GPUs, making local and edge LLM deployments feasible.

功能

CPU-only inference
Apple Silicon (M1/M2/M3) optimization
AMD/Intel GPU support (non-CUDA)
GGUF quantization (1.5-8 bit)
OpenAI-compatible API server mode

使用场景

Running LLMs on personal Macs or Linux machines
Edge deployments on resource-constrained devices
Local LLM development and testing without GPU hardware
Utilizing models when CUDA is unavailable

非目标

Maximizing throughput on high-end NVIDIA GPUs
Providing a Python-first API like vLLM or TensorRT-LLM
Managing cloud infrastructure for LLM serving

Trust

warning:Issues AttentionIn the last 90 days, 17 issues were opened and 4 were closed, indicating a low closure rate (approx. 23.5%) and potentially slow maintainer response.

安装

npx skills add davila7/claude-code-templates

通过 npx 运行 Vercel skills CLI(skills.sh)— 需要本地安装 Node.js,以及至少一个兼容 skills 的智能体(Claude Code、Cursor、Codex 等)。前提是仓库遵循 agentskills.io 格式。

质量评分

85 /100

1 day ago 分析

信任信号

最近提交1 day ago

GitHub 所有者 davila7

星标27.2k

下载量 23k

许可证MIT

网站aitmpl.com

状态

查看源代码

类似扩展

Llama Cpp

技能

Orchestra-Research

GGUF Quantization

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

技能

davila7

GGUF Quantization

技能

Orchestra-Research

VLLM High Performance LLM Serving

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

技能

Orchestra-Research

Hugging Face Local Models

Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

技能

huggingface

Cli Anything Quietshrink

Compress macOS screen recordings with zero CPU stress using Apple Silicon's hardware HEVC encoder. Typically reduces file size 70-90% while staying visually lossless. Computer stays silent during encoding.

技能

hkuds