
Llama Cpp

Skill · Verified · Active

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Purpose

To enable cost-effective and accessible LLM inference on diverse consumer hardware, including edge devices and Macs, where high-end GPUs are unavailable or undesirable.

Features

  • LLM inference on CPU, Apple Silicon, and consumer GPUs
  • Support for GGUF quantization (1.5-8 bit)
  • 4-10× speedup vs PyTorch on CPU
  • OpenAI-compatible server mode (see the sketch after this list)
  • Hardware acceleration (Metal, CUDA, ROCm)
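
A minimal sketch of the server mode, assuming llama-server is already running locally with a GGUF model loaded and the openai Python package is installed; the port 8080, API key, and model name below are placeholder assumptions:

    # Query a local llama.cpp server through its OpenAI-compatible endpoint.
    from openai import OpenAI

    # A local server needs no real API key; base_url points at llama-server.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="local-model",  # placeholder; the server answers for whichever GGUF it loaded
        messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)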

Use Cases

  • Edge device LLM deployment
  • Running LLMs on M1/M2/M3 Macs (see the sketch after this list)
  • Inference on AMD or Intel GPUs
  • Development environments where CUDA is unavailable
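
For the Mac and CPU scenarios above, a minimal local-generation sketch; it assumes the llama-cpp-python bindings (installed separately via pip) rather than the raw CLI, and the model path is a placeholder:

    # Load a quantized GGUF model and generate text locally; on Apple Silicon,
    # offloaded layers run on the Metal backend, and n_gpu_layers=0 keeps it CPU-only.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/model-q4_k_m.gguf",  # placeholder GGUF path
        n_gpu_layers=-1,  # offload all layers to Metal (or CUDA/ROCm builds)
        n_ctx=4096,       # context window size
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello from an M-series Mac!"}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])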

Non-Goals

  • Training LLMs
  • High-performance serving on NVIDIA GPUs with CUDA (use TensorRT-LLM instead)
  • Providing a Python-first API for NVIDIA GPUs (use vLLM instead)

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status: Active
View source

Similar Extensions

Llama Cpp (score 85) · Skill by davila7
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

GGUF Quantization (score 95) · Skill by davila7
GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

GGUF Quantization (score 98) · Skill by Orchestra-Research
GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

VLLM High Performance LLM Serving (score 97) · Skill by Orchestra-Research
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Hugging Face Local Models (score 95) · Skill by huggingface
Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving.

Cli Anything Quietshrink (score 99) · Skill by hkuds
Compress macOS screen recordings with zero CPU stress using Apple Silicon's hardware HEVC encoder. Typically reduces file size 70-90% while staying visually lossless. Computer stays silent during encoding.