VLLM High Performance LLM Serving

Skill · Verified · Active

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Purpose

Enable users to deploy LLM APIs with high throughput and low latency in production, using vLLM's core features: PagedAttention, continuous batching, quantization, and tensor parallelism.

Features

  • High-throughput LLM serving (see the sketch after this list)
  • Optimized inference latency
  • Efficient memory usage with PagedAttention
  • OpenAI-compatible API endpoint
  • Support for quantization (AWQ, GPTQ, FP8)
  • Tensor parallelism for distributed serving

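A minimal sketch of what the features above look like in vLLM's Python API; the model name is a placeholder for any checkpoint you can access. PagedAttention and continuous batching are applied by the engine automatically rather than configured by hand.

```python
from vllm import LLM, SamplingParams

# The model name is an assumption; substitute any checkpoint you have access to.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# Submitting many prompts at once lets the scheduler batch them continuously;
# PagedAttention manages the KV cache in fixed-size blocks behind the scenes.
prompts = [f"Summarize topic {i} in one sentence." for i in range(32)]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text)
```
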
Use cases

  • Deploying production LLM APIs
  • Optimizing inference latency and throughput
  • Serving large models with limited GPU memory
  • Building multi-user applications such as chatbots (client sketch after this list)

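For the chatbot use case, a hedged client sketch against a vLLM server exposing the OpenAI-compatible endpoint (started, for example, with `vllm serve <model>`); the base URL, API key, and model name are illustrative assumptions.

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI API, so the stock client works unmodified.
# base_url and api_key are assumptions for a local, unauthenticated server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Give me three onboarding tips."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```
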
Non-goals

  • CPU-based inference
  • Research or prototyping with basic transformer implementations
  • Absolute maximum-performance, NVIDIA-only inference (TensorRT-LLM is the better fit there)
  • Fine-tuning or training models

Practices

  • Production deployment
  • Performance optimization
  • Quantization
  • Distributed serving (combined with quantization in the sketch after this list)

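A sketch combining two of these practices, quantization and distributed serving; the AWQ checkpoint name and the two-GPU layout are assumptions for illustration.

```python
from vllm import LLM

# quantization="awq" expects an already-quantized AWQ checkpoint;
# tensor_parallel_size=2 shards the weights across two visible GPUs.
llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,  # fraction of each GPU's VRAM the engine may claim
)
```

GPTQ and FP8 checkpoints follow the same pattern through the same `quantization` argument.
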
Prerequisites

  • NVIDIA GPU with CUDA installed
  • Python environment
  • vLLM library installed (a quick environment check follows this list)

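A quick sanity check, as a minimal sketch, that all three prerequisites are in place before starting a server:

```python
import torch

# Verifies the CUDA stack that vLLM depends on is visible to PyTorch.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
print(torch.cuda.get_device_name(0), "| CUDA", torch.version.cuda)

import vllm  # raises ImportError if the library is not installed

print("vLLM", vllm.__version__)
```
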
Execution

  • Info: Pinned dependencies. The SKILL.md lists `dependencies: [vllm, torch, transformers]`, but it does not pin interpreter or package versions and declares no side-effect headers for any bundled scripts; the installation instructions simply point to `pip install vllm`. The version-recording sketch below is one way to compensate.

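Because versions are not pinned upstream, a small sketch for recording what `pip install vllm` actually resolved, so a deployment can be reproduced later:

```python
from importlib.metadata import version

# Print pin-ready lines for the dependencies named in SKILL.md.
for pkg in ("vllm", "torch", "transformers"):
    print(f"{pkg}=={version(pkg)}")
```
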
Installation

Add the Marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified
97/100
Analyzed 1 day ago

Trust signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar extensions

TensorRT-LLM

98

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
Orchestra-Research

HQQ Quantization

98

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
Orchestra-Research

HQQ Quantization

96

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
davila7

Llama.cpp

95

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
Orchestra-Research

VLLM Inference Serving

93

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
davila7

Llama.cpp

85

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
davila7