
VLLM High Performance LLM Serving

Skill · Verified · Active

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
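As a rough orientation, a minimal offline-inference sketch with vLLM's Python API might look like the following; the model id and sampling settings are placeholders for illustration, not part of this skill's bundled scripts.

from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are handled internally by the engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

# Prompts submitted together are batched automatically across requests.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)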

Purpose

To enable users to deploy production LLM APIs with high throughput and low latency using vLLM's advanced serving features.

Features

  • High-throughput LLM serving
  • Optimized inference latency
  • Efficient memory usage with PagedAttention
  • OpenAI-compatible API endpoint
  • Support for quantization (AWQ, GPTQ, FP8)
  • Tensor parallelism for distributed serving (see the sketch after this list)
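The quantization and tensor-parallelism features map onto constructor arguments of the same Python API. A hedged sketch, assuming an AWQ-quantized checkpoint and two GPUs are available (both are assumptions for illustration, not requirements of the skill):

from vllm import LLM

# Load an AWQ-quantized checkpoint and shard it across two GPUs.
# The model id is a placeholder; quantization and tensor_parallel_size
# must match the checkpoint format and the hardware actually present.
llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)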

Use Cases

  • Deploying production LLM APIs
  • Optimizing inference latency and throughput
  • Serving large models with limited GPU memory
  • Building multi-user applications like chatbots (see the client example after this list)
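For the production-API and chatbot use cases, clients talk to vLLM's OpenAI-compatible endpoint. A minimal sketch using the openai Python client, assuming a vLLM server is already running locally (host, port, and model name are placeholders):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder, must match the served model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)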

Non-Goals

  • CPU-based inference
  • Research or prototyping with basic transformer implementations
  • NVIDIA-only, maximum-performance inference (TensorRT-LLM is an alternative)
  • Fine-tuning or training models

Practices

  • Production deployment
  • Performance optimization
  • Quantization
  • Distributed serving

Prerequisites

  • NVIDIA GPU with CUDA installed
  • Python environment
  • vLLM library installed

Execution

  • Info: Pinned dependencies. The SKILL.md lists `dependencies: [vllm, torch, transformers]` but does not explicitly declare pinned interpreter versions or side-effect headers for any bundled scripts, although the installation instructions point to `pip install vllm`.

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
97/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

Tensorrt Llm

98

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
Orchestra-Research

Hqq Quantization

98

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
Orchestra-Research

Hqq Quantization

96

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
davila7

Llama Cpp

95

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
Orchestra-Research

VLLM Inference Serving

93

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
davila7

Llama Cpp

85

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
davila7