
VLLM High Performance LLM Serving

Skill · Verified · Active

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
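As a rough orientation, a minimal offline-inference sketch with vLLM's Python API might look like the following; the model id and sampling settings are placeholders for illustration, not part of this skill's bundled scripts.

from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are handled internally by the engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

# Prompts submitted together are batched automatically across requests.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)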

Purpose

To enable users to deploy production LLM APIs with high throughput and low latency using vLLM's advanced serving features.

Features

  • High-throughput LLM serving
  • Optimized inference latency
  • Efficient memory usage with PagedAttention
  • OpenAI-compatible API endpoint
  • Support for quantization (AWQ, GPTQ, FP8)
  • Tensor parallelism for distributed serving (see the sketch after this list)
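The quantization and tensor-parallelism features map onto constructor arguments of the same Python API. A hedged sketch, assuming an AWQ-quantized checkpoint and two GPUs are available (both are assumptions for illustration, not requirements of the skill):

from vllm import LLM

# Load an AWQ-quantized checkpoint and shard it across two GPUs.
# The model id is a placeholder; quantization and tensor_parallel_size
# must match the checkpoint format and the hardware actually present.
llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)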

Use Cases

  • Deploying production LLM APIs
  • Optimizing inference latency and throughput
  • Serving large models with limited GPU memory
  • Building multi-user applications like chatbots (see the client example after this list)
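For the production-API and chatbot use cases, clients talk to vLLM's OpenAI-compatible endpoint. A minimal sketch using the openai Python client, assuming a vLLM server is already running locally (host, port, and model name are placeholders):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder, must match the served model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)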

Non-Goals

  • CPU-based inference
  • Research or prototyping with basic transformer implementations
  • NVIDIA-only, maximum-performance inference (TensorRT-LLM is an alternative)
  • Fine-tuning or training models

Practices

  • Production deployment
  • Performance optimization
  • Quantization
  • Distributed serving

Prerequisites

  • NVIDIA GPU with CUDA installed
  • Python environment
  • vLLM library installed

Execution

  • Info: Pinned dependencies. The SKILL.md lists `dependencies: [vllm, torch, transformers]` but does not explicitly declare pinned interpreter versions or side-effect headers for any bundled scripts, although the installation instructions point to `pip install vllm`.

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
97/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

Tensorrt Llm

98

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
Orchestra-Research

Hqq Quantization

98

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
Orchestra-Research

Hqq Quantization

96

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
davila7

Llama Cpp

95

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
Orchestra-Research

VLLM Inference Serving

93

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
davila7

Llama Cpp

85

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

Skill
davila7