
VLLM Inference Serving


Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Purpose

To enable efficient, high-throughput deployment of Large Language Models for production APIs and applications, especially when optimizing for latency, throughput, or limited GPU memory.

Features

  • High-throughput LLM serving with vLLM
  • Optimized inference latency and throughput
  • Support for limited GPU memory scenarios
  • OpenAI-compatible API endpoints
  • Quantization support (GPTQ, AWQ, FP8)
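As a sketch of how these features are typically combined, the command below starts vLLM's OpenAI-compatible server with a quantized model sharded across two GPUs. The model name and flag values are illustrative assumptions, not something this skill's listing prescribes:

```shell
# Start an OpenAI-compatible server (model name and flag values are illustrative).
# --quantization awq loads AWQ-quantized weights, --tensor-parallel-size 2 shards
# the model across two GPUs, and --gpu-memory-utilization caps the VRAM fraction.
vllm serve TheBloke/Llama-2-7B-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Once up, the server exposes `/v1/completions` and `/v1/chat/completions` endpoints that standard OpenAI clients can talk to.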

Use Cases

  • Deploying production-ready LLM APIs
  • Optimizing inference performance for cost and speed
  • Serving large language models on resource-constrained hardware
  • Building applications that require low-latency, high-concurrency LLM interactions
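For the low-latency API use case, a running vLLM server can be queried through its OpenAI-compatible endpoint with plain curl. The port, model name, and prompt below are assumptions; they must match whatever `vllm serve` invocation is actually running:

```shell
# Query a locally running vLLM server's OpenAI-compatible chat endpoint.
# The model name must match the one the server was started with.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/Llama-2-7B-AWQ",
        "messages": [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
        "max_tokens": 64
      }'
```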

Non-Goals

  • Training or fine-tuning LLMs
  • Providing a general-purpose Python inference library outside of vLLM's scope
  • Serving models without NVIDIA GPUs (primary focus)
  • Managing the entire cloud infrastructure for LLM deployment

Prerequisites

  • NVIDIA GPU with appropriate VRAM
  • CUDA toolkit installed
  • Python environment
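A quick sanity check of these prerequisites might look like the following. This assumes a pip-based install; exact CUDA and driver versions are environment-specific:

```shell
# Verify the NVIDIA driver and GPU are visible.
nvidia-smi

# Install vLLM into the active Python environment (pulls a matching CUDA build).
pip install vllm

# Confirm PyTorch (installed as a vLLM dependency) can see the GPU.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```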

Trust

  • ⚠️ Issues Attention: There are 17 open issues and 4 closed issues in the last 90 days, indicating a low closure rate and potentially slow maintainer response.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

93 / 100
Analyzed 1 day ago

Trust Signals

  • Last commit: 1 day ago
  • Stars: 27.2k
  • License: MIT

Similar Extensions

Hqq Quantization

98

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
Orchestra-Research

VLLM High Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research

Hqq Quantization

96

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
davila7

AWQ Quantization

95

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

Skill
Orchestra-Research

PyMC Bayesian Modeling

99

Bayesian modeling with PyMC. Build hierarchical models, MCMC (NUTS), variational inference, LOO/WAIC comparison, posterior checks, for probabilistic programming and inference.

Skill
K-Dense-AI

LLM Models via OpenRouter

99

Access Claude, Gemini, Kimi, GLM and 100+ LLMs via inference.sh CLI using OpenRouter. Models: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 3 Pro, Kimi K2, GLM-4.6, Intellect 3. One API for all models with automatic fallback and cost optimization. Use for: AI assistants, code generation, reasoning, agents, chat, content generation. Triggers: claude api, openrouter, llm api, claude sonnet, claude opus, gemini api, kimi, language model, gpt alternative, anthropic api, ai model api, llm access, chat api, claude alternative, openai alternative

Skill
inferen-sh

© 2025 SkillRepo · Find the right skill, skip the noise.