
VLLM Inference Serving

Skill · Active

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Purpose

To enable efficient, high-throughput deployment of Large Language Models for production APIs and applications, especially when optimizing for latency, throughput, or limited GPU memory.

Capabilities

  • High-throughput LLM serving with vLLM
  • Optimized inference latency and throughput
  • Support for limited GPU memory scenarios
  • OpenAI-compatible API endpoints
  • Quantization support (GPTQ, AWQ, FP8)
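A typical deployment combines these capabilities in a single `vllm serve` invocation. As a sketch, the snippet below assembles such a command line as an argument list; the model name is a placeholder, and the flags shown (`--tensor-parallel-size`, `--gpu-memory-utilization`, `--quantization`) are standard vLLM serve options:

```python
# Build a launch command for vLLM's OpenAI-compatible server.
# This only constructs the argument list; actually running it
# requires vLLM installed on a machine with NVIDIA GPUs.
def serve_command(model, quantization=None, tp_size=1, gpu_mem_util=0.90):
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),       # split weights across GPUs
        "--gpu-memory-utilization", str(gpu_mem_util),  # fraction of VRAM vLLM may use
    ]
    if quantization:  # e.g. "awq", "gptq", "fp8"
        cmd += ["--quantization", quantization]
    return cmd

print(" ".join(serve_command("meta-llama/Llama-3.1-8B-Instruct",
                             quantization="awq", tp_size=2)))
```

Run the resulting command (e.g. via `subprocess.run(cmd)` or directly in a shell) to expose an OpenAI-compatible API on port 8000 by default.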

Use Cases

  • Deploying production-ready LLM APIs
  • Optimizing inference performance for cost and speed
  • Serving large language models on resource-constrained hardware
  • Building applications that require low-latency, high-concurrency LLM interactions
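For application use cases like these, clients talk to the served model through its OpenAI-compatible endpoint. The sketch below builds a chat-completions request body; the model name and localhost URL are placeholders for whatever you actually serve:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# that vLLM exposes. Field names follow the OpenAI chat API schema.
def chat_request(model, prompt, max_tokens=128, temperature=0.7):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions, e.g. with
# requests.post(url, json=payload), or use the official openai client
# pointed at base_url="http://localhost:8000/v1".
```

Because the endpoint mirrors the OpenAI API, existing OpenAI SDK code usually works with only the base URL changed.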

Non-Goals

  • Training or fine-tuning LLMs
  • Providing a general-purpose Python inference library outside of vLLM's scope
  • Serving models without NVIDIA GPUs (primary focus)
  • Managing the entire cloud infrastructure for LLM deployment

Prerequisites

  • NVIDIA GPU with appropriate VRAM
  • CUDA toolkit installed
  • Python environment

Trust

  • Warning: issues need attention. There are 17 open issues and only 4 closed issues in the last 90 days, indicating a low closure rate and potentially slow maintainer response.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.

Quality Score

93/100
Analyzed 1 day ago

Trust Signals

  • Last commit: 1 day ago
  • Stars: 27.2k
  • License: MIT

Similar Extensions

Hqq Quantization

98

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
Orchestra-Research

VLLM High Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research

Hqq Quantization

96

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

Skill
davila7

AWQ Quantization

95

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

Skill
Orchestra-Research

PyMC Bayesian Modeling

99

Bayesian modeling with PyMC. Build hierarchical models, MCMC (NUTS), variational inference, LOO/WAIC comparison, posterior checks, for probabilistic programming and inference.

Skill
K-Dense-AI

LLM Models via OpenRouter

99

Access Claude, Gemini, Kimi, GLM and 100+ LLMs via inference.sh CLI using OpenRouter. Models: Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 3 Pro, Kimi K2, GLM-4.6, Intellect 3. One API for all models with automatic fallback and cost optimization. Use for: AI assistants, code generation, reasoning, agents, chat, content generation. Triggers: claude api, openrouter, llm api, claude sonnet, claude opus, gemini api, kimi, language model, gpt alternative, anthropic api, ai model api, llm access, chat api, claude alternative, openai alternative

Skill
inferen-sh