
TensorRT LLM Inference Serving

Skill · Verified · Active

Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Purpose

To enable users to achieve maximum inference throughput and lowest latency for LLMs on NVIDIA GPUs, particularly for production deployments requiring significant speedups and efficient resource utilization.

Capabilities

  • Optimize LLM inference with NVIDIA TensorRT-LLM (a minimal sketch follows this list)
  • Achieve high throughput and low latency
  • Support for production deployment on NVIDIA GPUs
  • Utilize quantization (FP8, INT4)
  • Configure in-flight batching and multi-GPU scaling
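As a minimal starting point, TensorRT-LLM's high-level LLM API wraps engine building and batched inference behind a few Python calls. The sketch below assumes the LLM/SamplingParams entry points from the LLM API quickstart; the model id is purely illustrative.

  from tensorrt_llm import LLM, SamplingParams

  def main():
      # Any Hugging Face model id or local checkpoint path works here;
      # TinyLlama is only an illustrative choice.
      llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

      prompts = ["The capital of France is"]
      sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

      # generate() builds/loads the TensorRT engine on first use and
      # runs batched inference over all prompts.
      for output in llm.generate(prompts, sampling):
          print(output.prompt, "->", output.outputs[0].text)

  if __name__ == "__main__":
      main()

The main-guard is deliberate: the runtime may spawn worker processes, so the entry point should be import-safe.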

Use cases

  • Deploying LLMs on NVIDIA A100/H100 GPUs for maximum performance.
  • Serving LLMs with low latency for real-time applications.
  • Optimizing inference costs by using quantization and efficient batching (see the quantization sketch after this list).
  • Scaling LLM serving across multiple GPUs or nodes.
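For the cost-optimization case, recent TensorRT-LLM releases expose quantization through the LLM API's QuantConfig. A hedged sketch: the import paths (tensorrt_llm.llmapi.QuantConfig / QuantAlgo) follow the LLM API examples and may differ across versions, and FP8 requires Hopper-class hardware (e.g. H100); the model id is illustrative.

  from tensorrt_llm import LLM
  from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

  # FP8 weight/activation quantization; assumes an H100-class GPU.
  # An INT4 AWQ variant (e.g. QuantAlgo.W4A16_AWQ) is the usual
  # alternative on A100, which lacks FP8 support.
  quant = QuantConfig(quant_algo=QuantAlgo.FP8)

  llm = LLM(
      model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
      quant_config=quant,
  )

In-flight batching is handled by the runtime itself: concurrent generate() requests are scheduled into shared batches without extra configuration.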

Non-goals

  • Model training or fine-tuning
  • Usage on non-NVIDIA hardware (e.g., AMD GPUs, CPUs)
  • General application development beyond LLM inference serving

Workflow

  1. Review use case and hardware requirements
  2. Install TensorRT-LLM via Docker or pip (the pip command is shown under Prerequisites below)
  3. Configure and run basic inference or trtllm-serve (see the serving sketch after these steps)
  4. Apply optimizations like quantization and batching
  5. Deploy across multiple GPUs or nodes if needed (a tensor-parallel sketch follows)
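For step 3, trtllm-serve exposes an OpenAI-compatible HTTP API, so any OpenAI client can talk to it. A minimal sketch, assuming the default port 8000 and the openai Python package; the model id is illustrative.

  # Server side (run once, in a shell):
  #   trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
  #
  # Client side: the endpoint speaks the OpenAI chat-completions protocol.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
      messages=[{"role": "user", "content": "Summarize TensorRT-LLM in one line."}],
      max_tokens=64,
  )
  print(resp.choices[0].message.content)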
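For step 5, multi-GPU scaling in the LLM API is a constructor argument: tensor_parallel_size shards the model's weights across GPUs (pipeline parallelism is configured analogously). A sketch assuming two local GPUs:

  from tensorrt_llm import LLM

  # Shard weights across 2 GPUs with tensor parallelism.
  # Multi-node deployments typically launch the same workload under MPI
  # (e.g. mpirun); see the TensorRT-LLM documentation for details.
  llm = LLM(
      model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
      tensor_parallel_size=2,
  )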

Prerequisites

  • NVIDIA GPUs (A100/H100 recommended)
  • CUDA Toolkit (version compatible with TensorRT-LLM)
  • Python 3.10-3.12
  • Docker (recommended for consistent environment)
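For the pip route in workflow step 2, the TensorRT-LLM installation docs give a command along these lines (exact package index requirements vary by release; Docker images from NVIDIA NGC are the alternative):

  pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com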

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.

Quality score

Verified: 99/100 (analyzed 1 day ago)

Trust signals

  • Last commit: 1 day ago
  • Stars: 27.2k
  • License: MIT