Skip to main content

TensorRT LLM Inference Serving

Skill Verified Active

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Purpose

To enable users to achieve maximum inference throughput and lowest latency for LLMs on NVIDIA GPUs, particularly for production deployments requiring significant speedups and efficient resource utilization.

Features

  • Optimize LLM inference with NVIDIA TensorRT-LLM
  • Achieve high throughput and low latency
  • Support for production deployment on NVIDIA GPUs
  • Utilize quantization (FP8, INT4)
  • Configure in-flight batching and multi-GPU scaling

Use Cases

  • Deploying LLMs on NVIDIA A100/H100 GPUs for maximum performance.
  • Serving LLMs with low latency for real-time applications.
  • Optimizing inference costs by using quantization and efficient batching.
  • Scaling LLM serving across multiple GPUs or nodes.

Non-Goals

  • Model training or fine-tuning
  • Usage on non-NVIDIA hardware (e.g., AMD GPUs, CPUs)
  • General application development beyond LLM inference serving

Workflow

  1. Review use case and hardware requirements
  2. Install TensorRT-LLM via Docker or pip
  3. Configure and run basic inference or trtllm-serve
  4. Apply optimizations like quantization and batching
  5. Deploy across multiple GPUs or nodes if needed

Prerequisites

  • NVIDIA GPUs (A100/H100 recommended)
  • CUDA Toolkit (version compatible with TensorRT-LLM)
  • Python 3.10-3.12
  • Docker (recommended for consistent environment)

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx — needs Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified
99 /100
Analyzed about 19 hours ago

Trust Signals

Last commitabout 21 hours ago
Stars27.2k
LicenseMIT
Status
View Source

© 2025 SkillRepo · Find the right skill, skip the noise.