TensorRT LLM Inference Serving
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Purpose
Enable users to achieve maximum inference throughput and lowest latency for LLMs on NVIDIA GPUs, particularly in production deployments that require significant speedups and efficient resource utilization.
Features
- Optimize LLM inference with NVIDIA TensorRT-LLM
- Achieve high throughput and low latency
- Support for production deployment on NVIDIA GPUs
- Utilize quantization (FP8, INT4)
- Configure in-flight batching and multi-GPU scaling
Use Cases
- Deploying LLMs on NVIDIA A100/H100 GPUs for maximum performance.
- Serving LLMs with low latency for real-time applications.
- Optimizing inference costs by using quantization and efficient batching.
- Scaling LLM serving across multiple GPUs or nodes.
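For the multi-GPU use case above, a minimal sketch using TensorRT-LLM's serving CLI (the `--tp_size` flag and the model ID are assumptions taken from TensorRT-LLM's `trtllm-serve` tool; confirm both against the current documentation):

```shell
# Serve a model with tensor parallelism across 2 GPUs.
# Flag name and model ID are illustrative assumptions, not verified here.
trtllm-serve meta-llama/Llama-3.1-70B-Instruct --tp_size 2 --port 8000
```

Tensor parallelism splits each layer's weights across GPUs, which is the usual choice when a single model is too large for one device's memory.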
Non-Goals
- Model training or fine-tuning
- Usage on non-NVIDIA hardware (e.g., AMD GPUs, CPUs)
- General application development beyond LLM inference serving
Workflow
- Review use case and hardware requirements
- Install TensorRT-LLM via Docker or pip
- Configure and run basic inference or trtllm-serve
- Apply optimizations like quantization and batching
- Deploy across multiple GPUs or nodes if needed
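The workflow steps above can be sketched as a shell session (the PyPI index URL, package name, model ID, and endpoint path are assumptions based on TensorRT-LLM's pip distribution and its OpenAI-compatible server; verify them against the current TensorRT-LLM installation and serving guides):

```shell
# 1. Install TensorRT-LLM into a Python 3.10-3.12 environment (pip route).
#    Package name and extra index URL are assumptions; check the install docs.
pip3 install tensorrt-llm --extra-index-url https://pypi.nvidia.com

# 2. Launch an OpenAI-compatible server (model ID is illustrative).
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

# 3. Query it via the standard OpenAI chat-completions route.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Quantization and batching options are then layered on top of this baseline, typically via additional serve-time configuration rather than code changes.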
Prerequisites
- NVIDIA GPUs (A100/H100 recommended)
- CUDA Toolkit (version compatible with TensorRT-LLM)
- Python 3.10-3.12
- Docker (recommended for consistent environment)
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, ...). Assumes the repository follows the agentskills.io format.
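For the Docker route mentioned in the workflow, a minimal sketch (the NGC image path and tag are assumptions based on NVIDIA's container catalog naming; pull the current release tag from NVIDIA NGC):

```shell
# Run the TensorRT-LLM release container with GPU access.
# Image name and tag are illustrative assumptions, not verified here.
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/tensorrt-llm/release:latest
```

The container bundles a matched CUDA toolkit and TensorRT-LLM build, which avoids the version-compatibility issues listed under Prerequisites.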
Quality Score
Verified trust signals
Similar Extensions
Tensorrt Llm (98)
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Arize Prompt Optimization (100)
Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
Unsloth (100)
Expert guidance for fast fine-tuning with Unsloth: 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization
Prompt Optimization (100)
Applies prompt repetition to improve accuracy for LLMs without reasoning capability
Chat Format (100)
Format prompts for different LLM providers with chat templates and HNSW-powered context retrieval
Oh My Claudecode (100)
Process-first advisor routing for Claude, Codex, or Gemini via `omc ask`, with artifact capture and no raw CLI assembly