TensorRT LLM Inference Serving
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Purpose
Enable users to achieve maximum inference throughput and lowest latency for LLMs on NVIDIA GPUs, particularly in production deployments that require significant speedups and efficient resource utilization.
Features
- Optimize LLM inference with NVIDIA TensorRT-LLM
- Achieve high throughput and low latency
- Support for production deployment on NVIDIA GPUs
- Utilize quantization (FP8, INT4)
- Configure in-flight batching and multi-GPU scaling
Use Cases
- Deploying LLMs on NVIDIA A100/H100 GPUs for maximum performance.
- Serving LLMs with low latency for real-time applications.
- Optimizing inference costs by using quantization and efficient batching.
- Scaling LLM serving across multiple GPUs or nodes.
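For the multi-GPU use case above, a minimal sketch using TensorRT-LLM's serving CLI (the `--tp_size` flag and the model ID are assumptions taken from TensorRT-LLM's `trtllm-serve` tool; confirm both against the current documentation):

```shell
# Serve a model with tensor parallelism across 2 GPUs.
# Flag name and model ID are illustrative assumptions, not verified here.
trtllm-serve meta-llama/Llama-3.1-70B-Instruct --tp_size 2 --port 8000
```

Tensor parallelism splits each layer's weights across GPUs, which is the usual choice when a single model is too large for one device's memory.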
Non-Goals
- Model training or fine-tuning
- Usage on non-NVIDIA hardware (e.g., AMD GPUs, CPUs)
- General application development beyond LLM inference serving
Workflow
- Review use case and hardware requirements
- Install TensorRT-LLM via Docker or pip
- Configure and run basic inference or trtllm-serve
- Apply optimizations like quantization and batching
- Deploy across multiple GPUs or nodes if needed
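The workflow steps above can be sketched as a shell session (the PyPI index URL, package name, model ID, and endpoint path are assumptions based on TensorRT-LLM's pip distribution and its OpenAI-compatible server; verify them against the current TensorRT-LLM installation and serving guides):

```shell
# 1. Install TensorRT-LLM into a Python 3.10-3.12 environment (pip route).
#    Package name and extra index URL are assumptions; check the install docs.
pip3 install tensorrt-llm --extra-index-url https://pypi.nvidia.com

# 2. Launch an OpenAI-compatible server (model ID is illustrative).
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

# 3. Query it via the standard OpenAI chat-completions route.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Quantization and batching options are then layered on top of this baseline, typically via additional serve-time configuration rather than code changes.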
Prerequisites
- NVIDIA GPUs (A100/H100 recommended)
- CUDA Toolkit (version compatible with TensorRT-LLM)
- Python 3.10-3.12
- Docker (recommended for consistent environment)
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, ...). Assumes the repository follows the agentskills.io format.
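For the Docker route mentioned in the workflow, a minimal sketch (the NGC image path and tag are assumptions based on NVIDIA's container catalog naming; pull the current release tag from NVIDIA NGC):

```shell
# Run the TensorRT-LLM release container with GPU access.
# Image name and tag are illustrative assumptions, not verified here.
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/tensorrt-llm/release:latest
```

The container bundles a matched CUDA toolkit and TensorRT-LLM build, which avoids the version-compatibility issues listed under Prerequisites.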
Quality Score
Verified trust signals
Similar Extensions
Tensorrt Llm (98)
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Arize Prompt Optimization (100)
Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
Unsloth (100)
Expert guidance for fast fine-tuning with Unsloth: 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization
Prompt Optimization (100)
Applies prompt repetition to improve accuracy for LLMs without reasoning capability
Chat Format (100)
Format prompts for different LLM providers with chat templates and HNSW-powered context retrieval
Oh My Claudecode (100)
Process-first advisor routing for Claude, Codex, or Gemini via `omc ask`, with artifact capture and no raw CLI assembly