TensorRT-LLM
Skill · Verified · Active
Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and lowest latency. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or when serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Goal: enable users to achieve state-of-the-art LLM inference performance in production environments by using NVIDIA TensorRT-LLM's optimization and serving capabilities.
Features
- Optimizes LLM inference with NVIDIA TensorRT-LLM
- Achieves high throughput and low latency on NVIDIA GPUs
- Supports production deployment scenarios
- Demonstrates use of quantization (FP8, INT4)
- Covers in-flight batching and multi-GPU scaling
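In-flight batching, listed above, can be illustrated with a toy model. The sketch below is not TensorRT-LLM's scheduler; it is a pure-Python simulation under simplifying assumptions (one token per request per decode step, no prefill cost, no KV-cache memory limit). The function names and the example workload are hypothetical and chosen only for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch):
    """Decode steps when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(output_lens), max_batch):
        # A static batch runs until its longest request is done.
        total += max(output_lens[i:i + max_batch])
    return total


def inflight_batching_steps(output_lens, max_batch):
    """Decode steps when freed slots are refilled from the queue right away."""
    waiting = deque(output_lens)
    active = []  # remaining tokens for each running request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step produces one token for every active request.
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical workload: output lengths (in tokens) of eight requests.
lens = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lens, max_batch=4)
inflight_steps = inflight_batching_steps(lens, max_batch=4)
print(static_steps, inflight_steps)
```

For this workload the static schedule needs 16 decode steps and the in-flight schedule needs 10, because short requests release their slots instead of waiting for the longest request in their batch.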
Use cases
- Deploying LLMs in production on NVIDIA A100/H100 GPUs
- Serving models requiring maximum throughput (e.g., 24,000+ tokens/sec)
- Reducing inference latency for real-time applications
- Utilizing quantized models (FP8/INT4) for memory and speed gains
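The memory gain from quantization and multi-GPU sharding can be estimated with simple arithmetic. The sketch below assumes a hypothetical 70-billion-parameter model, counts weights only (no KV cache, activations, or quantization scale overhead), and assumes weights are split evenly across tensor-parallel GPUs. The helper name `weight_memory_gb` is an illustration, not part of any TensorRT-LLM API.

```python
# Approximate storage per parameter, in bytes, for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype, tensor_parallel=1):
    """Approximate per-GPU weight memory in GB (1 GB = 1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype}")
    if tensor_parallel < 1:
        raise ValueError("tensor_parallel must be >= 1")
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    # Assumption: weights shard evenly across the tensor-parallel group.
    return total_bytes / tensor_parallel / 1e9


# Hypothetical 70B-parameter model on 1 GPU and on 4 GPUs.
NUM_PARAMS = 70e9
for dtype in ("fp16", "fp8", "int4"):
    single = weight_memory_gb(NUM_PARAMS, dtype)
    sharded = weight_memory_gb(NUM_PARAMS, dtype, tensor_parallel=4)
    print(f"{dtype}: {single:.2f} GB on 1 GPU, {sharded:.2f} GB per GPU on 4")
```

Under these assumptions, FP8 halves and INT4 quarters the FP16 weight footprint, which is what lets a larger model fit on a given number of GPUs.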
Non-goals
- Optimizing LLM inference on non-NVIDIA hardware
- Providing a user-friendly Python-first API like vLLM
- Edge deployment without NVIDIA GPUs
- Using non-TensorRT quantization formats like GGUF
Installation
First add the Marketplace, then install the plugin:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills
Quality score
Verified
Similar extensions
TensorRT LLM Inference Serving
99 · Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Miles RL Training
97 · Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.
VLLM High Performance LLM Serving
97 · Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
Miles Rl Training
92 · Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.