
TensorRT-LLM

Skill · Verified · Active

Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and lowest latency. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need inference 10-100x faster than PyTorch, or when serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Purpose

Help users reach state-of-the-art LLM inference performance in production by using NVIDIA TensorRT-LLM's optimization and serving capabilities.

Features

  • Optimizes LLM inference with NVIDIA TensorRT-LLM
  • Achieves high throughput and low latency on NVIDIA GPUs
  • Supports production deployment scenarios
  • Demonstrates use of quantization (FP8, INT4)
  • Covers in-flight batching and multi-GPU scaling
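The in-flight batching named in the list above can be illustrated with a toy model. This is only a sketch, not TensorRT-LLM's scheduler: the function names, the request lengths, and the one-token-per-step cost model are all assumptions made for illustration, and prefill cost and KV-cache limits are ignored. It counts decode steps when a batch must wait for its longest request (static batching) versus when a finished request's slot is refilled immediately (in-flight batching).

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately.

    Assumes every length is at least 1 (each request emits one token per step).
    """
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Fill free slots from the queue before each decode step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; requests that just finished leave.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: output lengths (in tokens) of eight queued requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, 4))    # 16
print(inflight_batching_steps(lengths, 4))  # 10
```

In this toy workload the short requests no longer wait for the long ones, so the same eight requests finish in fewer decode steps. Real gains depend on the model, the request mix, and memory limits.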

Use cases

  • Deploying LLMs in production on NVIDIA A100/H100 GPUs
  • Serving models requiring maximum throughput (e.g., 24,000+ tokens/sec)
  • Reducing inference latency for real-time applications
  • Utilizing quantized models (FP8/INT4) for memory and speed gains
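As a rough illustration of the memory side of the quantization and multi-GPU points above, the following sketch estimates weight storage per GPU from the bit width of each format. It is back-of-the-envelope arithmetic, not a TensorRT-LLM API: the helper name, the 70-billion-parameter model size, and the even split across tensor-parallel ranks are assumptions, and it ignores quantization scales, activations, and the KV cache.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_memory_gb(num_params, fmt, tensor_parallel=1):
    """Approximate per-GPU weight memory in GB (1 GB = 10**9 bytes).

    Assumes weights are split evenly across `tensor_parallel` GPUs.
    """
    total_bytes = num_params * BITS_PER_WEIGHT[fmt] / 8
    return total_bytes / tensor_parallel / 1e9


# Hypothetical 70B-parameter model, on one GPU and on four GPUs.
for fmt in BITS_PER_WEIGHT:
    single = weight_memory_gb(70e9, fmt)
    sharded = weight_memory_gb(70e9, fmt, tensor_parallel=4)
    print(f"{fmt}: {single:.2f} GB on 1 GPU, {sharded:.2f} GB per GPU on 4")
```

Under these assumptions, halving the bit width halves the weight footprint, which is why FP8 or INT4 can let a model fit on fewer GPUs. Actual usage is higher once the KV cache and activations are counted.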

Non-goals

  • Optimizing LLM inference on non-NVIDIA hardware
  • Providing a user-friendly Python-first API like vLLM
  • Edge deployment without NVIDIA GPUs
  • Using non-TensorRT quantization formats like GGUF

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified
98/100
Analyzed 1 day ago

Trust signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar extensions

TensorRT LLM Inference Serving

99

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Skill
davila7

Miles RL Training

97

Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.

Skill
Orchestra-Research

vLLM High-Performance LLM Serving

97

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill
Orchestra-Research

Miles RL Training

92

Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.

Skill
davila7

Incident Response

100

Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan incident response procedures. Triggers on incident response, production incident, outage, service down, site down, P0, P1, severity, downtime, on-call, incident commander, status page, postmortem prep. Also triggers when something is actively broken in production and the user is figuring out what to do.

Skill
rampstackco

Video

100

When the user wants to create, generate, or produce video content using AI tools or programmatic frameworks. Also use when the user mentions 'video production,' 'AI video,' 'Remotion,' 'Hyperframes,' 'HeyGen,' 'Synthesia,' 'Veo,' 'Runway,' 'Kling,' 'Pika,' 'video generation,' 'AI avatar,' 'talking head video,' 'programmatic video,' 'video template,' 'explainer video,' 'product demo video,' 'video pipeline,' or 'make me a video.' Use this for video creation, generation, and production workflows. For video content strategy and what to post, see social-content. For paid video ad creative, see ad-creative.

Skill
coreyhaines31