
TensorRT-LLM

Skill: Verified, Active

Part of: Agent Native Research Artifact (ARA) Tooling

Optimizes LLM inference with NVIDIA TensorRT-LLM for high throughput and low latency. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or when serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Purpose

To enable users to achieve state-of-the-art performance for LLM inference in production environments by utilizing NVIDIA TensorRT-LLM's advanced optimization and serving capabilities.

Features

Optimizes LLM inference with NVIDIA TensorRT-LLM
Achieves high throughput and low latency on NVIDIA GPUs
Supports production deployment scenarios
Demonstrates use of quantization (FP8, INT4)
Covers in-flight batching and multi-GPU scaling
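As a rough illustration of why in-flight batching matters, the toy model below counts decode steps for a queue of requests under two policies. It is a minimal sketch, not TensorRT-LLM's scheduler: the function names, the request lengths, and the one-token-per-step cost model are all assumptions made for the example.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def inflight_batching_steps(lengths, max_batch):
    """Decode steps when a freed batch slot is refilled on the next step."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    active = []                # tokens still to generate, per running request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before this decode step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished requests leave.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical workload: one long request followed by three short ones.
lengths = [8, 2, 2, 2]
print(static_batching_steps(lengths, max_batch=2))
print(inflight_batching_steps(lengths, max_batch=2))
```

In this toy workload the short requests no longer wait for the long one to finish, so the queue drains in fewer steps. Real gains depend on the request mix and on scheduler details this sketch ignores.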

Use cases

Deploying LLMs in production on NVIDIA A100/H100 GPUs
Serving models requiring maximum throughput (e.g., 24,000+ tokens/sec)
Reducing inference latency for real-time applications
Utilizing quantized models (FP8/INT4) for memory and speed gains
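The memory side of the quantization point above can be estimated with simple arithmetic. The sketch below covers weights only and assumes 2, 1, and 0.5 bytes per parameter for FP16, FP8, and INT4; it ignores the KV cache, activations, and quantization scale overhead. The 70-billion-parameter model size is a hypothetical example, not a figure from this listing.

```python
# Assumed storage cost per parameter, in bytes, for each weight format.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, fmt):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Hypothetical 70B-parameter model.
num_params = 70_000_000_000
for fmt in ("FP16", "FP8", "INT4"):
    print(f"{fmt}: {weight_memory_gb(num_params, fmt):.0f} GB")
```

Under these assumptions, FP8 halves and INT4 quarters the FP16 weight footprint, which is what lets a larger model fit on the same number of GPUs.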

Non-goals

Optimizing LLM inference on non-NVIDIA hardware
Providing a user-friendly Python-first API like vLLM
Edge deployment without NVIDIA GPUs
Using non-TensorRT quantization formats like GGUF

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified

98/100

Analyzed 1 day ago

Trust signals

Last commit: 17 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com

Similar extensions

TensorRT LLM Inference Serving

Skill

davila7

Miles RL Training

Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.

Skill

Orchestra-Research

vLLM High Performance LLM Serving

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill

Orchestra-Research

Miles RL Training

Skill

davila7

Incident Response

100

Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan incident response procedures. Triggers on incident response, production incident, outage, service down, site down, P0, P1, severity, downtime, on-call, incident commander, status page, postmortem prep. Also triggers when something is actively broken in production and the user is figuring out what to do.

Skill

rampstackco

Video

100

When the user wants to create, generate, or produce video content using AI tools or programmatic frameworks. Also use when the user mentions 'video production,' 'AI video,' 'Remotion,' 'Hyperframes,' 'HeyGen,' 'Synthesia,' 'Veo,' 'Runway,' 'Kling,' 'Pika,' 'video generation,' 'AI avatar,' 'talking head video,' 'programmatic video,' 'video template,' 'explainer video,' 'product demo video,' 'video pipeline,' or 'make me a video.' Use this for video creation, generation, and production workflows. For video content strategy and what to post, see social-content. For paid video ad creative, see ad-creative.

Skill

coreyhaines31