TensorRT-LLM

Skill Verified Active

Part of: Agent Native Research Artifact (ARA) Tooling

Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and lowest latency. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or when serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Purpose

Helps users achieve state-of-the-art LLM inference performance in production environments by using NVIDIA TensorRT-LLM's optimization and serving capabilities.

Features

Optimizes LLM inference with NVIDIA TensorRT-LLM
Achieves high throughput and low latency on NVIDIA GPUs
Supports production deployment scenarios
Demonstrates use of quantization (FP8, INT4)
Covers in-flight batching and multi-GPU scaling
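
To make the in-flight batching item concrete, here is a toy sketch in plain Python. It does not call TensorRT-LLM; it only counts decode iterations under two simplified scheduling policies. The function names and the request lengths are hypothetical, and a real scheduler also has to account for prefill, memory limits, and request arrival times.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode iterations when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def inflight_batching_steps(lengths, batch_size):
    """Decode iterations when a freed slot is refilled before the next iteration."""
    queue = deque(lengths)   # requests waiting for a slot (tokens to generate)
    active = []              # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode iteration: every active request emits one token.
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical workload: one long request and three short ones, two slots.
lengths = [8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=2))
print(inflight_batching_steps(lengths, batch_size=2))
```

In this toy workload the static policy leaves a slot idle while the long request finishes, whereas the in-flight policy reuses that slot for the waiting short requests, so it needs fewer iterations overall.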

Use Cases

Deploying LLMs in production on NVIDIA A100/H100 GPUs
Serving models requiring maximum throughput (e.g., 24,000+ tokens/sec)
Reducing inference latency for real-time applications
Utilizing quantized models (FP8/INT4) for memory and speed gains
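
The memory side of the quantization use case can be estimated with simple arithmetic: weight storage is roughly the parameter count times bits per weight, divided by eight. The sketch below is a back-of-the-envelope helper, not part of TensorRT-LLM. The function name, the 70B parameter count, and the even split across tensor-parallel GPUs are assumptions for illustration, and it ignores KV cache, activations, and quantization scale overhead.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "int4": 4}


def weight_memory_gb(num_params, precision, tp_size=1):
    """Approximate per-GPU weight memory in GB (1 GB = 1e9 bytes).

    Assumes weights are sharded evenly across `tp_size` GPUs, which is a
    simplification of how tensor parallelism distributes a model.
    """
    total_bytes = num_params * BITS_PER_WEIGHT[precision] / 8
    return total_bytes / tp_size / 1e9


# Hypothetical 70B-parameter model on a single GPU.
for precision in BITS_PER_WEIGHT:
    print(precision, weight_memory_gb(70e9, precision))
```

Under these assumptions, halving the bits per weight halves the weight footprint, and the same model in FP16 split across four GPUs needs about as much weight memory per GPU as the INT4 version on one.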

Non-Goals

Optimizing LLM inference on non-NVIDIA hardware
Providing a user-friendly Python-first API like vLLM
Edge deployment without NVIDIA GPUs
Using non-TensorRT quantization formats like GGUF

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified

98/100

Analyzed 2 days ago

Trust Signals

Last commit: 18 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com

Similar Extensions

TensorRT LLM Inference Serving

Skill

davila7

Miles RL Training

Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.

Skill

Orchestra-Research

vLLM High Performance LLM Serving

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Skill

Orchestra-Research

Miles RL Training

Skill

davila7

Incident Response

100

Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan incident response procedures. Triggers on incident response, production incident, outage, service down, site down, P0, P1, severity, downtime, on-call, incident commander, status page, postmortem prep. Also triggers when something is actively broken in production and the user is figuring out what to do.

Skill

rampstackco

Video

100

When the user wants to create, generate, or produce video content using AI tools or programmatic frameworks. Also use when the user mentions 'video production,' 'AI video,' 'Remotion,' 'Hyperframes,' 'HeyGen,' 'Synthesia,' 'Veo,' 'Runway,' 'Kling,' 'Pika,' 'video generation,' 'AI avatar,' 'talking head video,' 'programmatic video,' 'video template,' 'explainer video,' 'product demo video,' 'video pipeline,' or 'make me a video.' Use this for video creation, generation, and production workflows. For video content strategy and what to post, see social-content. For paid video ad creative, see ad-creative.

Skill

coreyhaines31