
Megatron Core LLM Training

Skill · Verified · Active

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, need maximum GPU efficiency (47% MFU on H100), or require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.

Purpose

Enables users to efficiently train large language models at scale using advanced parallelism techniques offered by NVIDIA Megatron-Core, targeting maximum GPU utilization and production-ready deployments.

Features

  • Trains LLMs from 2B to 462B parameters
  • Leverages NVIDIA Megatron-Core framework
  • Implements advanced parallelism strategies (TP, PP, SP, CP, EP)
  • Optimizes for maximum GPU efficiency (up to 47% MFU on H100)
  • Provides production-ready training recipes and configurations
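The parallelism strategies above compose multiplicatively: tensor, pipeline, and context parallelism each claim a factor of the GPU count, and the GPUs left over form data-parallel replicas. A minimal sketch of that arithmetic (the helper `data_parallel_size` is hypothetical, not a Megatron-Core API; sequence parallelism reuses the tensor-parallel group, so it adds no factor of its own):

```python
# Hypothetical helper: given a world size and per-dimension parallel
# degrees, compute the remaining data-parallel degree. Tensor (TP),
# pipeline (PP), and context (CP) parallelism compose multiplicatively;
# the leftover GPUs become data-parallel replicas.

def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# Example: 512 GPUs with TP=8, PP=4 leaves 16 data-parallel replicas.
print(data_parallel_size(512, tp=8, pp=4))
```

The same divisibility constraint is why parallelism configuration is chosen jointly with GPU count: an indivisible combination simply cannot be launched.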

Use Cases

  • Training models larger than 1 billion parameters
  • Achieving maximum GPU efficiency during LLM training
  • Requiring fine-grained control over tensor, pipeline, sequence, context, or expert parallelism
  • Deploying production-grade LLM training pipelines

Non-Goals

  • Training models smaller than 1 billion parameters
  • Basic LLM fine-tuning without advanced parallelism
  • Using frameworks other than NVIDIA Megatron-Core for large-scale training

Workflow

  1. Choose parallelism configuration based on model size and GPU count
  2. Configure training hyperparameters (batch size, learning rate, optimizer)
  3. Set up distributed training environment (e.g., using torchrun)
  4. Launch training script with specified configurations
  5. Monitor performance metrics (MFU, throughput, loss)
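For step 5, the MFU figure quoted above (up to 47% on H100) can be sanity-checked with the common back-of-envelope estimate of ~6N FLOPs per token for a dense transformer's forward and backward pass. This is a sketch under that assumption, not Megatron-Core's exact FLOPs accounting; `approx_mfu` and its defaults are illustrative:

```python
# Rough MFU (model FLOPs utilization) estimate for a dense transformer.
# Assumes ~6 * N FLOPs per token for forward + backward, which ignores
# attention FLOPs and so slightly underestimates true model FLOPs.
# The 989 TFLOPS default is the H100 SXM dense BF16 Tensor Core peak;
# adjust for your hardware and precision.

def approx_mfu(params_billion: float,
               tokens_per_sec: float,
               num_gpus: int,
               peak_tflops_per_gpu: float = 989.0) -> float:
    achieved_flops = 6.0 * params_billion * 1e9 * tokens_per_sec
    peak_flops = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops / peak_flops

# Example: a 70B model at ~8855 tokens/s on 8 H100s works out to ~47% MFU.
print(round(approx_mfu(70, 8855, 8), 2))
```

Tracking this ratio alongside throughput and loss makes regressions in the parallelism configuration visible immediately, rather than only at the end of a run.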

Practices

  • Large-Scale Training
  • Distributed Systems
  • Model Parallelism
  • GPU Optimization

Prerequisites

  • NVIDIA GPUs (Ampere+ recommended, Hopper+ for FP8)
  • Python 3.8+
  • PyTorch 2.x+
  • Transformer Engine library
  • Apex library
  • Sufficient GPU memory and fast storage for checkpoints

Execution

  • Pinned dependencies: dependencies are listed in SKILL.md, but no lockfiles are provided or referenced for pinning.

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Similar Extensions

Verl Rl Training

99

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

Skill
Orchestra-Research

Training Llms Megatron

97

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, need maximum GPU efficiency (47% MFU on H100), or require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.

Skill
davila7

Verl Rl Training

95

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

Skill
davila7

Incident Response

100

Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan incident response procedures. Triggers on incident response, production incident, outage, service down, site down, P0, P1, severity, downtime, on-call, incident commander, status page, postmortem prep. Also triggers when something is actively broken in production and the user is figuring out what to do.

Skill
rampstackco

Video

100

When the user wants to create, generate, or produce video content using AI tools or programmatic frameworks. Also use when the user mentions 'video production,' 'AI video,' 'Remotion,' 'Hyperframes,' 'HeyGen,' 'Synthesia,' 'Veo,' 'Runway,' 'Kling,' 'Pika,' 'video generation,' 'AI avatar,' 'talking head video,' 'programmatic video,' 'video template,' 'explainer video,' 'product demo video,' 'video pipeline,' or 'make me a video.' Use this for video creation, generation, and production workflows. For video content strategy and what to post, see social-content. For paid video ad creative, see ad-creative.

Skill
coreyhaines31

Golang Concurrency Patterns

100

Go concurrency patterns for production services: context cancellation, errgroup, worker pools, bounded parallelism, fan-in/fan-out, and common race/deadlock pitfalls

Skill
bobmatnyc