
TorchTitan Distributed LLM Pretraining

Skill · Verified · Active

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Purpose

Enables efficient and scalable pretraining of large language models using PyTorch's native distributed training capabilities.

Features

  • 4D parallelism (FSDP2, TP, PP, CP)
  • PyTorch-native distributed training
  • Float8 training for H100 GPUs
  • Support for Llama 3.1, DeepSeek V3, and custom models
  • Distributed checkpointing and interoperability
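
The parallelism degrees and the Float8 toggle above map onto sections of the training TOML file. The snippet below is a hedged sketch: the section and key names (e.g. [parallelism], enable_fsdp_float8_all_gather) follow the layout of the sample configs shipped with torchtitan but may vary between versions, so verify them against the train_configs in the repo.

[parallelism]
data_parallel_shard_degree = 8    # FSDP2 sharding across 8 ranks
tensor_parallel_degree = 2        # TP across 2 GPUs
pipeline_parallel_degree = 2      # PP with 2 stages
context_parallel_degree = 1       # CP disabled (degree 1)

[float8]
enable_fsdp_float8_all_gather = true   # Float8 path for H100-class GPUs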

Use Cases

  • Pretraining LLMs from scratch at scale (8 to 512+ GPUs)
  • Leveraging PyTorch-native solutions for distributed training
  • Optimizing training performance with Float8 on H100 GPUs
  • Achieving interoperable checkpoints with torchtune/HuggingFace

Non-Goals

  • Fine-tuning LLMs (the focus is pretraining)
  • Framework-agnostic training (PyTorch and its ecosystem are required)
  • Matching Megatron-LM's peak performance on NVIDIA-only deployments
  • Inference serving (the focus is training)

Workflow

  1. Download tokenizer
  2. Configure training (TOML file)
  3. Launch training (script or torchrun)
  4. Monitor training (TensorBoard)
  5. Manage checkpoints
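
A hedged sketch of steps 1-4; the script path, module entry point, and config location follow common torchtitan layouts (scripts/download_tokenizer.py, -m torchtitan.train) but may differ in your checkout:

export HF_TOKEN=...   # HuggingFace token for gated assets (step 1)
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B

# Steps 2-3: point torchrun at the TOML config on an 8-GPU node
torchrun --nproc_per_node=8 -m torchtitan.train \
  --job.config_file ./train_configs/llama3_8b.toml

# Step 4: TensorBoard, assuming metrics are written under ./outputs
tensorboard --logdir ./outputs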

Practices

  • Model Architecture
  • Distributed Training
  • Optimization
  • LLM Pretraining

Prerequisites

  • PyTorch >= 2.6.0
  • TorchTitan >= 0.2.0
  • TorchAO >= 0.5.0
  • HuggingFace token for asset download
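
A minimal environment sketch matching these pins. It assumes torchtitan and torchao are installable from PyPI under those names (torchtitan can also be installed from source), so treat the exact install path as an assumption:

pip install "torch>=2.6.0" "torchtitan>=0.2.0" "torchao>=0.5.0"
huggingface-cli login   # stores the HF token used for asset downloads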

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
99/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

Distributed LLM Pretraining Torchtitan (98/100)

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill · davila7

Ray Train (99/100)

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill · Orchestra-Research

PyTorch Lightning (99/100)

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with the same code. Use when you want clean training loops with built-in best practices.

Skill · Orchestra-Research

OpenRLHF Training (99/100)

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill · Orchestra-Research

HuggingFace Accelerate (99/100)

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill · davila7

HuggingFace Accelerate (97/100)

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill · Orchestra-Research