TorchTitan Distributed LLM Pretraining
Skill status: Verified, Active
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Enables efficient and scalable pretraining of large language models using PyTorch's native distributed training capabilities.
Features
- 4D parallelism (FSDP2, TP, PP, CP); see the composition sketch after this list
- PyTorch-native distributed training
- Float8 training for H100 GPUs
- Support for Llama 3.1, DeepSeek V3, and custom models
- Distributed checkpointing and interoperability
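The parallelism layers compose through PyTorch's native DeviceMesh, FSDP2, and tensor-parallel APIs. Below is a minimal illustrative sketch (not torchtitan's internal code): the mesh sizes, the toy model, and the sharding plan are placeholders, and pipeline/context parallelism would add further mesh dimensions. It is meant to run under torchrun with dp*tp processes.

```python
# Illustrative only: how FSDP2 and tensor parallelism stack on a 2D DeviceMesh.
# Mesh sizes, the toy model, and the sharding plan are placeholders, not
# torchtitan's actual configuration. Run under torchrun with 4*2 = 8 processes.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard                  # FSDP2, PyTorch >= 2.6
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module,
)

# 2D mesh: data-parallel shards x tensor-parallel ranks (PP and CP add more dims).
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024)).cuda()

# Shard the two linear layers column-/row-wise across the "tp" sub-mesh.
parallelize_module(model, mesh["tp"], {"0": ColwiseParallel(), "1": RowwiseParallel()})

# Shard parameters across the "dp" sub-mesh with FSDP2.
fully_shard(model, mesh=mesh["dp"])
```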
Use Cases
- Pretraining LLMs from scratch at scale (8 to 512+ GPUs)
- Leveraging PyTorch-native solutions for distributed training
- Optimizing training performance with Float8 on H100 GPUs
- Achieving interoperable checkpoints with torchtune/HuggingFace
Non-Goals
- Fine-tuning LLMs (focus is pretraining)
- Supporting training outside PyTorch and its ecosystem
- Matching Megatron-LM's peak performance on NVIDIA-only deployments
- Offering inference support (focus is training)
Workflow
- Download tokenizer
- Configure training (TOML file)
- Launch training (script or torchrun)
- Monitor training (TensorBoard)
- Manage checkpoints (see the checkpointing sketch after this list)
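For the checkpoint-management step, torchtitan's distributed checkpoints build on PyTorch Distributed Checkpoint (DCP). The sketch below saves and reloads a sharded model state dict with DCP; the stand-in model, step number, and output path are placeholders rather than torchtitan's actual checkpoint layout.

```python
# Minimal DCP sketch: save and reload a (possibly sharded) model state dict.
# The model, step number, and checkpoint path are placeholders.
import torch.nn as nn
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict, set_model_state_dict,
)

model = nn.Linear(1024, 1024)              # stand-in for the training model
ckpt_dir = "outputs/checkpoint/step-1000"  # assumed path, not torchtitan's layout

# Save: each rank writes its own shards; works for FSDP2-sharded models too.
dcp.save({"model": get_model_state_dict(model)}, checkpoint_id=ckpt_dir)

# Load: DCP reshards at load time if the world size changed.
state = {"model": get_model_state_dict(model)}
dcp.load(state, checkpoint_id=ckpt_dir)
set_model_state_dict(model, state["model"])
```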
Practices
- Model Architecture
- Distributed Training
- Optimization
- LLM Pretraining
Prerequisites
- PyTorch >= 2.6.0
- TorchTitan >= 0.2.0
- TorchAO >= 0.5.0
- HuggingFace token for asset download (see the environment check after this list)
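A quick sanity check for these prerequisites. The package distribution names (torch, torchtitan, torchao) and the HF_TOKEN environment-variable convention are assumptions about how the environment is set up; adjust to your installation.

```python
# Assumed distribution names and the HF_TOKEN convention; adjust to your setup.
import os
import importlib.metadata as md

for pkg, minimum in (("torch", "2.6.0"), ("torchtitan", "0.2.0"), ("torchao", "0.5.0")):
    try:
        print(f"{pkg} {md.version(pkg)} (need >= {minimum})")
    except md.PackageNotFoundError:
        print(f"{pkg} not installed (need >= {minimum})")

print("HuggingFace token present:", bool(os.environ.get("HF_TOKEN")))
```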
Installation
First, add the marketplace, then install the skill:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills
Quality Score
Verified
Trust Signals
Similar Extensions
Distributed Llm Pretraining Torchtitan
Score 98. Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Ray Train
Score 99. Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
Pytorch Lightning
Score 99. High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.
Openrlhf Training
Score 99. High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.
Huggingface Accelerate
Score 99. Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.