Distributed LLM Pretraining with TorchTitan
Skill status: Verified, Active
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale, from 8 to 512+ GPUs, with Float8, torch.compile, and distributed checkpointing.
Purpose
To enable efficient, scalable pretraining of large language models natively in PyTorch, using composable parallelism and performance optimizations for maximum throughput.
Features
- PyTorch-native distributed LLM pretraining
- Composable 4D parallelism (FSDP2, TP, PP, CP)
- Support for Float8 training on H100 GPUs
- Pretraining for Llama 3.1, DeepSeek V3, and custom models
- Distributed checkpointing and efficient resumption
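The parallelism dimensions above compose through torchtitan's TOML job configs. The following is a rough sketch of the relevant fields only; section and key names are approximate and may differ between torchtitan versions, so check the repo's train_configs directory for the exact schema:

```toml
# Hypothetical sketch of a torchtitan job config; field names approximate.
[model]
name = "llama3"
flavor = "8B"

[training]
local_batch_size = 1
seq_len = 8192
compile = true                        # enable torch.compile

[parallelism]
data_parallel_shard_degree = -1       # FSDP2: shard over the remaining GPUs
tensor_parallel_degree = 2            # TP
pipeline_parallel_degree = 1          # PP
context_parallel_degree = 1           # CP

[float8]
enable_fsdp_float8_all_gather = true  # Float8 on H100-class GPUs
```

A job like this is typically launched with torchrun pointing at the config file; the product of the parallel degrees (times any data-parallel replicas) must equal the world size.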
Use Cases
- Pretraining large language models from scratch (8B to 405B+ parameters)
- Scaling LLM training across 8 to 512+ GPUs
- Optimizing training performance with Float8 and torch.compile
- Integrating custom models into a distributed training pipeline
Non-Goals
- Fine-tuning LLMs (use alternatives like Axolotl/TRL)
- Inference optimization (DeepSpeed offers a broader ecosystem for this)
- Simple single-GPU training (consider smaller educational frameworks)
- Squeezing out maximum NVIDIA-only performance outside the PyTorch-native stack (consider Megatron-LM)
Trust
- Issues attention: 17 issues opened and only 4 closed in the last 90 days, a closure rate below 50% that indicates a need for faster maintainer response.
Installation
npx skills add davila7/claude-code-templates
This runs the Vercel skills CLI (skills.sh) via npx. It requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, ...), and assumes the repo follows the agentskills.io format.
Quality Score: Verified
Similar Extensions
Ray Train (99)
Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
PyTorch Lightning (99)
High-level PyTorch framework with a Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), a callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with the same code. Use when you want clean training loops with built-in best practices.
OpenRLHF Training (99)
High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3. 2× faster than DeepSpeedChat with a distributed architecture and GPU resource sharing.
HuggingFace Accelerate (99)
Simplest distributed training API: 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP, automatic device placement, mixed precision (FP16/BF16/FP8), interactive config, single launch command. The HuggingFace ecosystem standard.