Distributed LLM Pretraining with torchtitan
Skill · Verified · Active
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale, from 8 to 512+ GPUs, with Float8, torch.compile, and distributed checkpointing.
To enable efficient and scalable pretraining of large language models natively within PyTorch, leveraging advanced parallelism and optimization techniques for maximum performance.
Features
- PyTorch-native distributed LLM pretraining
- Composable 4D parallelism (FSDP2, TP, PP, CP); see the sketch after this list
- Support for Float8 training on H100 GPUs
- Pretraining for Llama 3.1, DeepSeek V3, and custom models
- Distributed checkpointing and efficient resumption
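As a rough illustration of the composable, PyTorch-native style these features refer to, the sketch below shards a toy model with FSDP2 over one dimension of a 2D device mesh and wraps it in torch.compile. This is a minimal sketch under assumptions, not torchtitan's actual entry point: the 64-GPU dp×tp layout and the stand-in nn.Sequential model are hypothetical, and the fully_shard import path assumes a recent PyTorch release (it lives under torch.distributed._composable.fsdp in older ones). torchtitan drives the same primitives from its own training configs.

```python
# Sketch only: FSDP2 sharding composed over a named 2D device mesh.
# Run under torchrun with world_size = 64 for this particular layout.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 (recent PyTorch)

# Hypothetical 64-GPU job: 8-way data parallel x 8-way tensor parallel.
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "tp"))

# Stand-in for a transformer stack; torchtitan would build Llama/DeepSeek blocks here.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)])

# Shard each block, then the root module, over the data-parallel mesh dimension.
for block in model:
    fully_shard(block, mesh=mesh["dp"])
fully_shard(model, mesh=mesh["dp"])

# torch.compile composes with FSDP2; TP/PP/CP would be layered on the other
# mesh dimensions (e.g. via torch.distributed.tensor.parallel) as torchtitan does.
model = torch.compile(model)
```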
Use cases
- Pretraining large language models from scratch (8B to 405B+)
- Scaling LLM training across 8 to 512+ GPUs
- Optimizing training performance with Float8 and torch.compile (see the sketch after this list)
- Integrating custom models into a distributed training pipeline
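The Float8 plus torch.compile path mentioned in the list above can be sketched as follows. This is a minimal sketch assuming torchao's Float8 training API (convert_to_float8_training) and an H100-class GPU; the layer sizes, batch size, and learning rate are illustrative, and torchtitan normally enables this path through its training configuration rather than direct calls.

```python
# Sketch only: swap eligible nn.Linear layers to Float8 training, then compile.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumes torchao is installed

# Illustrative MLP; dimensions chosen to satisfy Float8 matmul alignment.
model = nn.Sequential(
    nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)
).to("cuda", torch.bfloat16)

# Replace eligible Linear layers with Float8 training variants (dynamic scaling).
convert_to_float8_training(model)

# torch.compile fuses the cast/scale ops, which is where most of the speedup comes from.
model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # dummy loss, just to exercise a step
loss.backward()
optimizer.step()
```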
Non-goals
- Fine-tuning LLMs (use alternatives such as Axolotl or TRL)
- Inference optimization (DeepSpeed offers a broader ecosystem here)
- Simple single-GPU training (consider smaller, educational frameworks)
- Maximum NVIDIA-only performance without PyTorch-native integration (consider Megatron-LM)
Trust
- Issues attention: 17 issues opened, 4 closed in the last 90 days, a closure rate below 50%, indicating a need for faster maintainer response.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.
Quality score: Verified
Similar extensions
Ray Train
Score 99. Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
PyTorch Lightning
Score 99. High-level PyTorch framework with a Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), a callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with the same code. Use when you want clean training loops with built-in best practices.
OpenRLHF Training
Score 99. High-performance RLHF framework with Ray + vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3. 2× faster than DeepSpeedChat, with a distributed architecture and GPU resource sharing.
HuggingFace Accelerate
Score 99. Simplest distributed training API: four lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP, automatic device placement, mixed precision (FP16/BF16/FP8), interactive config, single launch command. The HuggingFace ecosystem standard.