Distributed LLM Pretraining with torchtitan
Skill · Verified · Active
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale, from 8 to 512+ GPUs, with Float8, torch.compile, and distributed checkpointing.
To enable efficient and scalable pretraining of large language models natively within PyTorch, leveraging advanced parallelism and optimization techniques for maximum performance.
Features
- PyTorch-native distributed LLM pretraining
- Composable 4D parallelism (FSDP2, TP, PP, CP); see the sketch after this list
- Support for Float8 training on H100 GPUs
- Pretraining for Llama 3.1, DeepSeek V3, and custom models
- Distributed checkpointing and efficient resumption
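As a rough illustration of the composable, PyTorch-native style these features refer to, the sketch below shards a toy model with FSDP2 over one dimension of a 2D device mesh and wraps it in torch.compile. This is a minimal sketch under assumptions, not torchtitan's actual entry point: the 64-GPU dp×tp layout and the stand-in nn.Sequential model are hypothetical, and the fully_shard import path assumes a recent PyTorch release (it lives under torch.distributed._composable.fsdp in older ones). torchtitan drives the same primitives from its own training configs.

```python
# Sketch only: FSDP2 sharding composed over a named 2D device mesh.
# Run under torchrun with world_size = 64 for this particular layout.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 (recent PyTorch)

# Hypothetical 64-GPU job: 8-way data parallel x 8-way tensor parallel.
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "tp"))

# Stand-in for a transformer stack; torchtitan would build Llama/DeepSeek blocks here.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)])

# Shard each block, then the root module, over the data-parallel mesh dimension.
for block in model:
    fully_shard(block, mesh=mesh["dp"])
fully_shard(model, mesh=mesh["dp"])

# torch.compile composes with FSDP2; TP/PP/CP would be layered on the other
# mesh dimensions (e.g. via torch.distributed.tensor.parallel) as torchtitan does.
model = torch.compile(model)
```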
Use cases
- Pretraining large language models from scratch (8B to 405B+)
- Scaling LLM training across 8 to 512+ GPUs
- Optimizing training performance with Float8 and torch.compile (see the sketch after this list)
- Integrating custom models into a distributed training pipeline
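The Float8 plus torch.compile path mentioned in the list above can be sketched as follows. This is a minimal sketch assuming torchao's Float8 training API (convert_to_float8_training) and an H100-class GPU; the layer sizes, batch size, and learning rate are illustrative, and torchtitan normally enables this path through its training configuration rather than direct calls.

```python
# Sketch only: swap eligible nn.Linear layers to Float8 training, then compile.
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumes torchao is installed

# Illustrative MLP; dimensions chosen to satisfy Float8 matmul alignment.
model = nn.Sequential(
    nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)
).to("cuda", torch.bfloat16)

# Replace eligible Linear layers with Float8 training variants (dynamic scaling).
convert_to_float8_training(model)

# torch.compile fuses the cast/scale ops, which is where most of the speedup comes from.
model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # dummy loss, just to exercise a step
loss.backward()
optimizer.step()
```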
Non-goals
- Fine-tuning LLMs (use alternatives such as Axolotl or TRL)
- Inference optimization (DeepSpeed offers a broader ecosystem here)
- Simple single-GPU training (consider smaller, educational frameworks)
- Maximum NVIDIA-only performance without PyTorch-native integration (consider Megatron-LM)
Trust
- Issues attention: 17 issues opened, 4 closed in the last 90 days, a closure rate below 50%, indicating a need for faster maintainer response.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.
Quality score: Verified
Similar extensions
Ray Train
Score 99. Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
PyTorch Lightning
Score 99. High-level PyTorch framework with a Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), a callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with the same code. Use when you want clean training loops with built-in best practices.
OpenRLHF Training
Score 99. High-performance RLHF framework with Ray + vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3. 2× faster than DeepSpeedChat, with a distributed architecture and GPU resource sharing.
HuggingFace Accelerate
Score 99. Simplest distributed training API: four lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP, automatic device placement, mixed precision (FP16/BF16/FP8), interactive config, single launch command. The HuggingFace ecosystem standard.