
TorchTitan Distributed LLM Pretraining

Skill · Verified · Active

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Purpose

Enables efficient and scalable pretraining of large language models using PyTorch's native distributed training capabilities.

Features

  • 4D parallelism (FSDP2, TP, PP, CP)
  • PyTorch-native distributed training
  • Float8 training for H100 GPUs
  • Support for Llama 3.1, DeepSeek V3, and custom models
  • Distributed checkpointing and interoperability

Use Cases

  • Pretraining LLMs from scratch at scale (8 to 512+ GPUs)
  • Leveraging PyTorch-native solutions for distributed training
  • Optimizing training performance with Float8 on H100 GPUs
  • Achieving interoperable checkpoints with torchtune/HuggingFace

Non-Goals

  • Fine-tuning LLMs (the focus is pretraining)
  • Standalone use outside PyTorch and its ecosystem
  • Matching Megatron-LM's peak performance on NVIDIA-only deployments
  • Inference serving (the focus is training)

Workflow

  1. Download tokenizer
  2. Configure training (TOML file)
  3. Launch training (script or torchrun)
  4. Monitor training (TensorBoard)
  5. Manage checkpoints
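Steps 2 and 3 above center on a TOML config file. The sketch below is a minimal, illustrative config: the section layout is modeled on torchtitan's Llama 3 sample configs, but every field name here is an assumption and may differ between torchtitan versions, so verify against the sample configs shipped with your installation.

```toml
# Hypothetical torchtitan training config (field names are assumptions
# modeled on the llama3 sample configs; check your torchtitan version).

[job]
dump_folder = "./outputs"              # logs, TensorBoard data, checkpoints
description = "Llama 3.1 8B pretraining"

[model]
name = "llama3"
flavor = "8B"
tokenizer_path = "./assets/tokenizer"  # downloaded in step 1

[training]
seq_len = 8192
steps = 1000

[checkpoint]
enable = true
interval = 500                         # save a distributed checkpoint every 500 steps
```

Training is then typically launched by pointing the repo's launch script (or `torchrun` directly) at this file, e.g. something like `CONFIG_FILE=./my_config.toml ./run_train.sh`, and monitored by running TensorBoard against the dump folder. The exact script name and environment variable are assumptions drawn from torchtitan's documented workflow.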

Practices

  • Model Architecture
  • Distributed Training
  • Optimization
  • LLM Pretraining

Prerequisites

  • PyTorch >= 2.6.0
  • TorchTitan >= 0.2.0
  • TorchAO >= 0.5.0
  • HuggingFace token for asset download

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
99/100
Analyzed 1 day ago

Trust Signals

  • Last commit: 17 days ago
  • Stars: 8.3k
  • License: MIT

Similar Extensions

Distributed Llm Pretraining Torchtitan

98

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
davila7

Ray Train

99

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill
Orchestra-Research

Pytorch Lightning

99

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.

Skill
Orchestra-Research

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

Huggingface Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

HuggingFace Accelerate

97

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
Orchestra-Research