
TorchTitan Distributed LLM Pretraining

Skill · Verified · Active

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Purpose

Enables efficient and scalable pretraining of large language models using PyTorch's native distributed training capabilities.

Features

  • 4D parallelism (FSDP2, TP, PP, CP)
  • PyTorch-native distributed training
  • Float8 training for H100 GPUs
  • Support for Llama 3.1, DeepSeek V3, and custom models
  • Distributed checkpointing and interoperability

Use Cases

  • Pretraining LLMs from scratch at scale (8 to 512+ GPUs)
  • Leveraging PyTorch-native solutions for distributed training
  • Optimizing training performance with Float8 on H100 GPUs
  • Achieving interoperable checkpoints with torchtune/HuggingFace

Non-Goals

  • Fine-tuning LLMs (the focus is pretraining)
  • Standalone use outside PyTorch and its ecosystem
  • Matching Megatron-LM's peak performance on NVIDIA-only deployments
  • Inference serving (the focus is training)

Workflow

  1. Download tokenizer
  2. Configure training (TOML file)
  3. Launch training (script or torchrun)
  4. Monitor training (TensorBoard)
  5. Manage checkpoints
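Steps 2 and 3 above center on a TOML config file. The sketch below is a minimal, illustrative config: the section layout is modeled on torchtitan's Llama 3 sample configs, but every field name here is an assumption and may differ between torchtitan versions, so verify against the sample configs shipped with your installation.

```toml
# Hypothetical torchtitan training config (field names are assumptions
# modeled on the llama3 sample configs; check your torchtitan version).

[job]
dump_folder = "./outputs"              # logs, TensorBoard data, checkpoints
description = "Llama 3.1 8B pretraining"

[model]
name = "llama3"
flavor = "8B"
tokenizer_path = "./assets/tokenizer"  # downloaded in step 1

[training]
seq_len = 8192
steps = 1000

[checkpoint]
enable = true
interval = 500                         # save a distributed checkpoint every 500 steps
```

Training is then typically launched by pointing the repo's launch script (or `torchrun` directly) at this file, e.g. something like `CONFIG_FILE=./my_config.toml ./run_train.sh`, and monitored by running TensorBoard against the dump folder. The exact script name and environment variable are assumptions drawn from torchtitan's documented workflow.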

Practices

  • Model Architecture
  • Distributed Training
  • Optimization
  • LLM Pretraining

Prerequisites

  • PyTorch >= 2.6.0
  • TorchTitan >= 0.2.0
  • TorchAO >= 0.5.0
  • HuggingFace token for asset download

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
99/100
Analyzed 1 day ago

Trust Signals

  • Last commit: 17 days ago
  • Stars: 8.3k
  • License: MIT

Similar Extensions

Distributed Llm Pretraining Torchtitan

98

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
davila7

Ray Train

99

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill
Orchestra-Research

Pytorch Lightning

99

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.

Skill
Orchestra-Research

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

Huggingface Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

HuggingFace Accelerate

97

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
Orchestra-Research