
Distributed LLM Pretraining with torchtitan

Skill · Verified · Active

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Purpose

To enable efficient and scalable pretraining of large language models natively within PyTorch, leveraging advanced parallelism and optimization techniques for maximum performance.
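
As a rough illustration, a single-node run typically clones the upstream pytorch/torchtitan repository, installs its requirements, and launches one of the bundled TOML configs through the provided run script. The sketch below is a minimal example, not an exact recipe: the tokenizer download script, config path, and run_train.sh name follow the upstream repository and may differ between torchtitan versions.

# Minimal single-node sketch (8 GPUs, Llama 3.1 8B); paths and script names
# follow the upstream pytorch/torchtitan repo and may vary by version.
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan
pip install -r requirements.txt

# Download the Llama 3.1 tokenizer (needs a Hugging Face token with access).
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B --hf_token=...

# Launch pretraining with one of the bundled TOML configs.
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh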

Features

  • PyTorch-native distributed LLM pretraining
  • Composable 4D parallelism (FSDP2, TP, PP, CP); see the configuration sketch after this list
  • Support for Float8 training on H100 GPUs
  • Pretraining for Llama 3.1, DeepSeek V3, and custom models
  • Distributed checkpointing and efficient resumption
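
The parallelism degrees and the Float8/compile optimizations are composed through the job's TOML config, and the run script accepts command-line overrides in a section.option form. The sketch below is illustrative only; the option names are assumptions modeled on torchtitan's config sections and have changed across releases, so check the bundled train_configs for your version.

# Sketch: overriding parallelism and optimization settings at launch time.
# Option names (parallelism.*, model.converters, training.compile) are
# assumptions based on torchtitan's TOML sections; verify them against
# the version you have checked out.
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_70b.toml" ./run_train.sh \
  --parallelism.data_parallel_shard_degree 8 \
  --parallelism.tensor_parallel_degree 8 \
  --parallelism.pipeline_parallel_degree 2 \
  --parallelism.context_parallel_degree 1 \
  --model.converters float8 \
  --training.compile

The product of the degrees (times the data-parallel replicate degree, if used) has to match the total number of GPUs in the job; the example above composes to 128 GPUs.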

Use Cases

  • Pretraining large language models from scratch (8B to 405B+)
  • Scaling LLM training across 8 to 512+ GPUs (see the multi-node launch sketch after this list)
  • Optimizing training performance with Float8 and torch.compile
  • Integrating custom models into a distributed training pipeline
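
For multi-node jobs, one torchrun process is launched per node against a shared rendezvous endpoint; torchtitan ships a Slurm example for this pattern. The sketch below assumes a 16-node, 128-GPU cluster; HEAD_NODE_IP is a placeholder, and the training entry point and --job.config_file flag follow torchtitan's launch script but should be treated as assumptions that may differ by version.

# Sketch: run this command on each of 16 nodes (8 GPUs per node, 128 total).
# --nnodes/--nproc_per_node/--rdzv_* are standard torchrun flags; the entry
# point (shown here as torchtitan/train.py) has moved between versions and,
# like --job.config_file, is an assumption to verify against your checkout.
torchrun \
  --nnodes 16 \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint "${HEAD_NODE_IP}:29500" \
  torchtitan/train.py \
  --job.config_file ./torchtitan/models/llama3/train_configs/llama3_70b.toml

On Slurm, this command is typically wrapped in an sbatch script that exports the head node address and launches one copy per node via srun.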

Non-Goals

  • Fine-tuning LLMs (use alternatives like Axolotl/TRL)
  • Inference optimization (use DeepSpeed for broader ecosystem)
  • Simple single-GPU training (consider smaller educational frameworks)
  • Maximum NVIDIA-specific performance without native PyTorch integration (consider Megatron-LM)

Trust

  • Issues attention: 17 issues opened, 4 closed in the last 90 days, indicating a closure rate below 50% and a need for faster response.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified
98/100
Analyzed about 22 hours ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT
Status
View source code

Similar Extensions

TorchTitan Distributed LLM Pretraining

99

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
Orchestra-Research

Ray Train

99

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill
Orchestra-Research

PyTorch Lightning

99

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.

Skill
Orchestra-Research

OpenRLHF Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

HuggingFace Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

HuggingFace Accelerate

97

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
Orchestra-Research