
PyTorch FSDP2

Skill · Verified · Active

Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.

Purpose

Enables users to correctly implement PyTorch FSDP2 in their training scripts to manage large models that exceed single-GPU memory or require advanced sharding strategies.

Features

  • Adds PyTorch FSDP2 integration to training scripts.
  • Correct initialization, sharding, and mixed precision configuration.
  • Guidance on distributed checkpointing (DCP).
  • Handles models exceeding single-GPU memory.
  • Supports DTensor-based sharding with DeviceMesh.
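The init, per-layer sharding, and mixed-precision features above can be sketched as follows. This is a minimal, hedged example, assuming PyTorch ≥ 2.6 (where `fully_shard` and `MixedPrecisionPolicy` are exported from `torch.distributed.fsdp`) and an already-initialized process group; `apply_fsdp2` is an illustrative helper name, not part of the skill.

```python
import torch
import torch.nn as nn


def apply_fsdp2(model: nn.Module) -> nn.Module:
    """Shard a model with FSDP2 (fully_shard), per-layer then root.

    Sketch only: assumes torch >= 2.6 and an initialized process group
    (e.g. launched with torchrun).
    """
    from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

    # bf16 compute with fp32 gradient reduction -- a common mixed-precision setup
    mp = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    )
    # Shard each child module first so parameters are grouped per layer,
    # then wrap the root module to pick up any remaining parameters.
    for child in model.children():
        fully_shard(child, mp_policy=mp)
    fully_shard(model, mp_policy=mp)
    return model
```

After `fully_shard`, parameters become `DTensor`s sharded over the mesh; the training loop itself is unchanged.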

Use Cases

  • When models do not fit on a single GPU.
  • For users needing DTensor-based per-parameter sharding.
  • When composing DP with Tensor Parallelism using DeviceMesh.
  • Integrating FSDP2 into existing PyTorch training loops.

Non-Goals

  • Replacing standard DistributedDataParallel (DDP) when the model fits on one GPU.
  • Using older FSDP1 versions.
  • Providing backwards-compatible checkpoints across PyTorch versions without careful management.
  • Supporting PyTorch versions without the FSDP2 stack.
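The distributed checkpointing (DCP) guidance mentioned in the features, and the checkpoint-management caveat above, can be sketched with `torch.distributed.checkpoint`, which saves and reloads sharded state per rank without gathering the full model anywhere. A minimal sketch; the path and function name are illustrative:

```python
import torch.distributed.checkpoint as dcp


def save_and_reload(model, optimizer, path: str = "checkpoints/step_100") -> None:
    """Save and reload sharded FSDP2 state with DCP.

    Sketch only: DCP writes one file per rank and loads in-place,
    so no single rank ever materializes the full model. The `path`
    and this helper's name are illustrative.
    """
    state = {
        "model": model.state_dict(),   # DTensor shards under FSDP2
        "optim": optimizer.state_dict(),
    }
    dcp.save(state, checkpoint_id=path)

    # Reload: tensors in `state` are populated in-place from the checkpoint.
    dcp.load(state, checkpoint_id=path)
```

Because DCP checkpoints encode the sharding layout, resharding to a different world size is supported, but compatibility across PyTorch versions still needs the careful management noted above.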

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

PyTorch Lightning

99

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.

Skill
Orchestra-Research

HuggingFace Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

TorchTitan Distributed LLM Pretraining

99

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
Orchestra-Research

HuggingFace Accelerate

97

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
Orchestra-Research

PyTorch Lightning

100

Deep learning framework (PyTorch Lightning). Organize PyTorch code into LightningModules, configure Trainers for multi-GPU/TPU, implement data pipelines, callbacks, logging (W&B, TensorBoard), distributed training (DDP, FSDP, DeepSpeed), for scalable neural network training.

Skill
K-Dense-AI

Nnsight Remote Interpretability

99

Provides guidance for interpreting and manipulating neural network internals using nnsight with optional NDIF remote execution. Use when needing to run interpretability experiments on massive models (70B+) without local GPU resources, or when working with any PyTorch architecture.

Skill
davila7