Skip to main content

Ray Train

Skill Verified Active

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Purpose

To enable users to efficiently scale their machine learning training workloads from single machines to thousands of nodes, facilitating large-scale model training and hyperparameter sweeps.

Features

  • Distributed training orchestration
  • Scales PyTorch, TensorFlow, HuggingFace
  • Hyperparameter tuning with Ray Tune
  • Fault tolerance and elastic scaling
  • Multi-node cluster setup and management

Use Cases

  • Training massive machine learning models across multiple machines.
  • Running distributed hyperparameter optimization sweeps.
  • Scaling existing single-node training code to multi-GPU or multi-node environments with minimal changes.
  • Setting up and managing Ray clusters for distributed training on local, cloud, or Kubernetes environments.

Non-Goals

  • Providing a full ML framework (relies on PyTorch, TensorFlow, etc.)
  • Managing individual node hardware or low-level OS configuration
  • Replacing simpler single-GPU training solutions unless scaling is required

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
99 /100
Analyzed about 20 hours ago

Trust Signals

Last commit16 days ago
Stars8.3k
LicenseMIT
Status
View Source

Similar Extensions

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

Huggingface Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

Pytorch Lightning

99

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.

Skill
Orchestra-Research

TorchTitan Distributed LLM Pretraining

99

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
Orchestra-Research

Distributed Llm Pretraining Torchtitan

98

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
davila7

HuggingFace Accelerate

97

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
Orchestra-Research

© 2025 SkillRepo · Find the right skill, skip the noise.