
Ray Train

Skill · Verified · Active

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Purpose

To enable users to efficiently scale their machine learning training workloads from single machines to thousands of nodes, facilitating large-scale model training and hyperparameter sweeps.

Features

  • Distributed training orchestration
  • Scales PyTorch, TensorFlow, HuggingFace
  • Hyperparameter tuning with Ray Tune
  • Fault tolerance and elastic scaling
  • Multi-node cluster setup and management

Use Cases

  • Training massive machine learning models across multiple machines.
  • Running distributed hyperparameter optimization sweeps.
  • Scaling existing single-node training code to multi-GPU or multi-node environments with minimal changes.
  • Setting up and managing Ray clusters for distributed training on local, cloud, or Kubernetes environments.

Non-Goals

  • Providing a full ML framework (relies on PyTorch, TensorFlow, etc.)
  • Managing individual node hardware or low-level OS configuration
  • Replacing simpler single-GPU training solutions unless scaling is required

Installation

First add the Marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
99 / 100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source

Similar Extensions

OpenRLHF Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

HuggingFace Accelerate

99

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
davila7

PyTorch Lightning

99

High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.

Skill
Orchestra-Research

TorchTitan Distributed LLM Pretraining

99

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
Orchestra-Research

Distributed LLM Pretraining (torchtitan)

98

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

Skill
davila7

HuggingFace Accelerate

97

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

Skill
Orchestra-Research