Ray Data
Skill · Verified · Active
Scalable data processing for ML workloads. Streaming execution across CPU/GPU; supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, and TensorFlow. Scales from a single machine to hundreds of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.
Purpose: to enable scalable, efficient processing of large datasets for machine-learning workloads by leveraging distributed computing and GPU acceleration.
Features
- Scalable data processing for ML workloads
- Streaming execution across CPU/GPU
- Support for Parquet/CSV/JSON/images
- Integrates with Ray Train, PyTorch, TensorFlow
- Scales from single machine to 100s of nodes
Use cases
- Batch inference pipelines
- Distributed data preprocessing
- Multi-modal data loading
- Distributed ETL pipelines
Non-goals
- Processing small datasets on a single machine (use Pandas)
- Performing SQL-like operations on tabular data (use Dask/Spark)
- Enterprise ETL and complex SQL queries (use Spark)
Workflow
- Read data from various sources (cloud storage, Python objects).
- Transform data using vectorized or row-by-row operations, filtering, or grouping.
- Optionally accelerate transforms with GPUs.
- Write processed data to various formats (Parquet, CSV, JSON).
- Integrate with ML frameworks like PyTorch and TensorFlow for training.
Prerequisites
- ray[data]
- pyarrow
- pandas
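Assuming a standard pip setup, the prerequisites above can be installed with:

```shell
pip install "ray[data]" pyarrow pandas
```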
Code Execution
- Validation: The code demonstrates structured use of Ray Data APIs, but does not demonstrate schema-validation libraries (such as Zod or Pydantic) for input parameters.
Practical Utility
- Edge cases: The documentation covers core operations and performance, but does not detail failure modes (e.g., malformed input, rate limits on cloud storage) or their recovery steps.
Installation
Add the marketplace first:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the plugin:
/plugin install AI-Research-SKILLs@ai-research-skills
Verified similar extensions
Ray Train (quality score: 99)
Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from a laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
Polars (quality score: 99)
Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100 GB datasets, ETL pipelines, and as a faster pandas replacement. For larger-than-RAM data use Dask or Vaex.
Spark Engineer (quality score: 99)
Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.
OpenRLHF Training (quality score: 99)
High-performance RLHF framework with Ray + vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3. 2× faster than DeepSpeedChat, with a distributed architecture and GPU resource sharing.
Dask (quality score: 98)
Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, and integration with existing pandas code. For out-of-core analytics on a single machine use Vaex; for in-memory speed use Polars.