
Ray Data

Skill · Verified · Active

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

Purpose

To enable scalable and efficient processing of large datasets for machine learning workloads, leveraging distributed computing and GPU acceleration.

Features

  • Scalable data processing for ML workloads
  • Streaming execution across CPU/GPU
  • Support for Parquet/CSV/JSON/images
  • Integrates with Ray Train, PyTorch, TensorFlow
  • Scales from single machine to 100s of nodes

Use Cases

  • Batch inference pipelines
  • Distributed data preprocessing
  • Multi-modal data loading
  • Distributed ETL pipelines

Non-goals

  • Processing small datasets on a single machine (use Pandas)
  • Performing SQL-like operations on tabular data (use Dask/Spark)
  • Enterprise ETL and complex SQL queries (use Spark)

Workflow

  1. Read data from various sources (cloud storage, Python objects).
  2. Transform data using vectorized or row-by-row operations, filtering, or grouping.
  3. Optionally accelerate transforms with GPUs.
  4. Write processed data to various formats (Parquet, CSV, JSON).
  5. Integrate with ML frameworks like PyTorch and TensorFlow for training.

Prerequisites

  • ray[data]
  • pyarrow
  • pandas

Code Execution

  • Validation: the code demonstrates structured usage of Ray Data APIs, but explicit use of schema validation libraries (such as Zod or Pydantic) for input parameters is not shown.

Practical Utility

  • Edge cases: the documentation covers core operations and performance, but failure modes (e.g., malformed input, rate limits on cloud storage) and their recovery steps are not documented in detail.

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Similar Extensions

Ray Train

99

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill
Orchestra-Research

Ray Data

95

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

Skill
davila7

Polars

99

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

Skill
K-Dense-AI

Spark Engineer

99

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

Skill
jeffallan

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

Dask

98

Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, integration with existing pandas code. For out-of-core analytics on single machine use vaex; for in-memory speed use polars.

Skill
K-Dense-AI