
Ray Data

Skill · Verified · Active

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

Purpose

To enable scalable and efficient processing of large datasets for machine learning workloads, leveraging distributed computing and GPU acceleration.

Features

  • Scalable data processing for ML workloads
  • Streaming execution across CPU/GPU
  • Support for Parquet/CSV/JSON/images
  • Integrates with Ray Train, PyTorch, TensorFlow
  • Scales from single machine to 100s of nodes

Use Cases

  • Batch inference pipelines (see the sketch after this list)
  • Distributed data preprocessing
  • Multi-modal data loading
  • Distributed ETL pipelines
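
A minimal sketch of the batch-inference use case, assuming a recent Ray 2.x release in which map_batches accepts a callable class together with a concurrency argument; the in-memory dataset and the scaling "model" are illustrative placeholders, not part of the Ray API.

import ray

class Predictor:
    def __init__(self):
        # Placeholder for loading a real model checkpoint once per worker.
        self.scale = 2.0

    def __call__(self, batch):
        # Vectorized "inference" over one batch of rows.
        batch["prediction"] = batch["value"] * self.scale
        return batch

# Illustrative in-memory dataset; use read_parquet / read_images for real inputs.
ds = ray.data.from_items([{"value": float(i)} for i in range(1000)])

# Run the stateful predictor on an actor pool; add num_gpus=1 for GPU workers.
preds = ds.map_batches(Predictor, concurrency=2)
preds.write_parquet("/tmp/predictions")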

Non-Goals

  • Processing small datasets on a single machine (use Pandas)
  • Performing SQL-like operations on tabular data (use Dask/Spark)
  • Enterprise ETL and complex SQL queries (use Spark)

Workflow

  1. Read data from various sources (cloud storage, Python objects).
  2. Transform data using vectorized or row-by-row operations, filtering, or grouping.
  3. Optionally accelerate transforms with GPUs.
  4. Write processed data to various formats (Parquet, CSV, JSON).
  5. Integrate with ML frameworks like PyTorch and TensorFlow for training (a sketch of the full workflow follows).
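
A minimal sketch of this workflow in Python, assuming a recent Ray 2.x release with ray[data] and torch installed; the bucket paths and the "value" column are illustrative placeholders, not part of the Ray API.

import ray

# 1. Read data from cloud storage (the path is illustrative).
ds = ray.data.read_parquet("s3://my-bucket/raw/")

# 2. Transform data with a vectorized, batch-wise operation.
def normalize(batch):
    batch["value"] = batch["value"] / batch["value"].max()
    return batch

ds = ds.map_batches(normalize, batch_format="pandas")

# 3. Optionally request GPUs for a transform, e.g.:
#    ds = ds.map_batches(gpu_transform, num_gpus=1)

# 4. Write the processed data out as Parquet.
ds.write_parquet("s3://my-bucket/processed/")

# 5. Iterate over the dataset as framework-native batches, e.g. for PyTorch.
for batch in ds.iter_torch_batches(batch_size=256):
    pass  # training or inference step goes here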

Prerequisites

  • ray[data]
  • pyarrow
  • pandas
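
These can typically be installed with pip, for example:

pip install -U "ray[data]" pyarrow pandas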

Code Execution

  • Validation: While the code demonstrates structured usage of Ray Data APIs, explicit mention or demonstration of schema validation libraries (like Zod or Pydantic) for input parameters is not evident.

Practical Utility

  • Edge cases: While the documentation covers core operations and performance, explicit documentation of failure modes (e.g., malformed input, rate limits on cloud storage) and their recovery steps is not detailed.

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95 / 100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source code

Similar Extensions

Ray Train

99

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

Skill
Orchestra-Research

Ray Data

95

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

Skill
davila7

Polars

99

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

Skill
K-Dense-AI

Spark Engineer

99

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

Skill
jeffallan

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

Dask

98

Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, integration with existing pandas code. For out-of-core analytics on single machine use vaex; for in-memory speed use polars.

Skill
K-Dense-AI