Ray Data
Skill · Verified · Active
Scalable data processing for ML workloads. Streaming execution across CPU/GPU; supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, and TensorFlow. Scales from a single machine to hundreds of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.
Purpose: to enable scalable, efficient processing of large datasets for machine-learning workloads by leveraging distributed computing and GPU acceleration.
Features
- Scalable data processing for ML workloads
- Streaming execution across CPU/GPU
- Support for Parquet/CSV/JSON/images
- Integrates with Ray Train, PyTorch, TensorFlow
- Scales from single machine to 100s of nodes
Use cases
- Batch inference pipelines
- Distributed data preprocessing
- Multi-modal data loading
- Distributed ETL pipelines
Non-goals
- Processing small datasets on a single machine (use Pandas)
- Performing SQL-like operations on tabular data (use Dask/Spark)
- Enterprise ETL and complex SQL queries (use Spark)
Workflow
- Read data from various sources (cloud storage, Python objects).
- Transform data using vectorized or row-by-row operations, filtering, or grouping.
- Optionally accelerate transforms with GPUs.
- Write processed data to various formats (Parquet, CSV, JSON).
- Integrate with ML frameworks like PyTorch and TensorFlow for training.
Prerequisites
- ray[data]
- pyarrow
- pandas
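Assuming a standard pip setup, the prerequisites above can be installed with:

```shell
pip install "ray[data]" pyarrow pandas
```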
Code Execution
- Validation: The code demonstrates structured use of Ray Data APIs, but does not demonstrate schema-validation libraries (such as Zod or Pydantic) for input parameters.
Practical Utility
- Edge cases: The documentation covers core operations and performance, but does not detail failure modes (e.g., malformed input, rate limits on cloud storage) or their recovery steps.
Installation
Add the marketplace first:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the plugin:
/plugin install AI-Research-SKILLs@ai-research-skills
Verified similar extensions
Ray Train (quality score: 99)
Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from a laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
Polars (quality score: 99)
Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100 GB datasets, ETL pipelines, and as a faster pandas replacement. For larger-than-RAM data use Dask or Vaex.
Spark Engineer (quality score: 99)
Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.
OpenRLHF Training (quality score: 99)
High-performance RLHF framework with Ray + vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3. 2× faster than DeepSpeedChat, with a distributed architecture and GPU resource sharing.
Dask (quality score: 98)
Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, and integration with existing pandas code. For out-of-core analytics on a single machine use Vaex; for in-memory speed use Polars.