
GRPO RL Training

Skill · Verified · Active

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

Purpose

Enables users to effectively fine-tune language models using Group Relative Policy Optimization (GRPO) with TRL, particularly for tasks that require specific output formats, verifiable correctness, and enhanced reasoning capabilities.
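
For orientation, a minimal end-to-end GRPO run looks roughly like the sketch below (assuming `trl>=0.14`, which provides `GRPOTrainer` and `GRPOConfig`; the model, dataset, and reward function are illustrative, not prescribed by this skill).

```python
# Minimal GRPO fine-tuning sketch with TRL (illustrative, not from the skill).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt dataset works; trl-lib/tldr is the TRL quickstart example.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy verifiable reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-demo",
    num_generations=8,  # group size used for relative advantage estimation
    logging_steps=10,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```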

Features

  • Expert GRPO/RL training guidance
  • Production-ready workflow implementation
  • Multiple reward function examples (see the sketch after this list)
  • Hyperparameter tuning and optimization advice
  • Dataset preparation and model setup patterns
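
TRL's reward-function contract makes the multiple-reward-function pattern straightforward: each function receives the generated `completions` (plus any extra dataset columns as keyword arguments) and returns one float per completion, and `GRPOTrainer` combines the scores (weights can be set via `reward_weights` in `GRPOConfig`). A sketch, where the XML tag names and the `answer` dataset column are illustrative assumptions:

```python
import re

def format_reward(completions, **kwargs):
    # 1.0 if the completion follows the expected <reasoning>/<answer> layout.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # `answer` is assumed to be a column in the training dataset; TRL passes
    # extra dataset columns to reward functions as keyword arguments.
    found = [re.search(r"<answer>(.*?)</answer>", c, re.DOTALL) for c in completions]
    return [2.0 if m and m.group(1).strip() == a.strip() else 0.0
            for m, a in zip(found, answer)]

# Combined at trainer construction time:
# GRPOTrainer(..., reward_funcs=[format_reward, correctness_reward], ...)
```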

Use Cases

  • Enforcing specific output formats (XML, JSON); see the JSON reward sketch after this list
  • Teaching verifiable tasks with objective metrics
  • Improving model reasoning capabilities
  • Aligning models to domain-specific behaviors
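
For the JSON case, a verifiable reward can simply check that the output parses; a minimal sketch, assuming standard-format (plain string) completions and an arbitrary 0/1 scoring scale:

```python
import json

def json_validity_reward(completions, **kwargs):
    # 1.0 for completions that parse as valid JSON, 0.0 otherwise.
    rewards = []
    for c in completions:
        try:
            json.loads(c)
            rewards.append(1.0)
        except json.JSONDecodeError:
            rewards.append(0.0)
    return rewards
```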

Non-Goals

  • Simple supervised fine-tuning tasks
  • Tasks without clear reward signals
  • Replacing DPO/PPO when preference data is abundant

Code Execution

  • info (Logging): The template script includes `report_to="wandb"` and `logging_steps`, indicating that logging is configured, but a dedicated local audit file is not explicitly mentioned.
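
The knobs in question are standard `transformers` training arguments inherited by `GRPOConfig`; a sketch with illustrative values:

```python
from trl import GRPOConfig

args = GRPOConfig(
    output_dir="grpo-demo",
    logging_steps=10,   # emit training metrics every 10 steps
    report_to="wandb",  # stream metrics to Weights & Biases
)
```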

Compliance

  • info (Telemetry opt-in): The training script includes `report_to="wandb"`, suggesting telemetry is used; it defaults to ON, and the provided documentation does not explicitly detail an opt-in mechanism or schema.
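
If external telemetry is unwanted, the standard opt-out is to change `report_to` rather than any skill-specific switch; `"none"` is an accepted value for `transformers` training arguments:

```python
from trl import GRPOConfig

# Keep logs local only; "none" disables integration reporters such as wandb.
args = GRPOConfig(output_dir="grpo-demo", report_to="none")
```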

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

Fine Tuning With TRL

96

Fine-tune LLMs using reinforcement learning with TRL: SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward-model training. Use when you need RLHF, want to align a model with preferences, or train from human feedback. Works with HuggingFace Transformers.

Skill
Orchestra-Research

GRPO RL Training

76

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

Skill
davila7

Huggingface LLM Trainer

99

Train or fine-tune language and vision models using TRL (Transformer Reinforcement Learning) or Unsloth with Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO, and reward-modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts in PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, model selection/leaderboards, and model persistence. Use for tasks involving cloud GPU training or GGUF conversion, or when users mention training on Hugging Face Jobs without a local GPU setup.

Skill
huggingface

Slime RL Training

98

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

Skill
Orchestra-Research

SimPO Training

95

Simple Preference Optimization for LLM alignment. A reference-free alternative to DPO with better performance (+6.4 points on AlpacaEval 2.0). No reference model needed; more efficient than DPO. Use for preference alignment when you want simpler, faster training than DPO/PPO.

Skill
Orchestra-Research

Verl RL Training

95

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

Skill
davila7