
Fine Tuning With Trl

Skill · Verified · Active

Fine-tune LLMs using reinforcement learning with TRL: SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with human preferences, or want to train from human feedback. Works with HuggingFace Transformers.

Purpose

To enable users to fine-tune LLMs using various reinforcement learning methods and align them with human preferences or specific tasks.
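
As a concrete entry point, here is a minimal SFT sketch with TRL's SFTTrainer. It assumes a recent TRL release (which accepts a model id string directly); the Qwen model id and the trl-lib/Capybara dataset are illustrative placeholders, not choices mandated by this skill.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative conversational dataset for instruction tuning
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",               # illustrative base model id
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()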

Features

  • Supervised Fine-Tuning (SFT) for instruction tuning
  • Direct Preference Optimization (DPO) for preference alignment (see the sketch after this list)
  • Proximal Policy Optimization (PPO) for reward optimization
  • Group Relative Policy Optimization (GRPO) for memory-efficient RL
  • Reward model training for RLHF pipelines
  • Detailed workflows and code examples for each method
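
To make the DPO item above concrete, a minimal sketch with DPOTrainer follows. It assumes a preference dataset with chosen/rejected completion pairs; the model id, dataset, and beta value are illustrative, and processing_class is the keyword used by recent TRL releases.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative instruct model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference dataset with "chosen" and "rejected" completions per prompt
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-output", beta=0.1),  # beta scales the implicit KL penalty
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()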

Use Cases

  • Aligning LLMs with human preferences using preference data
  • Training instruction-following models
  • Performing full RLHF pipelines
  • Optimizing LLMs with minimal memory using GRPO (sketched below)
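
For the GRPO use case above, a minimal sketch: GRPO scores groups of sampled completions with a reward function rather than training a separate value model, which is where the memory savings come from. The length-based reward below is a toy stand-in, and the model and dataset ids are illustrative.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative prompt dataset
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer completions close to 200 characters
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # illustrative model id
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-output"),
    train_dataset=dataset,
)
trainer.train()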

Non-Goals

  • Basic fine-tuning without RL methods
  • Providing a GUI for training configuration
  • Hyperparameter optimization beyond standard guidance

Execution

  • Info (pinned dependencies): dependencies are listed in SKILL.md but are not pinned to specific versions or accompanied by a lockfile, which could lead to compatibility issues; a sketch of a pinned file follows.
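
As an illustration of the fix, a pinned requirements file might look like the sketch below. Every version number here is a hypothetical placeholder, not a version SKILL.md actually specifies or tests against.

# requirements.txt: hypothetical pins for illustration only;
# replace with the versions the skill is actually tested against
trl==0.12.1
transformers==4.46.3
datasets==3.1.0
accelerate==1.1.1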

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
96/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

Grpo Rl Training

95

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

Skill
Orchestra-Research

Verl Rl Training

95

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

Skill
davila7

Grpo Rl Training

76

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

Skill
davila7

Huggingface Llm Trainer

99

Train or fine-tune language and vision models using TRL (Transformer Reinforcement Learning) or Unsloth with Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, model selection/leaderboards and model persistence. Use for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.

Skill
huggingface

Openrlhf Training

99

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

Skill
Orchestra-Research

Verl Rl Training

99

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

Skill
Orchestra-Research