SimPO Training
Simple Preference Optimization (SimPO) for LLM alignment. A reference-free alternative to DPO with better performance (+6.4 points on AlpacaEval 2.0). No reference model is needed, making training more efficient than DPO. Use for preference alignment when you want simpler, faster training than DPO/PPO.
To provide an efficient, reference-free method for preference alignment of LLMs when simpler, faster training than DPO/PPO is desired.
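In brief, SimPO scores each response by its average per-token log-probability (scaled by β) and requires the chosen response to beat the rejected one by a target margin γ, with no reference-model term in the loss. A minimal PyTorch sketch of the objective follows; the function and argument names are illustrative, not the skill's actual API:

```python
# Minimal sketch of the SimPO objective, assuming summed log-probs and
# token counts for chosen/rejected responses are already computed.
# Unlike DPO, no reference-model log-probs are required.
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,    # summed log-probs of chosen responses, shape (B,)
    rejected_logps: torch.Tensor,  # summed log-probs of rejected responses, shape (B,)
    chosen_lens: torch.Tensor,     # token counts of chosen responses, shape (B,)
    rejected_lens: torch.Tensor,   # token counts of rejected responses, shape (B,)
    beta: float = 2.0,             # reward scaling (SimPO typically uses larger beta than DPO)
    gamma: float = 1.0,            # target reward margin
) -> torch.Tensor:
    # Length-normalized implicit rewards: average log-prob per token, scaled by beta.
    chosen_rewards = beta * chosen_logps / chosen_lens
    rejected_rewards = beta * rejected_logps / rejected_lens
    # Bradley-Terry loss with target margin gamma; no reference-model term.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```

The length normalization is what lets SimPO drop the reference model: it removes the length bias that raw summed log-probs would otherwise introduce.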
Features
- Reference-free preference optimization (SimPO)
- Outperforms DPO on benchmark evaluations
- More efficient training than DPO/PPO
- Detailed configurations for multiple LLM architectures
- Troubleshooting and hyperparameter tuning guidance
Use Cases
- Fine-tuning LLMs with preference data for alignment
- Training models when a reference model is unavailable or undesirable
- Achieving simpler and faster preference alignment compared to DPO/PPO
- Optimizing LLMs for specific task domains with preference feedback
Non-Goals
- Performing standard supervised fine-tuning (SFT)
- Implementing DPO or PPO directly
- Training LLM architectures that do not support preference data formats
- Providing pre-trained models (focus is on the training methodology)
Code Execution
- info: Validation. While the configuration is provided in YAML, explicit schema validation libraries like Zod or Pydantic are not evident for input arguments or structured output handling.
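A hypothetical sketch of what such validation could look like with Pydantic; the schema fields (model_name, beta, gamma, learning_rate) are assumptions for illustration, not the skill's actual config keys:

```python
# Illustrative Pydantic schema for validating a SimPO training YAML
# before a run starts. Field names here are placeholder assumptions.
from pydantic import BaseModel, Field
import yaml

class SimPOConfig(BaseModel):
    model_name: str
    beta: float = Field(default=2.0, gt=0)
    gamma: float = Field(default=1.0, ge=0)
    learning_rate: float = Field(default=6e-7, gt=0)

def load_config(path: str) -> SimPOConfig:
    # Raises a ValidationError with a clear message if the YAML is malformed.
    with open(path) as f:
        return SimPOConfig(**yaml.safe_load(f))
```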
Execution
- warning: Pinned dependencies. Dependencies are listed but not explicitly pinned with version numbers or lockfiles in SKILL.md, which could lead to compatibility issues.
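As a hedged illustration of why pinning matters, a run-time guard can fail fast when installed versions drift from the ones a training run was tested with; the package names and version strings below are placeholder assumptions:

```python
# Fail fast if installed dependency versions drift from tested ones.
# EXPECTED contents are illustrative, not the skill's actual requirements.
from importlib.metadata import version

EXPECTED = {"torch": "2.1.0", "transformers": "4.36.0"}

for pkg, want in EXPECTED.items():
    got = version(pkg)
    assert got == want, f"{pkg}=={got}, expected {want}; pin your dependencies"
```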
Installation
First add the Marketplace, then install the plugin:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
Unsloth
Score: 100. Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization.
Implementing Llms Litgpt
Score: 100. Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when you need clean model implementations, an educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
Arize Prompt Optimization
Score: 100. Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
Prompt Optimization
Score: 100. Applies prompt repetition to improve the accuracy of non-reasoning LLMs.
Fine Tuning With Trl
Score: 96. Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with preferences, or train from human feedback. Works with HuggingFace Transformers.
Chat Format
Score: 100. Formats prompts for different LLM providers with chat templates and HNSW-powered context retrieval.