
Constitutional AI

Skill · Verified · Active

Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.

Purpose

To provide a clear understanding and practical implementation guide for Constitutional AI, enabling users to train AI models for safety alignment and reduce harmful outputs.

Features

  • Explains Constitutional AI methodology
  • Details two-phase training approach (SL and RLAIF)
  • Provides Python code examples for each phase
  • Addresses common issues and offers solutions
  • Outlines hardware and compute requirements

Use cases

  • Training AI models for safety alignment
  • Reducing harmful outputs in AI systems
  • Implementing explainable AI decisions
  • Scalable AI safety training without human labels

Non-goals

  • Directly performing RLHF training
  • Providing a pre-trained moderation model like LlamaGuard
  • Runtime content filtering solutions like NeMo Guardrails

Workflow

  1. Generate initial responses using a base model.
  2. Critique responses against a constitution.
  3. Revise responses based on critiques.
  4. Fine-tune the model on revised responses (SL phase).
  5. Generate comparison pairs of responses.
  6. Evaluate AI preferences based on the constitution.
  7. Train a preference model (reward model).
  8. Perform RL training using RLAIF (RL phase).
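The SL-phase loop (steps 1-4) can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `generate` is a stub standing in for a real model call (e.g. a `transformers` pipeline), and the constitution and prompt templates shown are hypothetical.

```python
# Sketch of the Constitutional AI supervised-learning phase (steps 1-4).
# `generate` is a placeholder for any chat-model call; it is stubbed here
# so the control flow is runnable end to end.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Avoid assisting with dangerous or illegal activity.",
]

def generate(prompt: str) -> str:
    # Placeholder for a real model call, e.g. pipeline("text-generation")(...)
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(prompt: str, n_rounds: int = 2) -> list[dict]:
    """Run the critique/revision loop and collect (prompt, revision) pairs."""
    response = generate(prompt)                       # step 1: initial response
    records = []
    for principle in CONSTITUTION[:n_rounds]:
        critique = generate(                          # step 2: self-critique
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        response = generate(                          # step 3: revision
            f"Rewrite the response to address the critique:\n"
            f"Critique: {critique}\nResponse: {response}"
        )
        records.append({"prompt": prompt, "revision": response})
    return records

# Step 4: fine-tune the base model on the collected revisions, e.g. with
# trl's SFTTrainer, using `records` as the training dataset.
pairs = critique_and_revise("Explain how medication safety caps work.")
print(len(pairs))
```

In the RL phase (steps 5-8), the same stubbed `generate` would instead be asked to pick the better of two responses under the constitution, producing preference labels for the reward model.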

Practices

  • Safety Alignment
  • AI Training
  • Reinforcement Learning
  • Self-Critique
  • AI Feedback

Prerequisites

  • Python 3.7+
  • NVIDIA GPU (A100/H100 recommended)
  • transformers, torch, trl libraries
  • Sufficient VRAM (e.g., 40GB for 7B models)
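As a sanity check on the VRAM figure above, here is a back-of-the-envelope estimate (assuming fp16 weights, fp16 gradients, and fp32 Adam moments; activations and KV cache excluded). 40 GB covers fp16 inference on a 7B model with headroom, while full fine-tuning needs far more, which is why parameter-efficient methods such as LoRA are common:

```python
# Rough VRAM estimate for a 7B-parameter model (bytes per parameter x params).
params = 7e9
infer_fp16_gb = params * 2 / 1e9          # fp16 weights only (inference)
full_ft_gb = params * (2 + 2 + 8) / 1e9   # + fp16 grads + fp32 Adam m and v
print(infer_fp16_gb, full_ft_gb)          # 14.0 84.0
```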

Trust

  • Info: Issues attention — `openIssues90d` is 17 and `closedIssues90d` is 4. The closure rate is below 50%, indicating slower responsiveness to new issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.

Quality score

Verified
95/100
Analyzed 1 day ago

Trust signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar extensions

Constitutional AI

98

Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.

Skill
Orchestra-Research

LlamaGuard

95

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

Skill
Orchestra-Research

LlamaGuard

75

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

Skill
davila7

Fixflow

100

Executes coding tasks with a strict delivery workflow: build a complete plan, implement step by step, run tests continuously, and commit after every step (`per_step`) by default. Supports explicit commit-strategy overrides (`final_only`, `milestone`) and optional BDD (Given/When/Then) when the user asks for behavior-driven delivery or requirements are unclear.

Skill
majiayu000

Prompt Guard

100

Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2ms GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.

Skill
Orchestra-Research

Gws Modelarmor Sanitize Prompt

99

Google Model Armor: Sanitize a user prompt through a Model Armor template.

Skill
googleworkspace