
Constitutional AI

Skill Verified Active

Anthropic's method for training a harmless AI assistant through self-improvement. It has two phases: supervised learning on self-critiqued and revised responses, then RLAIF (reinforcement learning from AI feedback). Use it for safety alignment and for reducing harmful outputs without human labels for harmfulness. Anthropic uses this method in training Claude.

Purpose

To provide a clear understanding and practical implementation guide for Constitutional AI, enabling users to train AI models for safety alignment and reduce harmful outputs.

Features

  • Explains Constitutional AI methodology
  • Details two-phase training approach (SL and RLAIF)
  • Provides Python code examples for each phase
  • Addresses common issues and offers solutions
  • Outlines hardware and compute requirements

Use Cases

  • Training AI models for safety alignment
  • Reducing harmful outputs in AI systems
  • Implementing explainable AI decisions
  • Scalable AI safety training without human labels

Non-Goals

  • Directly performing RLHF training
  • Providing a pre-trained moderation model like LlamaGuard
  • Runtime content filtering solutions like NeMo Guardrails

Workflow

  1. Generate initial responses using a base model.
  2. Critique responses against a constitution.
  3. Revise responses based on critiques.
  4. Fine-tune the model on revised responses (SL phase).
  5. Generate comparison pairs of responses.
  6. Evaluate AI preferences based on the constitution.
  7. Train a preference model (reward model).
  8. Perform RL training using RLAIF (RL phase).
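The data-generation steps of the workflow above (steps 1-3 and 5-6) can be sketched in a few lines. This is a minimal illustration, not the skill's own code: `CONSTITUTION`, `critique_and_revise`, `label_preference`, the prompt wording, and `stub_generate` are all hypothetical names introduced here, and `generate` stands for any prompt-to-text function you supply (for example, a wrapper around a language model). It applies every principle in turn and takes a hard A/B verdict, which is a simplification; sampling one principle per step or using soft labels are possible alternatives.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical two-principle constitution; a real one would be longer.
CONSTITUTION = [
    "Choose the response that is least harmful or dangerous.",
    "Choose the response that is most honest and helpful.",
]

Generate = Callable[[str], str]  # prompt text in, completion text out


def critique_and_revise(
    generate: Generate, prompt: str, principles: List[str]
) -> Tuple[str, str]:
    """Steps 1-3: draft a response, then critique and revise it once per principle."""
    initial = generate(f"Human: {prompt}\n\nAssistant:")
    response = initial
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique.\nRevision:"
        )
    # (prompt, revised response) pairs become the fine-tuning data for step 4.
    return initial, response


def label_preference(
    generate: Generate, prompt: str, response_a: str, response_b: str, principle: str
) -> Optional[dict]:
    """Steps 5-6: ask the model which response better follows a principle."""
    verdict = generate(
        f"Consider this principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B.\nAnswer:"
    ).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    elif verdict.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        return None  # unparseable verdict: drop the pair rather than guess
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def stub_generate(text: str) -> str:
    """Stand-in for a real model so the sketch runs without a GPU."""
    if text.endswith("Critique:"):
        return "The response could be safer."
    if text.endswith("Revision:"):
        return "A safer response."
    if text.endswith("Answer:"):
        return "B"
    return "An initial response."


if __name__ == "__main__":
    print(critique_and_revise(stub_generate, "Example prompt", CONSTITUTION))
    print(label_preference(stub_generate, "Example prompt", "first", "second", CONSTITUTION[0]))
```

The `chosen`/`rejected` records follow a common layout for preference data, so they can feed the preference-model training in step 7. Swapping `stub_generate` for a real model call is the only change needed to produce actual training data.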

Practices

  • Safety Alignment
  • AI Training
  • Reinforcement Learning
  • Self-Critique
  • AI Feedback

Prerequisites

  • Python 3.7+
  • NVIDIA GPU (A100/H100 recommended)
  • transformers, torch, trl libraries
  • Sufficient VRAM (e.g., 40GB for 7B models)
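
As a rough sanity check on the VRAM figure above, the arithmetic can be sketched as follows. The helper `estimate_vram_gb` is a hypothetical name, and the per-parameter costs are assumptions: about 2 bytes per parameter for fp16/bf16 weights, and about 16 bytes per parameter for full fine-tuning with Adam in mixed precision (weights, gradients, an fp32 master copy, and two optimizer moments), with activations excluded in both cases.

```python
def estimate_vram_gb(n_params: float, bytes_per_param: float) -> float:
    """Return memory in decimal gigabytes for n_params at bytes_per_param each."""
    return n_params * bytes_per_param / 1e9


SEVEN_B = 7e9  # parameter count of a 7B model

# Assumption: half-precision weights only (generation, critique, revision).
weights_only_gb = estimate_vram_gb(SEVEN_B, 2)

# Assumption: full fine-tuning with mixed-precision Adam, activations excluded.
full_finetune_gb = estimate_vram_gb(SEVEN_B, 16)

print(f"weights only:     {weights_only_gb:.0f} GB")
print(f"full fine-tuning: {full_finetune_gb:.0f} GB")
```

Under these assumptions a 40GB card has headroom for the generation steps, while full fine-tuning of a 7B model would likely need parameter-efficient methods or sharding across several GPUs.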

Trust

  • Issues (attention): 17 open and 4 closed issues in the last 90 days. The closure rate is below 50%, which suggests slow responsiveness to new issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified: 95/100
Analyzed about 19 hours ago

Trust Signals

Last commit: about 21 hours ago
Stars: 27.2k
License: MIT

