Constitutional AI
Anthropic's method for training harmless AI through self-improvement. Two-phase approach: supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.
Purpose
This skill provides a clear understanding of and a practical implementation guide for Constitutional AI, enabling users to train AI models for safety alignment and reduce harmful outputs.
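At its core, a "constitution" is just a set of natural-language principles plus prompt templates that ask the model to critique and revise its own output against them. The principles and template wording below are illustrative placeholders, not Anthropic's actual constitution:

```python
# A toy "constitution": natural-language principles the model applies to its
# own outputs. These example principles are hypothetical, not Anthropic's.
CONSTITUTION = [
    "Choose the response that is least likely to encourage illegal or harmful activity.",
    "Choose the response that is most helpful, honest, and harmless.",
]

# Prompt template for the self-critique step (hypothetical wording).
CRITIQUE_TEMPLATE = (
    "Here is a response to a user request:\n{response}\n\n"
    "Critique the response according to this principle: {principle}"
)

def build_critique_prompt(response: str, principle: str) -> str:
    """Fill the critique template for one (response, principle) pair."""
    return CRITIQUE_TEMPLATE.format(response=response, principle=principle)
```

In practice, a principle is typically sampled at random for each critique pass, so the whole constitution shapes the training data without every principle being applied to every example.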
Features
- Explains Constitutional AI methodology
- Details two-phase training approach (SL and RLAIF)
- Provides Python code examples for each phase
- Addresses common issues and offers solutions
- Outlines hardware and compute requirements
Use Cases
- Training AI models for safety alignment
- Reducing harmful outputs in AI systems
- Implementing explainable AI decisions
- Scalable AI safety training without human labels
Non-Goals
- Directly performing RLHF training
- Providing a pre-trained moderation model like LlamaGuard
- Runtime content filtering solutions like NeMo Guardrails
Workflow
- Generate initial responses using a base model.
- Critique responses against a constitution.
- Revise responses based on critiques.
- Fine-tune the model on revised responses (SL phase).
- Generate comparison pairs of responses.
- Evaluate AI preferences based on the constitution.
- Train a preference model (reward model).
- Perform RL training using RLAIF (RL phase).
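The workflow above can be sketched end to end with a stub standing in for a real LLM call (in practice, `stub_generate` would be replaced by, e.g., a `transformers` text-generation pipeline). Everything here is a hypothetical scaffold to show the data flow, not Anthropic's implementation:

```python
import random

def stub_generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a transformers pipeline)."""
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(prompt: str, principle: str) -> dict:
    """SL phase: generate a response, critique it against one constitutional
    principle, then revise it based on the critique."""
    initial = stub_generate(prompt)
    critique = stub_generate(f"Critique this response per '{principle}': {initial}")
    revised = stub_generate(f"Rewrite the response to address this critique: {critique}")
    # The (prompt, revised) pairs become the supervised fine-tuning dataset.
    return {"prompt": prompt, "initial": initial, "revised": revised}

def ai_preference(prompt: str, resp_a: str, resp_b: str, principle: str) -> int:
    """RLAIF phase: ask the AI which response better follows the principle.
    Returns 0 for resp_a, 1 for resp_b. Stubbed with a coin flip here; a real
    model would answer the comparison prompt and its answer would be parsed."""
    _ = stub_generate(
        f"Per '{principle}', which response to '{prompt}' is better? "
        f"A: {resp_a} B: {resp_b}"
    )
    return random.randint(0, 1)

principle = "Choose the least harmful response."
sl_example = critique_and_revise("How do I stay safe online?", principle)
label = ai_preference(sl_example["prompt"], sl_example["initial"],
                      sl_example["revised"], principle)
# sl_example feeds SL fine-tuning; (response pair, label) feeds the preference
# model, whose scores then serve as the reward signal for RL training.
```

The design point is that the same constitution drives both phases: as critique/revision instructions in the SL phase, and as comparison criteria in the RLAIF phase.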
Practices
- Safety Alignment
- AI Training
- Reinforcement Learning
- Self-Critique
- AI Feedback
Prerequisites
- Python 3.7+
- NVIDIA GPU (A100/H100 recommended)
- transformers, torch, trl libraries
- Sufficient VRAM (e.g., 40GB for 7B models)
Trust
- Issue responsiveness: 17 issues opened and 4 closed in the last 90 days. The closure rate is below 50%, indicating slower responsiveness to new issues.
Installation
```shell
npx skills add davila7/claude-code-templates
```
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Similar Extensions
- LlamaGuard (95): Meta's 7-8B specialized moderation model for LLM input/output filtering. Six safety categories: violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, SageMaker. Integrates with NeMo Guardrails.
- Fixflow (100): Execute coding tasks with a strict delivery workflow: build a full plan, implement one step at a time, run tests continuously, and commit by default after each step (`per_step`). Supports explicit commit policy overrides (`final_only`, `milestone`) and optional BDD (Given/When/Then) when users ask for behavior-driven delivery or requirements are unclear.
- Prompt Guard (100): Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2 ms on GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.
- Gws Modelarmor Sanitize Prompt (99): Google Model Armor: sanitize a user prompt through a Model Armor template.