Constitutional AI
Anthropic's method for training harmless AI through self-improvement. Two-phase approach: supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.
Purpose
This skill provides a clear understanding of and a practical implementation guide for Constitutional AI, enabling users to train AI models for safety alignment and reduce harmful outputs.
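At its core, a "constitution" is just a set of natural-language principles plus prompt templates that ask the model to critique and revise its own output against them. The principles and template wording below are illustrative placeholders, not Anthropic's actual constitution:

```python
# A toy "constitution": natural-language principles the model applies to its
# own outputs. These example principles are hypothetical, not Anthropic's.
CONSTITUTION = [
    "Choose the response that is least likely to encourage illegal or harmful activity.",
    "Choose the response that is most helpful, honest, and harmless.",
]

# Prompt template for the self-critique step (hypothetical wording).
CRITIQUE_TEMPLATE = (
    "Here is a response to a user request:\n{response}\n\n"
    "Critique the response according to this principle: {principle}"
)

def build_critique_prompt(response: str, principle: str) -> str:
    """Fill the critique template for one (response, principle) pair."""
    return CRITIQUE_TEMPLATE.format(response=response, principle=principle)
```

In practice, a principle is typically sampled at random for each critique pass, so the whole constitution shapes the training data without every principle being applied to every example.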
Features
- Explains Constitutional AI methodology
- Details two-phase training approach (SL and RLAIF)
- Provides Python code examples for each phase
- Addresses common issues and offers solutions
- Outlines hardware and compute requirements
Use Cases
- Training AI models for safety alignment
- Reducing harmful outputs in AI systems
- Implementing explainable AI decisions
- Scalable AI safety training without human labels
Non-Goals
- Directly performing RLHF training
- Providing a pre-trained moderation model like LlamaGuard
- Runtime content filtering solutions like NeMo Guardrails
Workflow
- Generate initial responses using a base model.
- Critique responses against a constitution.
- Revise responses based on critiques.
- Fine-tune the model on revised responses (SL phase).
- Generate comparison pairs of responses.
- Evaluate AI preferences based on the constitution.
- Train a preference model (reward model).
- Perform RL training using RLAIF (RL phase).
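The workflow above can be sketched end to end with a stub standing in for a real LLM call (in practice, `stub_generate` would be replaced by, e.g., a `transformers` text-generation pipeline). Everything here is a hypothetical scaffold to show the data flow, not Anthropic's implementation:

```python
import random

def stub_generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a transformers pipeline)."""
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(prompt: str, principle: str) -> dict:
    """SL phase: generate a response, critique it against one constitutional
    principle, then revise it based on the critique."""
    initial = stub_generate(prompt)
    critique = stub_generate(f"Critique this response per '{principle}': {initial}")
    revised = stub_generate(f"Rewrite the response to address this critique: {critique}")
    # The (prompt, revised) pairs become the supervised fine-tuning dataset.
    return {"prompt": prompt, "initial": initial, "revised": revised}

def ai_preference(prompt: str, resp_a: str, resp_b: str, principle: str) -> int:
    """RLAIF phase: ask the AI which response better follows the principle.
    Returns 0 for resp_a, 1 for resp_b. Stubbed with a coin flip here; a real
    model would answer the comparison prompt and its answer would be parsed."""
    _ = stub_generate(
        f"Per '{principle}', which response to '{prompt}' is better? "
        f"A: {resp_a} B: {resp_b}"
    )
    return random.randint(0, 1)

principle = "Choose the least harmful response."
sl_example = critique_and_revise("How do I stay safe online?", principle)
label = ai_preference(sl_example["prompt"], sl_example["initial"],
                      sl_example["revised"], principle)
# sl_example feeds SL fine-tuning; (response pair, label) feeds the preference
# model, whose scores then serve as the reward signal for RL training.
```

The design point is that the same constitution drives both phases: as critique/revision instructions in the SL phase, and as comparison criteria in the RLAIF phase.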
Practices
- Safety Alignment
- AI Training
- Reinforcement Learning
- Self-Critique
- AI Feedback
Prerequisites
- Python 3.7+
- NVIDIA GPU (A100/H100 recommended)
- transformers, torch, trl libraries
- Sufficient VRAM (e.g., 40GB for 7B models)
Trust
- Issue responsiveness: 17 issues opened and 4 closed in the last 90 days. The closure rate is below 50%, indicating slower responsiveness to new issues.
Installation
```shell
npx skills add davila7/claude-code-templates
```
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Similar Extensions
- LlamaGuard (95): Meta's 7-8B specialized moderation model for LLM input/output filtering. Six safety categories: violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, SageMaker. Integrates with NeMo Guardrails.
- Fixflow (100): Execute coding tasks with a strict delivery workflow: build a full plan, implement one step at a time, run tests continuously, and commit by default after each step (`per_step`). Supports explicit commit policy overrides (`final_only`, `milestone`) and optional BDD (Given/When/Then) when users ask for behavior-driven delivery or requirements are unclear.
- Prompt Guard (100): Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2 ms on GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.
- Gws Modelarmor Sanitize Prompt (99): Google Model Armor: sanitize a user prompt through a Model Armor template.