Constitutional AI
Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.
To provide a clear explanation of Constitutional AI and a practical implementation guide, enabling users to train AI models for safety alignment and reduce harmful outputs.
Features
- Explains Constitutional AI methodology
- Details two-phase training approach (SL and RLAIF)
- Provides Python code examples for each phase
- Addresses common issues and offers solutions
- Outlines hardware and compute requirements
Use Cases
- Training AI models for safety alignment
- Reducing harmful outputs in AI systems
- Making AI safety decisions explainable via explicit written principles
- Scalable AI safety training without human labels
Non-Goals
- Directly performing RLHF training
- Providing a pre-trained moderation model like LlamaGuard
- Runtime content filtering solutions like NeMo Guardrails
Workflow
- Generate initial responses using a base model.
- Critique responses against a constitution.
- Revise responses based on critiques.
- Fine-tune the model on revised responses (SL phase; a minimal sketch of this critique-revision loop follows the list).
- Generate comparison pairs of responses.
- Evaluate AI preferences based on the constitution.
- Train a preference model (reward model).
- Perform RL training using RLAIF (RL phase; see the preference-labeling sketch below).
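A minimal Python sketch of the SL phase (generate, critique, revise), assuming a locally available HuggingFace chat model. The model name and the single constitutional principle below are illustrative placeholders, not Anthropic's full constitution.

```python
# Minimal sketch of the SL phase: generate -> self-critique -> revise.
# Model name and PRINCIPLE are placeholders, not the method's actual values.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder; any chat model works
    device_map="auto",
)

PRINCIPLE = (
    "Identify specific ways in which the last response is harmful, unethical, "
    "or otherwise objectionable."
)

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    # The pipeline returns prompt + completion; strip the prompt.
    return out[0]["generated_text"][len(prompt):].strip()

def critique_and_revise(user_prompt: str) -> dict:
    # Step 1: initial response from the base (helpful-only) model.
    response = generate(f"Human: {user_prompt}\n\nAssistant:")
    # Step 2: self-critique against a constitutional principle.
    critique = generate(
        f"Human: {user_prompt}\n\nAssistant: {response}\n\n"
        f"Critique request: {PRINCIPLE}\n\nCritique:"
    )
    # Step 3: revision conditioned on the critique.
    revision = generate(
        f"Human: {user_prompt}\n\nAssistant: {response}\n\n"
        f"Critique: {critique}\n\nRevision request: Rewrite the response to "
        "address the critique.\n\nRevised response:"
    )
    # The (prompt, revision) pairs become the SL fine-tuning dataset.
    return {"prompt": user_prompt, "response": revision}
```

In practice the principles are sampled from a constitution and the critique-revision loop can run for several rounds before fine-tuning on the revised responses.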
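For the RL phase, a similarly hedged sketch of AI preference labeling; it reuses the `generate` helper from the sketch above, and the comparison prompt is a paraphrase rather than Anthropic's exact template. Training the preference (reward) model and the subsequent RL step, e.g. with the `trl` library listed in the prerequisites, are omitted.

```python
# Sketch of RLAIF preference labeling; assumes `generate` from the SL sketch.
def label_preference(user_prompt: str) -> dict:
    # Sample two candidate responses from the SL-trained model.
    a = generate(f"Human: {user_prompt}\n\nAssistant:")
    b = generate(f"Human: {user_prompt}\n\nAssistant:")
    # Ask the feedback model to choose the better response per the constitution.
    verdict = generate(
        "Consider the conversation and the two responses below.\n"
        f"Human: {user_prompt}\n\n(A) {a}\n\n(B) {b}\n\n"
        "Which response is more harmless, ethical, and honest? "
        "Answer (A) or (B):",
        max_new_tokens=8,
    )
    chosen, rejected = (a, b) if "A" in verdict else (b, a)
    # (prompt, chosen, rejected) triples train the preference model, which then
    # provides the reward signal for RL training (e.g., PPO via trl).
    return {"prompt": user_prompt, "chosen": chosen, "rejected": rejected}
```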
Practices
- Safety Alignment
- AI Training
- Reinforcement Learning
- Self-Critique
- AI Feedback
Prerequisites
- Python 3.7+
- NVIDIA GPU (A100/H100 recommended)
- transformers, torch, trl libraries
- Sufficient VRAM (e.g., 40 GB for 7B models); see the environment check below
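A small sketch to check that the environment meets these prerequisites; the 40 GB threshold mirrors the guidance above.

```python
# Verify GPU availability and VRAM before launching training.
import torch

assert torch.cuda.is_available(), "An NVIDIA GPU is required"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.0f} GB")
if vram_gb < 40:
    print("Warning: under 40 GB VRAM; a 7B model may need quantization or offloading")
```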
Trust
- Issues attention: 17 issues were opened and 4 closed in the last 90 days. The closure rate is below 50%, indicating slower responsiveness to new issues.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.
Quality Score: Verified
Similar Extensions
Constitutional AI
98 - Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.
Llamaguard
95 - Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.
Fixflow
100 - Executes coding tasks with a strict delivery workflow: build a complete plan, implement step by step, run tests continuously, and commit after each step (`per_step`) by default. Supports explicit commit-policy overrides (`final_only`, `milestone`) and optional BDD (Given/When/Then) when the user asks for behavior-driven delivery or requirements are unclear.
Prompt Guard
100 - Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2ms GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.
Gws Modelarmor Sanitize Prompt
99 - Google Model Armor: Sanitize a user prompt through a Model Armor template.