Constitutional AI
Skill, Verified, Active
Anthropic's method for training harmless AI through self-improvement. It uses a two-phase approach: supervised learning with self-critique and revision, followed by RLAIF (RL from AI Feedback). Use it for safety alignment and for reducing harmful outputs without human labels for harmful content. It powers Claude's safety system.
This skill gives a clear explanation of Constitutional AI and a practical implementation guide, so that users can train AI models for safety alignment and reduce harmful outputs.
Features
- Explains Constitutional AI methodology
- Details two-phase training approach (SL and RLAIF)
- Provides Python code examples for each phase
- Addresses common issues and offers solutions
- Outlines hardware and compute requirements
Use Cases
- Training AI models for safety alignment
- Reducing harmful outputs in AI systems
- Implementing explainable AI decisions
- Scalable AI safety training without human labels
Non-Goals
- Directly performing RLHF training
- Providing a pre-trained moderation model like LlamaGuard
- Runtime content filtering solutions like NeMo Guardrails
Workflow
- Generate initial responses using a base model.
- Critique responses against a constitution.
- Revise responses based on critiques.
- Fine-tune the model on revised responses (SL phase).
- Generate comparison pairs of responses.
- Have the AI judge which response in each pair better follows the constitution.
- Train a preference model (reward model).
- Perform RL training using RLAIF (RL phase).
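The data-generation steps above can be sketched in a few lines. This is a minimal illustration, not the skill's own code: the two-principle `CONSTITUTION`, the prompt wording, and the helper names (`critique_and_revise`, `label_pair`, `build_datasets`, `preference_loss`) are all hypothetical, and `generate` stands in for any text-in, text-out model call. Fine-tuning and the RL step itself are left out; the sketch only shows how the supervised rows and preference pairs could be produced, plus the pairwise loss commonly used to train a preference model.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical two-principle constitution, for illustration only.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest and helpful.",
]

Generate = Callable[[str], str]  # any text-in, text-out model call


def critique_and_revise(generate: Generate, prompt: str, principle: str) -> Dict[str, str]:
    """SL phase: draft a response, critique it against one principle, revise it."""
    draft = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique the response using this principle: {principle}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response so that it addresses the critique."
    )
    return {"prompt": prompt, "draft": draft, "critique": critique, "completion": revision}


def label_pair(generate: Generate, prompt: str, a: str, b: str, principle: str) -> Dict[str, str]:
    """RL phase: ask the model which of two responses better follows the principle."""
    verdict = generate(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\nAnswer with A or B only."
    ).strip().upper()
    # Simplification: anything that does not start with "A" is treated as "B".
    chosen, rejected = (a, b) if verdict.startswith("A") else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log(sigmoid(chosen - rejected))."""
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


def build_datasets(
    generate: Generate, prompts: List[str], seed: int = 0
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (supervised rows, preference rows) for a list of prompts."""
    rng = random.Random(seed)
    sft_rows, pref_rows = [], []
    for prompt in prompts:
        principle = rng.choice(CONSTITUTION)
        row = critique_and_revise(generate, prompt, principle)
        sft_rows.append({"prompt": row["prompt"], "completion": row["completion"]})
        # In a real run the two samples would come from the fine-tuned SL model.
        a, b = generate(prompt), generate(prompt)
        pref_rows.append(label_pair(generate, prompt, a, b, principle))
    return sft_rows, pref_rows
```

The supervised rows feed the SL fine-tuning step, and the `prompt`/`chosen`/`rejected` rows feed preference-model training. Passing `generate` as a plain callable keeps the sketch independent of any particular model library.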
Practices
- Safety Alignment
- AI Training
- Reinforcement Learning
- Self-Critique
- AI Feedback
Prerequisites
- Python 3.7+
- NVIDIA GPU (A100/H100 recommended)
- transformers, torch, trl libraries
- Sufficient VRAM (e.g., 40GB for 7B models)
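A quick way to check these prerequisites before starting a run is sketched below. The function name `check_environment` and the report keys are hypothetical; the sketch only tests whether the listed packages are importable and, if `torch` is present with CUDA, reports the memory of the first GPU. It does not decide whether that memory is enough for a given model.

```python
import importlib.util
import sys

REQUIRED_PACKAGES = ("transformers", "torch", "trl")
MIN_PYTHON = (3, 7)


def check_environment() -> dict:
    """Report Python version, package availability, and (if possible) GPU memory."""
    report = {"python_ok": sys.version_info >= MIN_PYTHON}
    for name in REQUIRED_PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    if report["torch"]:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            props = torch.cuda.get_device_properties(0)
            report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```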
Trust
- Info (issue attention): over the last 90 days, 17 issues are open and 4 were closed. The closure rate is below 50%, which indicates slower responsiveness to new issues.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes that the repo follows the agentskills.io format.
Quality Score
Verified, Trust Signals
Similar Extensions
Llamaguard
Score 95. Meta's 7-8B specialized moderation model for LLM input/output filtering. Covers 6 safety categories: violence/hate, sexual content, weapons, substances, self-harm, and criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, or Sagemaker. Integrates with NeMo Guardrails.
LlamaGuard
Score 75. Meta's 7-8B specialized moderation model for LLM input/output filtering. Covers 6 safety categories: violence/hate, sexual content, weapons, substances, self-harm, and criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, or Sagemaker. Integrates with NeMo Guardrails.
Fixflow
Score 100. Execute coding tasks with a strict delivery workflow: create a complete plan, implement step by step, run tests continuously, and commit after each step by default (`per_step`). Supports explicit commit-policy overrides (`final_only`, `milestone`) and optional BDD (Given/When/Then) when users request behavior-driven delivery or when requirements are unclear.
Prompt Guard
Score 100. Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2ms on GPU). Multilingual (8 languages). Deploy with HuggingFace or use batch processing for RAG security.
Gws Modelarmor Sanitize Prompt
Score 99. Google Model Armor: sanitize a user prompt through a Model Armor template.