Constitutional AI

Skill · Verified · Active

Anthropic's method for training a harmless AI assistant through self-improvement. It has two phases: supervised learning on self-critiqued and revised responses, then RLAIF (reinforcement learning from AI feedback). Use it for safety alignment and for reducing harmful outputs without human labels. Anthropic uses this method in training Claude.

Purpose

Explains Constitutional AI and gives a practical implementation guide, so that users can train AI models for safety alignment and reduce harmful outputs.

Features

  • Explains Constitutional AI methodology
  • Details two-phase training approach (SL and RLAIF)
  • Provides Python code examples for each phase
  • Addresses common issues and offers solutions
  • Outlines hardware and compute requirements

Use cases

  • Training AI models for safety alignment
  • Reducing harmful outputs in AI systems
  • Implementing explainable AI decisions
  • Scalable AI safety training without human labels
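
Training without human labels depends on a model, rather than a person, deciding which of two responses better follows the constitution. The sketch below shows one way to produce such a label. It is an illustration under stated assumptions, not the skill's own code: `label_pair`, the `judge` callable, and the prompt wording are hypothetical, and the `prompt`/`chosen`/`rejected` record layout is only a common convention for preference data.

```python
from typing import Callable, Dict

# Assumption: `judge` wraps any text-generation model (prompt in, text out).
Judge = Callable[[str], str]


def label_pair(prompt: str, response_a: str, response_b: str,
               principle: str, judge: Judge) -> Dict[str, str]:
    """Ask the judge model which response better follows one principle."""
    question = (
        f"Consider this principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    answer = judge(question).strip().upper()
    if answer.startswith("A"):
        chosen, rejected = response_a, response_b
    elif answer.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        # Refuse to guess when the judge does not give a usable answer.
        raise ValueError(f"Unparseable judge answer: {answer!r}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    # Stub judge that always prefers (A); replace with a real model call.
    record = label_pair("How do I pick a lock?", "I can't help with that.",
                        "Sure, first you...", "Prefer the less harmful reply.",
                        judge=lambda _question: "A")
    print(record["chosen"])
```

Records in this shape can then be used to train the preference (reward) model. Raising on an unparseable answer is a design choice of this sketch; dropping such pairs silently would be another reasonable option.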

Non-goals

  • Directly performing RLHF training
  • Providing a pre-trained moderation model like LlamaGuard
  • Runtime content filtering solutions like NeMo Guardrails

Workflow

  1. Generate initial responses using a base model.
  2. Critique responses against a constitution.
  3. Revise responses based on critiques.
  4. Fine-tune the model on revised responses (SL phase).
  5. Generate comparison pairs of responses.
  6. Evaluate AI preferences based on the constitution.
  7. Train a preference model (reward model).
  8. Perform RL training using RLAIF (RL phase).
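
Steps 1 to 3 of the workflow above can be sketched as a small loop that produces the training records for step 4. This is a minimal illustration, not the skill's implementation: the two-principle `CONSTITUTION`, the prompt templates, and the `generate` callable are all hypothetical, and applying every principle in order is a simplification (a real pipeline might sample one principle per revision).

```python
from typing import Callable, Dict, List

# Hypothetical constitution for illustration; real ones are much longer.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest and helpful.",
]

# Assumption: `generate` wraps any text-generation model (prompt in, text out).
Generate = Callable[[str], str]


def critique_and_revise(prompt: str, generate: Generate,
                        principles: List[str]) -> Dict[str, str]:
    """Run steps 1-3 for one prompt and return a supervised training record."""
    response = generate(prompt)  # step 1: initial response
    for principle in principles:
        # Step 2: ask the model to critique its own response.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Step 3: ask the model to revise the response using the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # The final revision becomes the fine-tuning target for step 4.
    return {"prompt": prompt, "completion": response}


def build_sl_dataset(prompts: List[str], generate: Generate,
                     principles: List[str] = CONSTITUTION) -> List[Dict[str, str]]:
    """Apply the critique-revision loop to every prompt."""
    return [critique_and_revise(p, generate, principles) for p in prompts]


if __name__ == "__main__":
    # Stub model that echoes the last line of its input; swap in a real model.
    dataset = build_sl_dataset(["Tell me a secret."],
                               generate=lambda text: text.splitlines()[-1])
    print(dataset[0]["prompt"])
```

Each prompt costs one generation for the initial response plus two per principle, so the number of principles applied per prompt is the main lever on data-generation cost.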

Practices

  • Safety Alignment
  • AI Training
  • Reinforcement Learning
  • Self-Critique
  • AI Feedback

Prerequisites

  • Python 3.7+
  • NVIDIA GPU (A100/H100 recommended)
  • transformers, torch, trl libraries
  • Sufficient VRAM (e.g., 40GB for 7B models)
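
As a rough sanity check on the VRAM figure, the arithmetic below estimates memory from parameter count alone. It is a back-of-the-envelope sketch: `memory_gb` is a hypothetical helper, the bytes-per-parameter values are common rules of thumb (2 bytes per parameter for 16-bit weights, roughly 16 bytes per parameter for full fine-tuning with Adam in mixed precision), and activations, batch size, and framework overhead are ignored.

```python
def memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Estimate memory in GB (10^9 bytes) from parameter count alone."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9  # a 7B-parameter model

# 16-bit weights only: 2 bytes per parameter.
weights_only = memory_gb(N_PARAMS, 2)

# Rule of thumb for full fine-tuning with Adam in mixed precision:
# 16-bit weights (2) + 16-bit gradients (2) + 32-bit master weights (4)
# + two 32-bit Adam moment buffers (4 + 4) = 16 bytes per parameter.
full_finetune = memory_gb(N_PARAMS, 16)

print(f"weights only:   {weights_only:.1f} GB")   # 14.0 GB
print(f"full fine-tune: {full_finetune:.1f} GB")  # 112.0 GB
```

Under these assumptions a 40GB card holds the 16-bit weights of a 7B model with room to spare, while full-parameter training would need memory-saving techniques such as parameter-efficient fine-tuning, quantization, or sharding across GPUs.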

Trust

  • Info (Issues: Attention): openIssues90d is 17 and closedIssues90d is 4. The closure rate is below 50%, indicating slower responsiveness to new issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js installed locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality score

Verified
95/100
Analyzed about 19 hours ago

Trust signals

Last commit: about 21 hours ago
Stars: 27.2k
License: MIT

Similar extensions

Constitutional AI

98

Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.

Skill
Orchestra-Research

Llamaguard

95

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

Skill
Orchestra-Research

LlamaGuard

75

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

Skill
davila7

Fixflow

100

Execute coding tasks with a strict delivery workflow: create a complete plan, implement step by step, run tests continuously, and by default commit after every step (`per_step`). Supports explicit commit-policy overrides (`final_only`, `milestone`) and optional BDD (Given/When/Then) when users request behavior-driven delivery or requirements are unclear.

Skill
majiayu000

Prompt Guard

100

Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2ms GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.

Skill
Orchestra-Research

Gws Modelarmor Sanitize Prompt

99

Google Model Armor: Sanitize a user prompt through a Model Armor template.

Skill
googleworkspace