
Chaos Engineering

Skill · Verified · Active

Use when planning, running, or learning from chaos engineering experiments. Triggers on "chaos experiment", "fault injection", "gameday", "resilience test", "blast radius", "steady state", "abort criteria", "Chaos Toolkit", "Chaos Mesh", "Litmus", "Gremlin", "AWS FIS", or any deliberate failure-injection question. Ships an experiment designer, a blast-radius calculator, and a postmortem generator (all stdlib Python); four references covering chaos principles, experiment design, attack taxonomy, and the tooling landscape; and a /chaos-experiment slash command. Composes with feature-flags-architect (kill switches as abort triggers) and kubernetes-operator (common chaos targets).

Purpose

Helps teams plan, run, and learn from chaos engineering experiments by providing structured tools and best practices.

Features

  • Experiment design with hypothesis, steady-state, and abort criteria
  • Blast radius calculation for risk assessment
  • Postmortem generation for capturing learnings
  • Guidance on attack types and tooling selection
  • Structured workflows for single experiments and Game Days
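
To make the first feature concrete, here is a minimal sketch of how an experiment with a hypothesis, a steady-state definition, and abort criteria might be modeled in stdlib Python. The class names (`Experiment`, `AbortCriterion`), the metric names, and the thresholds are all hypothetical and chosen for illustration; they are not taken from the skill's actual experiment designer.

```python
from dataclasses import dataclass, field


@dataclass
class AbortCriterion:
    """Hypothetical abort rule: stop the experiment when a metric crosses a limit."""
    metric: str
    limit: float
    higher_is_worse: bool = True

    def breached(self, value: float) -> bool:
        # For metrics like error rate, larger is worse; for throughput, smaller is worse.
        return value > self.limit if self.higher_is_worse else value < self.limit


@dataclass
class Experiment:
    """Hypothetical experiment record: hypothesis, steady state, abort criteria."""
    hypothesis: str
    # Steady state as metric name -> (min, max) acceptable range.
    steady_state: dict[str, tuple[float, float]]
    abort_criteria: list[AbortCriterion] = field(default_factory=list)

    def in_steady_state(self, metrics: dict[str, float]) -> bool:
        # Every steady-state metric must be present and inside its range.
        return all(
            name in metrics and low <= metrics[name] <= high
            for name, (low, high) in self.steady_state.items()
        )

    def should_abort(self, metrics: dict[str, float]) -> list[str]:
        # Names of the metrics whose abort limit has been crossed.
        return [
            c.metric
            for c in self.abort_criteria
            if c.metric in metrics and c.breached(metrics[c.metric])
        ]


experiment = Experiment(
    hypothesis="Checkout stays available if one cache node is killed",
    steady_state={"error_rate": (0.0, 0.01), "p99_latency_ms": (0.0, 400.0)},
    abort_criteria=[AbortCriterion("error_rate", 0.05)],
)
baseline = {"error_rate": 0.002, "p99_latency_ms": 210.0}
during_fault = {"error_rate": 0.08, "p99_latency_ms": 380.0}
```

In this sketch, the baseline is checked with `in_steady_state` before any fault is injected, and `should_abort` is polled during the experiment; a non-empty result would be the signal to roll back, for example by flipping a kill switch.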

Use Cases

  • Planning a chaos experiment to identify system weaknesses
  • Calculating the blast radius before running an experiment
  • Generating a structured postmortem after an experiment
  • Choosing the appropriate chaos engineering tool for a given stack
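
As an illustration of the blast-radius use case, the sketch below estimates what fraction of total traffic a fault could touch. It assumes traffic is spread evenly across instances, which real load balancers may not guarantee. The function names and the tier thresholds (5% and 25%) are hypothetical and do not reflect the formula used by the skill's own calculator.

```python
def blast_radius(targeted_instances: int, total_instances: int, traffic_share: float) -> float:
    """Estimate the fraction of all traffic exposed to the injected fault.

    traffic_share is the fraction (0..1) of overall traffic handled by the
    targeted service. Assumes even load distribution across instances.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if not 0 <= targeted_instances <= total_instances:
        raise ValueError("targeted_instances must be between 0 and total_instances")
    if not 0.0 <= traffic_share <= 1.0:
        raise ValueError("traffic_share must be between 0 and 1")
    return (targeted_instances / total_instances) * traffic_share


def risk_tier(radius: float, low: float = 0.05, high: float = 0.25) -> str:
    """Map a blast radius to an illustrative risk tier (thresholds are assumptions)."""
    if radius < low:
        return "low"
    if radius < high:
        return "medium"
    return "high"


# Killing 2 of 20 instances of a service that carries 40% of traffic.
small = blast_radius(2, 20, 0.4)
# Killing half the instances of a service that carries all traffic.
large = blast_radius(5, 10, 1.0)
```

A small radius like the first case is a reasonable starting point; the usual practice is to widen the radius only after the narrower experiment has confirmed the hypothesis.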

Non-Goals

  • General incident response
  • Threat hunting or red-teaming exercises
  • Performance load testing
  • Production debugging after an outage

Installation

First, add the marketplace, then install the plugin:

/plugin marketplace add alirezarezvani/claude-skills
/plugin install engineering@claude-code-skills

Quality Score

Verified
99/100
Analyzed about 20 hours ago

Trust Signals

Last commit: about 23 hours ago
Stars: 14.6k
License: MIT

Similar Extensions

Run Chaos Experiment

95

Design and execute chaos engineering experiments using Litmus or Chaos Mesh. Test system resilience through controlled fault injection, validate hypothesis-driven tests, and improve failure recovery. Use before major product launches, after architecture changes to validate resilience, during GameDays or disaster recovery drills, to validate assumptions about failure modes, or as part of an SRE maturity program.

Skill
pjt222

Chaos Engineer

99

Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.

Skill
jeffallan

Define SLO/SLI/SLA

99

Establish Service Level Objectives (SLO), Service Level Indicators (SLI), and Service Level Agreements (SLA) with error budget tracking, burn rate alerts, and automated reporting using Prometheus and tools like Sloth or Pyrra. Use when defining reliability targets for customer-facing services, balancing feature velocity against system reliability through error budgets, migrating from arbitrary uptime goals to data-driven metrics, or implementing Site Reliability Engineering practices.

Skill
pjt222

Slo Architect

99

Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.

Skill
alirezarezvani

SRE Engineer

98

Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.

Skill
jeffallan

Slo Implementation

97

Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.

Skill
wshobson

© 2025 SkillRepo · Find the right skill, skip the noise.