
SRE Engineer

Skill, Verified, Active

Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.

Purpose

Helps teams build and maintain reliable, scalable production systems by providing practical tools and frameworks for SRE practices.

Features

Defines SLOs and SLIs with calculation examples.
Creates error budget policies and tracking mechanisms.
Designs incident response procedures and postmortem templates.
Develops capacity models and automation scripts.
Produces monitoring configurations and alerting rules.
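
As a rough illustration of the kind of SLI and error budget calculation the feature list refers to, here is a minimal Python sketch. It is not taken from the skill itself: the function names (`availability_sli`, `error_budget_remaining`) and the request counts are hypothetical, and it assumes a simple request-based availability SLI.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; no traffic is treated as fully available."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("counts must satisfy 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # the budget: failures the SLO tolerates
    actual_failure = 1.0 - sli          # failures actually observed
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical numbers: 500 failed requests out of one million, 99.9% SLO.
    sli = availability_sli(good_events=999_500, total_events=1_000_000)
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

With these example numbers, roughly half of the budget is spent. A latency-based SLI would need a different definition of a "good" event, such as requests served under a chosen threshold.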

Use cases

Defining meaningful SLIs and SLOs for production services.
Implementing error budget policies to balance reliability and velocity.
Setting up comprehensive monitoring and alerting for golden signals.
Automating repetitive operational tasks (toil reduction).
Designing and executing chaos engineering experiments.
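
One way an error budget policy like the one mentioned above might be turned into an alerting decision is a two-window burn-rate check. The sketch below is an assumption-laden illustration, not the skill's own output: `WindowErrors`, `burn_rate`, and `should_page` are hypothetical names, and the 14.4 threshold is a commonly cited example value for a 1-hour/5-minute window pair, not something this listing specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowErrors:
    """Observed error ratios (failed / total requests) over two look-back windows."""
    long_window_ratio: float   # e.g. the last 1 hour
    short_window_ratio: float  # e.g. the last 5 minutes


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(errors: WindowErrors, slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn at or above the (illustrative) threshold.

    Requiring the short window as well keeps the alert from firing long
    after a brief spike has already ended.
    """
    return (
        burn_rate(errors.long_window_ratio, slo_target) >= threshold
        and burn_rate(errors.short_window_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical readings against a 99.9% SLO.
    ongoing = WindowErrors(long_window_ratio=0.02, short_window_ratio=0.03)
    recovered = WindowErrors(long_window_ratio=0.02, short_window_ratio=0.0005)
    print(should_page(ongoing, slo_target=0.999))
    print(should_page(recovered, slo_target=0.999))
```

In practice the error ratios would come from a monitoring backend, and a slower-burning threshold could open a ticket instead of paging.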

Non-goals

Integrating directly with CI/CD pipelines (it provides scripts for that instead).
Replacing dedicated monitoring or alerting platforms (it provides configurations for them).
Performing active incident response (it provides frameworks and templates for it).

Practices

Site Reliability Engineering
Incident Management
Chaos Engineering
Monitoring and Alerting
Automation
Capacity Planning

Installation

First, add the marketplace:

/plugin marketplace add jeffallan/claude-skills

/plugin install claude-skills@fullstack-dev-skills

Quality score

Verified

98/100

Analyzed about 24 hours ago

Trust signals

Last commit: 13 days ago

GitHub owner: jeffallan

Stars: 9k

License: MIT


Similar extensions

Define SLO/SLI/SLA

Establish Service Level Objectives (SLO), Service Level Indicators (SLI), and Service Level Agreements (SLA) with error budget tracking, burn rate alerts, and automated reporting using Prometheus and tools like Sloth or Pyrra. Use when defining reliability targets for customer-facing services, balancing feature velocity against system reliability through error budgets, migrating from arbitrary uptime goals to data-driven metrics, or implementing Site Reliability Engineering practices.

Skill

pjt222

Slo Architect

Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.

Skill

alirezarezvani

Ops Fires

100

Production incidents dashboard. Reads ECS health, Sentry errors, CI failures. Offers to dispatch fix agents for active fires.

Skill

Lifecycle-Innovations-Limited

Observability Designer

100

Observability Designer (POWERFUL)

Skill

alirezarezvani

OpenClaw Release Maintainer

100

Prepare or verify OpenClaw stable/beta releases, changelogs, release notes, publish commands, and artifacts.

Skill

steipete

Incident Response

100

Manage active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and decision-making. Use this skill whenever the user has an active incident, a production issue, a service outage, a security incident, or needs to plan incident response procedures. Triggers on incident response, production incident, outage, service down, site down, P0, P1, severity, downtime, on-call, incident commander, status page, postmortem prep. Also triggers when something is actively broken in production and the user is figuring out what to do.

Skill

rampstackco