
Define SLO/SLI/SLA

Skill · Verified · Active
Part of: Agent Almanac

Establish Service Level Objectives (SLO), Service Level Indicators (SLI), and Service Level Agreements (SLA) with error budget tracking, burn rate alerts, and automated reporting using Prometheus and tools like Sloth or Pyrra. Use when defining reliability targets for customer-facing services, balancing feature velocity against system reliability through error budgets, migrating from arbitrary uptime goals to data-driven metrics, or implementing Site Reliability Engineering practices.

Purpose

Define and manage service reliability targets, balancing feature development with system stability through data-driven metrics and error budgets.

Features

  • Define SLO, SLI, and SLA hierarchy
  • Select and measure key SLIs (Four Golden Signals)
  • Set realistic SLO targets and calculate error budgets
  • Implement SLO monitoring with Sloth and Prometheus
  • Visualize compliance and budget with Grafana dashboards
  • Establish error budget policies and automate enforcement
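The error budget calculation named above can be sketched as follows. This is a minimal illustration, not code taken from the skill: the class name `ErrorBudget`, its fields, and the example numbers are assumptions made for this sketch. It shows both the time-based view (minutes of outage the window tolerates) and the event-based view (share of the budget still unspent).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: error budget math for a single SLO."""

    slo_target: float  # e.g. 0.999 means "99.9% of events must be good"
    window_days: int   # length of the SLO window, e.g. 30

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no budget and would divide by zero below.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    @property
    def budget_fraction(self) -> float:
        """Fraction of events (or time) allowed to fail within the window."""
        return 1.0 - self.slo_target

    def allowed_downtime_minutes(self) -> float:
        """Time-based view: minutes of full outage the window tolerates."""
        return self.budget_fraction * self.window_days * 24 * 60

    def remaining(self, good_events: int, total_events: int) -> float:
        """Event-based view: share of the budget still unspent.

        1.0 means untouched, 0.0 means exhausted, negative means overspent.
        """
        if total_events == 0:
            return 1.0  # no traffic, so nothing has been spent
        allowed_bad = self.budget_fraction * total_events
        actual_bad = total_events - good_events
        return 1.0 - actual_bad / allowed_bad


budget = ErrorBudget(slo_target=0.999, window_days=30)
print(round(budget.allowed_downtime_minutes(), 1))      # about 43.2 minutes
print(round(budget.remaining(999_500, 1_000_000), 3))   # about half the budget left
```

A 99.9% target over 30 days leaves roughly 43 minutes of tolerated outage, which is why each additional "nine" in a target is a significant commitment.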

Use cases

  • Defining reliability targets for customer-facing services
  • Balancing feature velocity against system reliability
  • Migrating from arbitrary uptime goals to data-driven metrics
  • Implementing Site Reliability Engineering (SRE) practices
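To make the trade-off between feature velocity and reliability concrete, an error budget policy can be expressed as a small decision function. The sketch below is purely illustrative: the function name, the 25% threshold, and the action strings are hypothetical and would be agreed with stakeholders in practice, not taken from this skill.

```python
def budget_policy_action(remaining_budget: float) -> str:
    """Map the unspent share of the error budget to a release policy.

    remaining_budget: 1.0 = untouched, 0.0 = exhausted, negative = overspent.
    The thresholds are assumptions made for this example.
    """
    if remaining_budget <= 0.0:
        return "freeze feature releases; prioritize reliability work"
    if remaining_budget < 0.25:
        return "slow down; require extra review for risky changes"
    return "normal feature velocity"


for share in (0.8, 0.1, -0.2):
    print(share, "->", budget_policy_action(share))
```

Writing the policy down as explicit thresholds keeps the release decision tied to measured data instead of case-by-case negotiation.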

Non-goals

  • Providing a fully automated SLO generation tool
  • Replacing general monitoring or alerting systems
  • Defining specific SLIs for every possible service type

Workflow

  1. Understand SLI, SLO, and SLA Hierarchy
  2. Select Appropriate SLIs
  3. Set SLO Targets and Time Windows
  4. Implement SLO Monitoring with Sloth
  5. Build Error Budget Dashboards
  6. Establish Error Budget Policy
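The monitoring step typically relies on burn-rate alerts. The sketch below shows only the arithmetic behind them, assuming the widely cited multi-window scheme for a 30-day window described in the Google SRE Workbook (thresholds 14.4, 6, and 1). Whether this skill configures the same values is not stated here, and the names `BurnRateAlert`, `burn_rate`, and `firing_alerts` are hypothetical. In a real setup the error ratios would come from Prometheus recording rules, not a hand-written dictionary.

```python
from typing import Dict, List, NamedTuple


class BurnRateAlert(NamedTuple):
    long_window: str   # window that shows a significant share of budget was spent
    short_window: str  # window that confirms the problem is still ongoing
    threshold: float   # burn rate that must be exceeded in both windows
    severity: str


# Assumed thresholds for a 30-day SLO window (multi-window, multi-burn-rate scheme).
ALERTS = [
    BurnRateAlert("1h", "5m", 14.4, "page"),    # about 2% of the budget in 1 hour
    BurnRateAlert("6h", "30m", 6.0, "page"),    # about 5% of the budget in 6 hours
    BurnRateAlert("3d", "6h", 1.0, "ticket"),   # about 10% of the budget in 3 days
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is consumed; 1.0 spends it exactly over the SLO window."""
    return error_ratio / (1.0 - slo_target)


def firing_alerts(error_ratios: Dict[str, float], slo_target: float) -> List[BurnRateAlert]:
    """Return alerts whose long AND short windows both exceed the threshold."""
    fired = []
    for alert in ALERTS:
        long_rate = burn_rate(error_ratios[alert.long_window], slo_target)
        short_rate = burn_rate(error_ratios[alert.short_window], slo_target)
        if long_rate > alert.threshold and short_rate > alert.threshold:
            fired.append(alert)
    return fired


# Hypothetical observed error ratios per lookback window for a 99.9% SLO.
observed = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
for alert in firing_alerts(observed, slo_target=0.999):
    print(alert.severity, alert.long_window, alert.threshold)
```

Requiring both windows to exceed the threshold keeps the alert from firing on stale incidents: the long window proves the budget impact is real, and the short window proves the errors are still happening.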

Practices

  • Service Level Management
  • Observability
  • Site Reliability Engineering

Prerequisites

  • Prometheus for metrics collection
  • Tools like Sloth or Pyrra for SLO generation
  • Grafana for dashboarding

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality score

Verified
99/100
Analyzed about 17 hours ago

Trust signals

Last commit: 1 day ago
Stars: 14
License: MIT

Similar extensions

Slo Architect

99

Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.

Skill
alirezarezvani

Observability Designer

100

Observability Designer (POWERFUL)

Skill
alirezarezvani

SRE Engineer

98

Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.

Skill
jeffallan

Grafana Dashboards

99

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

Skill
wshobson

Plan Capacity

99

Perform capacity planning using historical metrics and growth models. Use predict_linear for forecasting, identify resource constraints, calculate headroom, and recommend scaling actions before saturation. Use before seasonal traffic spikes or product launches, during quarterly capacity reviews, when resource utilization trends upward, or before budget planning cycles.

Skill
pjt222

Chaos Engineering

99

Use when planning, running, or learning from chaos engineering experiments. Triggers on "chaos experiment", "fault injection", "gameday", "resilience test", "blast radius", "steady state", "abort criteria", "Chaos Toolkit", "Chaos Mesh", "Litmus", "Gremlin", "AWS FIS", or any deliberate failure-injection question. Ships experiment designer, blast-radius calculator, and postmortem generator (all stdlib Python), 4 references on chaos principles + experiment design + attack taxonomy + tooling landscape, and a /chaos-experiment slash command. Composes with feature-flags-architect (kill switches as abort triggers) and kubernetes-operator (common chaos targets).

Skill
alirezarezvani