Observability Designer (POWERFUL)
Skill, verified, active
Helps engineers and SREs design robust, scalable, and cost-effective observability strategies for their production systems.
Features
- Generates SLI/SLO frameworks with error budgets and burn rate alerts
- Analyzes and optimizes existing alert configurations
- Creates dashboard specifications tailored to specific roles and service types
- Follows observability best practices for metrics, logs, and traces
- Provides recommendations for monitoring integration and implementation
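The error-budget and burn-rate terms in the list above reduce to a little arithmetic. The following is a minimal sketch of that arithmetic only, not the skill's own implementation; the function names and the 30-day default window are assumptions made for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of total unavailability the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo_target):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo_target)


def budget_consumed(burn, alert_window_hours, window_days=30):
    """Fraction of the whole budget spent if `burn` persists for the alert window."""
    return burn * alert_window_hours / (window_days * 24)


if __name__ == "__main__":
    # Hypothetical service with a 99.9% availability target.
    target = 0.999
    print("budget (min):", error_budget_minutes(target))
    print("burn rate at 1.44% errors:", burn_rate(0.0144, target))
    print("budget share used in 1h at 14.4x:", budget_consumed(14.4, 1))
```

With a 99.9% target over 30 days, a burn rate of 14.4 sustained for one hour spends about 2% of the budget, which is why that pairing is a common paging threshold.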
Use cases
- When designing observability for a new service
- When optimizing existing alerting to reduce noise
- When creating comprehensive monitoring dashboards for different roles
- When establishing or refining SLOs and error budget policies
Non-goals
- Implementing or deploying monitoring infrastructure
- Directly integrating with specific cloud provider monitoring services
- Writing custom metric exporters or agents
Workflow
- Define service characteristics (type, criticality, dependencies).
- Use `slo_designer.py` to generate SLIs, SLOs, error budgets, and alerts.
- Use `alert_optimizer.py` to analyze and improve existing alerts.
- Use `dashboard_generator.py` to create monitoring dashboards tailored to roles and services.
- Integrate generated configurations into the monitoring stack (Prometheus, Grafana, Alertmanager).
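As a rough illustration of the first and last steps above, the sketch below describes a service and renders one multi-window burn-rate alert in the shape of a Prometheus alerting rule. It does not reproduce the output or interface of `slo_designer.py`, which the listing does not document; `ServiceProfile`, `burn_rate_rule`, the `http_requests_total` metric, and its `service` and `code` labels are all assumptions.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceProfile:
    """Hypothetical container for the service characteristics in step one."""
    name: str
    service_type: str
    criticality: str
    dependencies: List[str] = field(default_factory=list)
    slo_target: float = 0.999


def error_ratio_expr(service, window):
    """PromQL for the 5xx ratio; metric and label names are assumed."""
    return (
        'sum(rate(http_requests_total{{service="{s}",code=~"5.."}}[{w}]))'
        ' / sum(rate(http_requests_total{{service="{s}"}}[{w}]))'
    ).format(s=service, w=window)


def burn_rate_rule(profile, burn=14.4, long_window="1h", short_window="5m"):
    """Build one alerting rule that fires only when both windows burn too fast."""
    threshold = burn * (1.0 - profile.slo_target)
    expr = "({long}) > {t:.6f} and ({short}) > {t:.6f}".format(
        long=error_ratio_expr(profile.name, long_window),
        short=error_ratio_expr(profile.name, short_window),
        t=threshold,
    )
    return {
        "alert": "{}HighErrorBudgetBurn".format(profile.name.capitalize()),
        "expr": expr,
        "labels": {
            # Assumed routing policy: only high-criticality services page.
            "severity": "page" if profile.criticality == "high" else "ticket",
            "service": profile.name,
        },
        "annotations": {
            "summary": "{} is burning its error budget at {}x".format(profile.name, burn),
        },
    }


if __name__ == "__main__":
    profile = ServiceProfile("checkout", "api", "high", ["payments-db"])
    group = {"groups": [{"name": "slo-burn-rate", "rules": [burn_rate_rule(profile)]}]}
    print(json.dumps(group, indent=2))
```

The short window is there so the alert resets quickly once the error rate recovers. Convert the printed structure to YAML with a serializer of your choice before loading it into Prometheus.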
Practices
- SLO Design
- Alert Optimization
- Dashboard Design
- Monitoring Best Practices
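For the alert optimization practice, a few mechanical checks catch common sources of noise. The sketch below is a set of assumed heuristics, not the logic of `alert_optimizer.py`; it expects rules as dictionaries shaped like Prometheus alerting rules, and `lint_alert_rules` is a hypothetical name.

```python
from collections import defaultdict


def lint_alert_rules(rules):
    """Return human-readable findings for a list of alerting-rule dicts."""
    findings = []
    names_by_expr = defaultdict(list)

    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        # Without a 'for' duration, a rule fires on the first bad evaluation.
        if not rule.get("for"):
            findings.append("{}: no 'for' duration, may fire on brief spikes".format(name))
        # A severity label is a common basis for routing decisions.
        if "severity" not in rule.get("labels", {}):
            findings.append("{}: missing 'severity' label for routing".format(name))
        # Collapse whitespace so formatting differences do not hide duplicates.
        normalized = " ".join(rule.get("expr", "").split())
        names_by_expr[normalized].append(name)

    for expr, names in names_by_expr.items():
        if expr and len(names) > 1:
            findings.append("duplicate expression shared by: {}".format(", ".join(names)))
    return findings


if __name__ == "__main__":
    sample = [
        {"alert": "HighCPU", "expr": "cpu_usage > 0.9", "labels": {"severity": "page"}},
        {"alert": "CPUHot", "expr": "cpu_usage  >  0.9", "for": "5m", "labels": {}},
    ]
    for finding in lint_alert_rules(sample):
        print(finding)
```

Checks like these only flag candidates; whether two alerts are truly redundant still needs a human decision.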
Prerequisites
- Python 3.7+
Installation
Add the Marketplace first:
/plugin marketplace add alirezarezvani/claude-skills
/plugin install engineering@claude-code-skills
Quality score
Verified
Similar extensions
Define SLO/SLI/SLA
(99) Establish Service Level Objectives (SLO), Service Level Indicators (SLI), and Service Level Agreements (SLA) with error budget tracking, burn rate alerts, and automated reporting using Prometheus and tools like Sloth or Pyrra. Use when defining reliability targets for customer-facing services, balancing feature velocity against system reliability through error budgets, migrating from arbitrary uptime goals to data-driven metrics, or implementing Site Reliability Engineering practices.
Slo Architect
(99) Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships an SLO designer, an error-budget calculator with multi-window burn-rate thresholds, and an SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). Includes 4 references: SLO principles, SLI design, error budget math, and composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill; it covers the SLO discipline specifically.
Azure Monitor Query Py
(100) Azure Monitor Query SDK for Python. Use for querying Log Analytics workspaces and Azure Monitor metrics. Triggers: "azure-monitor-query", "LogsQueryClient", "MetricsQueryClient", "Log Analytics", "Kusto queries", "Azure metrics".
Grafana Dashboards
(99) Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
Monitor Stream
(99) Stream live swarm events using the Monitor tool for real-time observability.
LangSmith Observability
(99) LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.