Service Mesh Observability

Skill Verified Active

Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Purpose

To enable users to set up robust observability for their service meshes, facilitating debugging of performance issues and the implementation of service-level objectives.

Features

Implement distributed tracing for service meshes
Set up service mesh metrics and dashboards
Provide templates for Istio, Linkerd, Prometheus, Grafana, Jaeger
Guide debugging of latency and error issues
Assist in defining SLOs for service communication
Visualize service dependencies and topology

Use Cases

When setting up mesh monitoring and dashboards
When debugging latency or error issues within a service mesh
When defining and implementing SLOs for inter-service communication
When visualizing service dependencies and network topology
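
As an illustration of the SLO use case, here is a minimal sketch of how an availability SLI and an error-budget burn rate could be derived from request counts for one service-to-service edge. It is not taken from the skill itself. The names RequestCounts, availability_sli, and burn_rate are hypothetical, and the counts are assumed to come from whatever request metrics the mesh exports (for example, totals split by response code).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Request totals for one service-to-service edge over a time window.

    Hypothetical container: the numbers are assumed to be read from the
    mesh's request metrics by some other piece of code.
    """
    total: int
    errors: int  # e.g. responses with a 5xx status code


def availability_sli(counts: RequestCounts) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if counts.total == 0:
        return 1.0
    return (counts.total - counts.errors) / counts.total


def burn_rate(counts: RequestCounts, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the error budget is being spent exactly as fast as the SLO
    permits; values above 1.0 mean the budget would run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = 1.0 - availability_sli(counts)
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # Example window: 10,000 requests, 25 of them failed.
    window = RequestCounts(total=10_000, errors=25)
    print(f"SLI: {availability_sli(window):.4f}")
    print(f"Burn rate vs 99.9% SLO: {burn_rate(window, 0.999):.2f}")
```

Treating a window with no traffic as fully available is a design choice in this sketch; some teams prefer to report no data instead.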

Non-Goals

Implementing the actual observability backend infrastructure (focus is on configuration and integration)
General-purpose monitoring outside of service meshes
Deep dives into specific tool internals beyond their integration with service meshes

Installation

First, add the marketplace:

/plugin marketplace add wshobson/agents

Then install the plugin:

/plugin install cloud-infrastructure@claude-code-workflows

Quality Score

Verified

98/100

Analyzed about 11 hours ago

Trust Signals

Last commit: 2 days ago

GitHub owner: wshobson

Stars: 35.3k

License: MIT

Website: sethhobson.com


Similar Extensions

Setup Service Mesh

Deploy and configure a service mesh (Istio or Linkerd) to enable secure service-to-service communication, traffic management, observability, and policy enforcement in Kubernetes clusters. Covers installation, mTLS configuration, traffic routing, circuit breaking, and integration with monitoring tools. Use when microservices need encrypted service-to-service communication, fine-grained traffic control for canary or A/B deployments, observability across all service interactions without application changes, or consistent circuit breaking and retry policies.

Skill

pjt222

Grafana Dashboards

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

Skill

wshobson

Instrument Distributed Tracing

Instrument applications with OpenTelemetry for distributed tracing, including auto and manual instrumentation, context propagation, sampling strategies, and integration with Jaeger or Tempo. Use when debugging latency issues in distributed systems, understanding request flow across microservices, correlating traces with logs and metrics for root cause analysis, measuring end-to-end latency, or migrating from legacy tracing systems to OpenTelemetry.

Skill

pjt222

Plan Capacity

Perform capacity planning using historical metrics and growth models. Use predict_linear for forecasting, identify resource constraints, calculate headroom, and recommend scaling actions before saturation. Use before seasonal traffic spikes or product launches, during quarterly capacity reviews, when resource utilization trends upward, or before budget planning cycles.

Skill

pjt222

Define SLO/SLI/SLA

Establish Service Level Objectives (SLO), Service Level Indicators (SLI), and Service Level Agreements (SLA) with error budget tracking, burn rate alerts, and automated reporting using Prometheus and tools like Sloth or Pyrra. Use when defining reliability targets for customer-facing services, balancing feature velocity against system reliability through error budgets, migrating from arbitrary uptime goals to data-driven metrics, or implementing Site Reliability Engineering practices.

Skill

pjt222

LangSmith Observability

LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.

Skill

Orchestra-Research