
Evaluation

Skill Verified Active

Part of: Agent Skills for Context Engineering

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.

Purpose

To enable systematic evaluation of AI agent performance and quality, so that agents can be checked against defined standards.

Features

Builds multi-dimensional evaluation rubrics
Implements LLM-as-judge methodologies
Designs test sets stratified by complexity
Provides a framework for continuous evaluation pipelines
Includes example code for evaluation and monitoring
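
As a rough illustration of what a multi-dimensional rubric can look like, here is a minimal sketch that combines per-dimension scores into one weighted score. It is not taken from the skill: the `Dimension` class, the `RUBRIC` entries, their weights, and the `weighted_score` function are all hypothetical names chosen for this example, and scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str      # hypothetical dimension label
    weight: float  # relative importance; normalized when scores are combined


# Hypothetical rubric: names and weights are illustrative, not from the skill.
RUBRIC = [
    Dimension("accuracy", 0.5),
    Dimension("completeness", 0.3),
    Dimension("efficiency", 0.2),
]


def weighted_score(scores, rubric=RUBRIC):
    """Combine per-dimension scores (each in [0, 1]) into one weighted score."""
    total_weight = sum(d.weight for d in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    for d in rubric:
        if d.name not in scores:
            raise KeyError(f"missing score for dimension: {d.name}")
        if not 0.0 <= scores[d.name] <= 1.0:
            raise ValueError(f"score for {d.name} must be in [0, 1]")
    # Weighted average, so the result stays in [0, 1] even if the weights
    # do not sum to exactly 1.
    return sum(d.weight * scores[d.name] for d in rubric) / total_weight


overall = weighted_score({"accuracy": 1.0, "completeness": 0.5, "efficiency": 0.0})
```

In an LLM-as-judge setup, the per-dimension scores would typically come from a judge model rather than being supplied by hand; the aggregation step stays the same.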

Use cases

When testing agent performance systematically
When validating context engineering choices
When measuring improvements over time
When building quality gates for agent pipelines
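
One possible shape for a quality gate over a test set stratified by complexity is sketched below. This is an assumption about how such a gate might work, not the skill's own implementation: the function names, the stratum labels, and the thresholds are hypothetical.

```python
from collections import defaultdict


def pass_rates_by_stratum(results):
    """Map each stratum to its pass rate, given (stratum, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        if passed:
            passes[stratum] += 1
    return {s: passes[s] / totals[s] for s in totals}


def quality_gate(results, thresholds):
    """Return (ok, failures); ok is True only if every stratum meets its threshold."""
    rates = pass_rates_by_stratum(results)
    failures = {}
    for stratum, minimum in thresholds.items():
        rate = rates.get(stratum)
        # A stratum with no test cases fails the gate instead of passing silently.
        if rate is None or rate < minimum:
            failures[stratum] = rate
    return (not failures, failures)


# Hypothetical run: two simple cases pass, one of two complex cases fails.
results = [
    ("simple", True),
    ("simple", True),
    ("complex", True),
    ("complex", False),
]
ok, failures = quality_gate(results, {"simple": 0.9, "complex": 0.7})
```

Checking each stratum separately keeps a high pass rate on simple cases from masking regressions on complex ones.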

Non-goals

Performing the evaluation itself
Automating agent development or tuning
Replacing human oversight entirely
Providing a specific agent to be evaluated

Practices

Evaluation methodology
Test design
Quality measurement
LLM-as-judge implementation

Installation

Add the marketplace first

/plugin marketplace add muratcankoylan/Agent-Skills-for-Context-Engineering

/plugin install Agent-Skills-for-Context-Engineering@context-engineering-marketplace

Quality score

Verified

98/100

Analyzed about 20 hours ago

Trust signals

Last commit: about 1 month ago

GitHub owner: muratcankoylan

Stars: 15.6k

License: MIT


Similar extensions

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill

muratcankoylan

Design Workflow

100

Anti-AI-generic design guidelines. Use when creating UI prototypes, reviewing designs for generic AI patterns, or setting up a project design system.

Skill

spartan-stratos

Codex PR Review

100

Reviews pull requests in Drupal 11 (or other) projects following the Codex methodology (business logic, edge cases of hooks/queries, security, performance, completeness). Generates a .md report in the detected IDE folder (.antigravity/, .cursor/, .vscode/, or docs/) with findings by severity and actionable fixes. Use when the user requests "Codex review", "PR review", or "review PR".

Skill

j4rk0r

Test A2a Interop

100

Test A2A interoperability between agents by validating Agent Card conformance, exercising all task lifecycle states, and verifying streaming and error handling. Use when verifying a new A2A server implementation before deployment, validating interoperability between two or more A2A agents, running conformance tests in CI/CD for A2A services, debugging failures in multi-agent A2A workflows, or certifying that an agent meets A2A protocol requirements for a registry.

Skill

pjt222

Run Puzzle Tests

100

Run the jigsawR test suite via WSL R execution. Supports full suite, filtered by pattern, or single file. Interprets pass/fail/skip counts and identifies failing tests. Never uses --vanilla flag (renv needs .Rprofile for activation). Use after modifying any R source code, after adding a new puzzle type or feature, before committing changes to verify nothing is broken, or when debugging a specific test failure.

Skill

pjt222

Telegram Crabbox E2e Proof

100

Use when reviewing, reproducing, or proving OpenClaw Telegram behavior with a real Telegram user on Crabbox, including PR review workflows that need an agent-controlled Telegram Desktop recording, TDLib user-driver commands, Convex-leased credentials, WebVNC observation, and motion-trimmed artifacts.

Skill

steipete