
Evaluation

Skill · Verified · Active

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.

Purpose

To enable systematic evaluation of AI agent performance and quality, ensuring agents meet defined standards and drive desired outcomes.

Features

  • Builds multi-dimensional evaluation rubrics
  • Implements LLM-as-judge methodologies
  • Designs test sets stratified by complexity
  • Provides framework for continuous evaluation pipelines
  • Includes example code for evaluation and monitoring
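The feature list above centers on weighted, multi-dimensional LLM-as-judge scoring. A minimal sketch of that idea follows; the rubric dimensions, weights, and the `judge` stub are all illustrative assumptions, not this skill's actual schema (a real implementation would prompt a model to score each dimension):

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    weight: float  # relative importance; weights are assumed to sum to 1.0


# Hypothetical rubric — dimension names and weights are illustrative only.
RUBRIC = [
    Dimension("correctness", 0.4),
    Dimension("completeness", 0.3),
    Dimension("tool_use", 0.2),
    Dimension("conciseness", 0.1),
]


def judge(transcript: str, dimension: Dimension) -> float:
    """Stand-in for an LLM judge call.

    A real implementation would prompt a model with the transcript and a
    dimension-specific rubric and parse a 0-1 score from its reply. This toy
    heuristic keeps the sketch runnable without an API key.
    """
    return min(1.0, len(transcript) / 100)


def evaluate(transcript: str) -> float:
    """Aggregate per-dimension judge scores into one weighted score."""
    return sum(d.weight * judge(transcript, d) for d in RUBRIC)


overall = evaluate("Agent answered correctly and cited both sources.")
```

The weighted-sum aggregation is one common choice; a production pipeline might instead report each dimension separately so quality gates can fail on a single weak dimension.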

Use cases

  • When testing agent performance systematically
  • When validating context engineering choices
  • When measuring improvements over time
  • When building quality gates for agent pipelines

Non-goals

  • Performing the evaluation itself
  • Automating agent development or tuning
  • Replacing human oversight entirely
  • Providing a specific agent to be evaluated

Practices

  • Evaluation methodology
  • Test design
  • Quality measurement
  • LLM-as-judge implementation
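One practice listed above, test design stratified by complexity, can be sketched briefly. The tier names and pass/fail data below are invented for illustration; the point is that per-tier pass rates let a quality gate catch regressions that an aggregate number would hide:

```python
from collections import defaultdict

# Hypothetical test cases — ids, tiers, and results are illustrative only.
CASES = [
    {"id": "t1", "complexity": "simple", "passed": True},
    {"id": "t2", "complexity": "simple", "passed": True},
    {"id": "t3", "complexity": "multi_step", "passed": False},
    {"id": "t4", "complexity": "multi_step", "passed": True},
    {"id": "t5", "complexity": "adversarial", "passed": False},
]


def pass_rate_by_tier(cases):
    """Compute the pass rate for each complexity tier."""
    tally = defaultdict(lambda: [0, 0])  # tier -> [passed, total]
    for case in cases:
        tally[case["complexity"]][1] += 1
        if case["passed"]:
            tally[case["complexity"]][0] += 1
    return {tier: passed / total for tier, (passed, total) in tally.items()}


rates = pass_rate_by_tier(CASES)
```

A gate could then require, say, 100% on the simple tier while tolerating a lower bar on adversarial cases, instead of one blended threshold.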

Installation

First, add the marketplace:

/plugin marketplace add muratcankoylan/Agent-Skills-for-Context-Engineering
/plugin install Agent-Skills-for-Context-Engineering@context-engineering-marketplace

Quality score

Verified
98/100
Analyzed 1 day ago

Trust signals

Last commit: about 1 month ago
Stars: 15.6k
License: MIT

Similar extensions

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Design Workflow

100

Anti-AI-generic design guidelines. Use when creating UI prototypes, reviewing designs for generic AI patterns, or setting up a project design system.

Skill
spartan-stratos

Codex PR Review

100

Reviews pull requests in Drupal 11 (or other) projects following the Codex methodology (business logic, hook/query edge cases, security, performance, completeness). Generates a .md report in the detected IDE folder (.antigravity/, .cursor/, .vscode/, or docs/) with findings grouped by severity and actionable fixes. Use when the user asks for a "Codex review", "PR review", or "review PR".

Skill
j4rk0r

Test A2a Interop

100

Test A2A interoperability between agents by validating Agent Card conformance, exercising all task lifecycle states, and verifying streaming and error handling. Use when verifying a new A2A server implementation before deployment, validating interoperability between two or more A2A agents, running conformance tests in CI/CD for A2A services, debugging failures in multi-agent A2A workflows, or certifying that an agent meets A2A protocol requirements for a registry.

Skill
pjt222

Run Puzzle Tests

100

Run the jigsawR test suite via WSL R execution. Supports full suite, filtered by pattern, or single file. Interprets pass/fail/skip counts and identifies failing tests. Never uses --vanilla flag (renv needs .Rprofile for activation). Use after modifying any R source code, after adding a new puzzle type or feature, before committing changes to verify nothing is broken, or when debugging a specific test failure.

Skill
pjt222

Telegram Crabbox E2e Proof

100

Use when reviewing, reproducing, or proving OpenClaw Telegram behavior with a real Telegram user on Crabbox, including PR review workflows that need an agent-controlled Telegram Desktop recording, TDLib user-driver commands, Convex-leased credentials, WebVNC observation, and motion-trimmed artifacts.

Skill
steipete