Agent Evaluation

Skill, verified and active

Use when testing skills, commands, or agents for quality. Use after creating new skills, before deploying agents, or when debugging inconsistent agent behavior. Triggers on "evaluate", "test quality", "is this skill working", or QA of AI workflows.

Purpose

Systematically assess and improve the quality and reliability of AI agents and skills through a structured evaluation process.

Features

Structured 5-dimension evaluation rubric
Methods for direct scoring and LLM-as-judge
Bias detection and mitigation strategies
Workflow examples for new skills and agent QA
Test case design principles including edge cases
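
As a rough illustration of direct scoring against a multi-dimension rubric, the sketch below combines per-dimension scores into one weighted score. The listing does not name the five dimensions, so the dimension names, the weights, and the 1-5 scale here are all assumptions chosen for the example, not the skill's actual rubric.

```python
# Hypothetical rubric: the dimension names and weights are illustrative only.
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.15,
    "instruction_following": 0.20,
    "safety": 0.10,
}


def weighted_score(scores: dict[str, int],
                   weights: dict[str, float] = RUBRIC_WEIGHTS) -> float:
    """Combine per-dimension scores (assumed 1-5) into one weighted score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in weights:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")
    # Normalize by the total weight so the result stays on the 1-5 scale
    # even if the weights do not sum exactly to 1.
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


example_scores = {
    "accuracy": 4,
    "completeness": 5,
    "clarity": 3,
    "instruction_following": 4,
    "safety": 5,
}
print(round(weighted_score(example_scores), 2))
```

The same scores could come from a human reviewer or from an LLM judge; keeping the aggregation separate from the scoring source makes the two methods easy to compare.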

Use cases

Evaluating new skills before deployment
Debugging inconsistent agent behavior
Comparing different agent approaches or models
Performing systematic QA reviews of AI workflows
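
When comparing two agents or models with an LLM judge, one commonly used bias mitigation is to run each comparison twice with the candidate order swapped and accept a winner only when both runs agree. The sketch below shows that idea. It is not taken from the skill itself: the `Judge` signature, the verdict strings, and the two stub judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable

# Hypothetical judge interface: takes (prompt, first_output, second_output)
# and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str,
                      output_a: str, output_b: str) -> str:
    """Judge twice with the order swapped; return "A", "B", or "tie"."""
    first_pass = judge(prompt, output_a, output_b)
    second_pass = judge(prompt, output_b, output_a)
    # Map positional verdicts back to candidate labels.
    verdict_1 = {"first": "A", "second": "B"}.get(first_pass, "tie")
    verdict_2 = {"first": "B", "second": "A"}.get(second_pass, "tie")
    # A winner counts only if both orderings agree; otherwise call it a tie.
    return verdict_1 if verdict_1 == verdict_2 else "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers the first slot."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Stub judge that is position-independent: prefers the longer output."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(compare_with_swap(always_first, "q", "answer one", "answer two"))
print(compare_with_swap(longer_wins, "q", "short", "a much longer answer"))
```

A judge that only ever favors one position produces disagreeing verdicts across the two runs, so its preference is discarded rather than counted as a win.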

Non-goals

Performing one-off manual reviews of simple outputs
Evaluating purely creative or subjective tasks
Adding significant latency to real-time evaluations

Workflow

Define criteria and thresholds
Create test cases (easy, medium, hard, adversarial)
Run direct scoring or LLM-as-judge evaluation
Compare outputs and validate against ground truth
Monitor agreement and iterate on prompts/skills
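
The steps above can be sketched end to end in a few lines: tag each test case with a difficulty tier, apply a pass threshold to its score, and measure how well the automated verdicts agree with human ground truth. Everything named here is an assumption for illustration, including the `EvalCase` fields, the 3.5 threshold, the sample data, and the choice of Cohen's kappa as the agreement measure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalCase:
    name: str
    difficulty: str   # "easy", "medium", "hard", or "adversarial"
    score: float      # from direct scoring or an LLM judge (1-5 assumed)
    human_pass: bool  # ground-truth verdict from a human reviewer


PASS_THRESHOLD = 3.5  # hypothetical threshold, set when defining criteria


def pass_rate_by_difficulty(cases: list[EvalCase],
                            threshold: float = PASS_THRESHOLD) -> dict[str, float]:
    """Fraction of cases at or above the threshold, per difficulty tier."""
    totals, passed = Counter(), Counter()
    for case in cases:
        totals[case.difficulty] += 1
        if case.score >= threshold:
            passed[case.difficulty] += 1
    return {tier: passed[tier] / totals[tier] for tier in totals}


def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Chance-corrected agreement between two lists of pass/fail verdicts."""
    if not a or len(a) != len(b):
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters gave every case the same single label
    return (observed - expected) / (1 - expected)


cases = [
    EvalCase("greeting", "easy", 4.5, True),
    EvalCase("multi-step", "medium", 3.8, True),
    EvalCase("ambiguous-spec", "hard", 3.0, False),
    EvalCase("prompt-injection", "adversarial", 4.0, False),
]
judge_pass = [c.score >= PASS_THRESHOLD for c in cases]
human_pass = [c.human_pass for c in cases]
print(pass_rate_by_difficulty(cases))
print(cohen_kappa(judge_pass, human_pass))
```

Low agreement on a particular tier, such as the adversarial case in this sample, is a signal to revisit the judge prompt or the rubric before trusting the scores.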

Practices

Evaluation methodology
Quality assurance
Agent testing
Rubric design

Installation

npx skills add guia-matthieu/clawfu-skills

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.

Quality score

Verified

99/100

Analyzed 1 day ago

Trust signals

Last commit: about 1 month ago

GitHub owner: guia-matthieu

Stars: 104

License: MIT

Website: clawfu.com

Similar extensions

Telegram Crabbox E2e Proof

100

Use when reviewing, reproducing, or proving OpenClaw Telegram behavior with a real Telegram user on Crabbox, including PR review workflows that need an agent-controlled Telegram Desktop recording, TDLib user-driver commands, Convex-leased credentials, WebVNC observation, and motion-trimmed artifacts.

Skill

steipete

Review Skill Format

Review a SKILL.md file for compliance with the agentskills.io standard. Checks YAML frontmatter fields, required sections, line count limits, procedure step format, and registry synchronization. Use when a new skill needs format validation before merge, an existing skill has been modified and requires re-validation, performing a batch audit of all skills in a domain, or reviewing a contributor's skill submission in a pull request.

Skill

pjt222

Init

100

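Create or refine a repository's AGENTS.md file with minimal, high-signal instructions covering the non-discoverable coding conventions, tooling quirks, workflow preferences, and project-specific rules that agents cannot infer from the codebase. Use when setting up agent instructions or Claude configuration for a new repository, when an existing AGENTS.md is too long, generic, or stale, when agents repeatedly make avoidable mistakes, or when repository workflows have changed and the agent configuration needs pruning. Applies a discoverability filter (omitting anything Claude can learn from the README, code, config, or directory structure) and a quality gate that verifies each line is still accurate and operationally meaningful.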

Skill

mcollina

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill

muratcankoylan

Qa

100

Interactive QA session in which the user reports bugs or issues conversationally and the agent files GitHub issues. Explores the codebase in the background for context and domain language. Use when the user wants to report bugs, do QA, file issues conversationally, or mentions "QA session".

Skill

mattpocock

Context7 Cli

100

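Use the ctx7 CLI to fetch library documentation, manage AI coding skills, and configure Context7 MCP. Activate when the user mentions "ctx7" or "context7", needs current documentation for any library, wants to install, search for, or generate skills, or needs to set up Context7 for an AI coding agent.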

Skill

upstash