Agent Evaluation

Skill Verified Active

Use when testing skills, commands, or agents for quality. Use after creating new skills, before deploying agents, or when debugging inconsistent agent behavior. Triggers on "evaluate", "test quality", "is this skill working", or QA of AI workflows.

Purpose

Systematically assess and improve the quality and reliability of AI agents and skills through a structured evaluation process.

Features

Structured 5-dimension evaluation rubric
Methods for direct scoring and LLM-as-judge
Bias detection and mitigation strategies
Workflow examples for new skills and agent QA
Test case design principles including edge cases
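
The rubric's five dimensions are not listed on this page, so the following is only a minimal sketch of how a weighted five-dimension rubric could be scored. The dimension names, weights, 1-5 scale, and 3.5 pass threshold are all hypothetical placeholders, not the skill's actual rubric.

```python
from dataclasses import dataclass

# Hypothetical dimensions and weights (assumption: the real rubric is not shown here).
RUBRIC_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "clarity": 0.20,
    "robustness": 0.15,
    "efficiency": 0.15,
}


@dataclass
class RubricResult:
    weighted_score: float  # weighted mean on the 1-5 scale, rounded to 2 decimals
    passed: bool           # True when the rounded score meets the threshold


def score_rubric(scores: dict[str, int], threshold: float = 3.5) -> RubricResult:
    """Combine per-dimension scores (1-5) into one weighted score."""
    missing = set(RUBRIC_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in RUBRIC_WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    weighted = round(sum(w * scores[d] for d, w in RUBRIC_WEIGHTS.items()), 2)
    return RubricResult(weighted_score=weighted, passed=weighted >= threshold)
```

A direct-scoring run would fill `scores` from a human reviewer or a judge model and compare `weighted_score` against whatever threshold was agreed on beforehand.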

Use Cases

Evaluating new skills before deployment
Debugging inconsistent agent behavior
Comparing different agent approaches or models
Performing systematic QA reviews of AI workflows
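
When comparing two agent approaches with an LLM judge, one commonly used guard against position bias is to ask the judge twice with the answers swapped and accept only a consistent verdict. The sketch below illustrates that idea under stated assumptions: the `Judge` signature, the `"first"`/`"second"`/`"tie"` labels, and both toy judges are hypothetical stand-ins, not part of the skill.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", "tie", or "inconsistent" after judging in both orders."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")
    # A verdict that flips with presentation order signals position bias.
    return forward_pick if forward_pick == backward_pick else "inconsistent"


def always_first_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first answer."""
    return "first"


def longer_answer_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With these toy judges, `always_first_judge` yields `"inconsistent"` for any pair, while `longer_answer_judge` gives the same winner in both orders. Note that the second judge is still biased toward length, which a swap alone does not detect.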

Non-Goals

Performing one-off manual reviews of simple outputs
Evaluating purely creative or subjective tasks
Running in latency-sensitive real-time paths, where evaluation would add significant delay

Workflow

Define criteria and thresholds
Create test cases (easy, medium, hard, adversarial)
Run direct scoring or LLM-as-judge evaluation
Compare outputs and validate against ground truth
Monitor agreement and iterate on prompts/skills
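
The steps above can be sketched as a small harness. This is an illustrative outline only: `EvalCase`, `run_eval`, `agreement_rate`, the exact-match check, and the 0.8 threshold are hypothetical names and choices, and a real run would likely swap the exact-match comparison for rubric scoring or an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # ground-truth answer
    tier: str      # "easy", "medium", "hard", or "adversarial"


def run_eval(agent: Callable[[str], str], cases: list[EvalCase], threshold: float = 0.8) -> dict:
    """Run every case, compare to ground truth, and report pass rates per tier."""
    if not cases:
        raise ValueError("at least one test case is required")
    per_tier: dict[str, list[bool]] = {}
    for case in cases:
        # Assumption: case-insensitive exact match stands in for the real scorer.
        ok = agent(case.prompt).strip().lower() == case.expected.strip().lower()
        per_tier.setdefault(case.tier, []).append(ok)
    pass_rates = {tier: sum(r) / len(r) for tier, r in per_tier.items()}
    overall = sum(sum(r) for r in per_tier.values()) / len(cases)
    return {"pass_rates": pass_rates, "overall": overall, "meets_threshold": overall >= threshold}


def agreement_rate(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Fraction of items on which two evaluators (e.g. judge and human) agree."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
```

Tracking `agreement_rate` between the automated judge and a human-labelled sample over time gives a simple signal for when the judge prompt or the skill itself needs another iteration.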

Practices

Evaluation methodology
Quality assurance
Agent testing
Rubric design

Installation

npx skills add guia-matthieu/clawfu-skills

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified

99/100

Analyzed 1 day ago

Trust Signals

Last commit: about 1 month ago

GitHub owner: guia-matthieu

Stars: 104

License: MIT

Website: clawfu.com

Similar Extensions

Telegram Crabbox E2E Proof

100

Use when reviewing, reproducing, or proving OpenClaw Telegram behavior with a real Telegram user on Crabbox, including PR review workflows that need an agent-controlled Telegram Desktop recording, TDLib user-driver commands, Convex-leased credentials, WebVNC observation, and motion-trimmed artifacts.

Skill

steipete

Review Skill Format

Review a SKILL.md file for compliance with the agentskills.io standard. Checks YAML frontmatter fields, required sections, line count limits, procedure step format, and registry synchronization. Use when a new skill needs format validation before merge, an existing skill has been modified and requires re-validation, performing a batch audit of all skills in a domain, or reviewing a contributor's skill submission in a pull request.

Skill

pjt222

Init

100

Creates, updates, or optimizes an AGENTS.md file for a repository with minimal, high-signal instructions covering non-discoverable coding conventions, tooling quirks, workflow preferences, and project-specific rules that agents cannot infer from reading the codebase. Use when setting up agent instructions or Claude configuration for a new repository, when an existing AGENTS.md is too long, generic, or stale, when agents repeatedly make avoidable mistakes, or when repository workflows have changed and the agent configuration needs pruning. Applies a discoverability filter—omitting anything Claude can learn from README, code, config, or directory structure—and a quality gate to verify each line remains accurate and operationally significant.

Skill

mcollina

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill

muratcankoylan

QA

100

Interactive QA session where user reports bugs or issues conversationally, and the agent files GitHub issues. Explores the codebase in the background for context and domain language. Use when user wants to report bugs, do QA, file issues conversationally, or mentions "QA session".

Skill

mattpocock

Context7 CLI

100

Use the ctx7 CLI to fetch library documentation, manage AI coding skills, and configure Context7 MCP. Activate when the user mentions "ctx7" or "context7", needs current docs for any library, wants to install/search/generate skills, or needs to set up Context7 for their AI coding agent.

Skill

upstash