Eval
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
Purpose: provide a structured, objective way to assess the performance and quality of agent results within an AgentHub session.
Features
- Evaluate agent results by metric
- Evaluate agent results using LLM judge
- Support for hybrid evaluation modes
- Rank agent results for a session
- Update session state after evaluation
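As a rough illustration of the metric, judge, and hybrid modes listed above, the sketch below ranks results by a numeric metric and consults a judge score only to break ties. It is not the skill's actual implementation: `AgentResult`, `rank_results`, and `toy_judge` are hypothetical names, and the toy judge stands in for a real LLM call so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    """Hypothetical record for one agent's output in a session."""
    agent: str
    metric: float      # assumed: higher is better
    output: str = ""


def rank_results(
    results: list[AgentResult],
    judge: Optional[Callable[[AgentResult], float]] = None,
) -> list[AgentResult]:
    """Rank by metric; use the optional judge score only to break ties."""
    def key(r: AgentResult) -> tuple[float, float]:
        judge_score = judge(r) if judge is not None else 0.0
        return (r.metric, judge_score)

    # sorted() is stable, so without a judge, tied results keep input order.
    return sorted(results, key=key, reverse=True)


def toy_judge(r: AgentResult) -> float:
    """Stand-in for an LLM judge: simply prefers longer outputs."""
    return float(len(r.output))


results = [
    AgentResult("agent-1", 0.82, "short answer"),
    AgentResult("agent-2", 0.91, "detailed answer"),
    AgentResult("agent-3", 0.91, "detailed answer with tests"),
]

ranked = rank_results(results, judge=toy_judge)
print([r.agent for r in ranked])
```

Here `agent-2` and `agent-3` tie on the metric, so the judge score decides their order. Calling `rank_results(results)` without a judge gives a metric-only ranking.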
Use Cases
- Use when comparing multiple agent runs in a session.
- Use to objectively rank agent performance based on predefined metrics.
- Use when qualitative assessment of agent outputs is needed to break ties or provide context.
- Use after an agent session concludes to determine the best performing agent.
Non-Goals
- Running agent sessions themselves.
- Modifying agent configurations or parameters.
- Directly merging or deploying agent results.
Installation
First, add the marketplace:
/plugin marketplace add alirezarezvani/claude-skills
Then install the plugin:
/plugin install agenthub@claude-code-skills
Similar Extensions
Context Compression
This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.
Horizon Track
Track long-horizon objectives across multiple sessions with milestone checkpoints, progress persistence, and drift detection.
Treat
Prune a bloated session with a prescription. Removes progress ticks, stale reads, duplicate content, and more.
Guard
Protect Claude Code sessions from context overflow by running a background daemon that monitors session size and auto-prunes before compaction hits. Use when the user says "guard", "protect session", "context getting long", "prevent compaction", "session management", or is running agent teams that need continuous context protection.
Claude Handoff
Run /handoff to capture session data, then write a phased implementation plan that references it. Creates beads for tracking.
List Topics
Use when the user asks about topics discussed in the current session, wants to see a topic list, or asks what has been talked about.