Agent Evaluation
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
Provides systematic methods and best practices for evaluating and improving the performance, reliability, and quality of AI agents and their components.
Features
- Structured evaluation methodologies (LLM-as-Judge, Human Eval)
- Comprehensive rubric design with scoring guidelines
- Techniques for mitigating LLM evaluation biases
- Practical prompt patterns and workflow examples
- Guidance on test case design and iteration
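The LLM-as-Judge methodology listed above typically pairs a grading prompt with a weighted rubric. The sketch below shows only the rubric-aggregation half; the criterion names, weights, and 1-5 scale are illustrative assumptions, and the judge call that would produce the per-criterion scores is elided.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance; weights should sum to 1.0
    score: int     # judge-assigned score on a 1-5 scale

def rubric_score(criteria: list[Criterion]) -> float:
    """Collapse per-criterion judge scores into one weighted 0-100 score."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        raise ValueError("rubric has no weight")
    # Normalize each 1-5 score to 0-1, then take the weighted mean.
    weighted = sum(c.weight * (c.score - 1) / 4 for c in criteria)
    return 100 * weighted / total_weight

# Hypothetical rubric for grading a Claude Code command's output:
rubric = [
    Criterion("correctness", 0.5, 5),
    Criterion("clarity", 0.3, 4),
    Criterion("conciseness", 0.2, 3),
]
print(rubric_score(rubric))  # → 82.5
```

Keeping the aggregation deterministic like this means only the per-criterion grading is delegated to the judge model, which makes scores reproducible and easier to audit.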
Use cases
- Testing prompt effectiveness for AI agents
- Validating context engineering choices
- Measuring improvement quality of AI outputs
- Developing robust evaluation pipelines for AI systems
Non-goals
- Developing AI agents themselves
- Automating all aspects of AI evaluation without human oversight
- Providing domain-specific evaluation rubrics outside of general AI agent assessment
Practices
- Evaluation methodology
- Prompt engineering
- Test design
- Bias mitigation
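A common bias-mitigation technique for pairwise LLM judging is position swapping: run each comparison twice with the candidate order reversed and accept only verdicts that agree. A minimal sketch, assuming the judge's verdict has already been parsed into the strings "A", "B", or "tie":

```python
def debiased_verdict(verdict_ab: str, verdict_ba: str) -> str:
    """Combine two pairwise judge verdicts, the second with A and B swapped.

    verdict_ab: judge's pick when shown (A, B)
    verdict_ba: judge's pick when shown (B, A)
    Disagreement counts as a tie, which neutralizes position bias: a
    judge that always prefers the first-shown answer will contradict
    itself across the two orderings and never produce a winner.
    """
    # Map the swapped run's labels back into the original ordering.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[verdict_ba]
    return verdict_ab if verdict_ab == unswapped else "tie"

print(debiased_verdict("A", "B"))  # consistent across orderings → "A"
print(debiased_verdict("A", "A"))  # judge favored position 1 both times → "tie"
```

The same swap-and-agree pattern extends to other judge biases (e.g. length bias) by perturbing the presentation rather than the order.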
Versioning
- Release Management: While the trust signals indicate a recent commit date, no explicit version is declared in the manifest or CHANGELOG, and the installation instructions reference 'main'.
Installation
Add the marketplace first:
/plugin marketplace add NeoLabHQ/context-engineering-kit
/plugin install customaize-agent@context-engineering-kit
Similar extensions
Create Command (score: 100)
Interactive assistant for creating new Claude commands with proper structure, patterns, and MCP tool integration
Project Development (score: 100)
This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches.
Write A Skill (score: 100)
Create new agent skills with proper structure, progressive disclosure, and bundled resources. Use when the user wants to create, write, or build a new skill.
Context Compression (score: 100)
This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.
Arize Prompt Optimization (score: 100)
Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signals, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimizing a prompt, improving a prompt, making AI respond better, improving output quality, prompt engineering, prompt tuning, or system prompt improvement.
Prompt Optimization (score: 100)
Applies prompt repetition to improve accuracy for LLMs without reasoning capability.