Advanced Evaluation

Skill Verified Active

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Purpose

Helps users build robust, unbiased LLM evaluation systems by applying advanced LLM-as-a-judge techniques and production-grade patterns.

Features

  • Implement LLM-as-judge evaluation pipelines
  • Perform pairwise comparison with position bias mitigation
  • Generate domain-specific scoring rubrics
  • Mitigate systematic biases in LLM evaluations
  • Select appropriate metrics and evaluation strategies
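
The listing does not show how the skill performs position bias mitigation, so the following is only a minimal sketch of one common approach: judge each pair twice with the order swapped, and accept a winner only when both passes agree. The `Judge` callable, the function `compare_with_position_swap`, and the two toy judges are hypothetical names introduced for illustration; a real judge would wrap an LLM call.

```python
from typing import Callable, Literal

Verdict = Literal["first", "second", "tie"]
# Hypothetical judge signature (an assumption): (prompt, first, second) -> verdict
Judge = Callable[[str, str, str], Verdict]


def compare_with_position_swap(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    # Pass 1: response A is shown first, B second.
    pass1 = judge(prompt, a, b)
    # Pass 2: same pair, presentation order swapped.
    pass2 = judge(prompt, b, a)

    # Map positional verdicts back to the underlying responses.
    winner1 = {"first": "A", "second": "B", "tie": "tie"}[pass1]
    winner2 = {"first": "B", "second": "A", "tie": "tie"}[pass2]

    # Accept a winner only when both orders agree; otherwise report a tie.
    return winner1 if winner1 == winner2 else "tie"


# Toy judges standing in for an LLM call (illustration only).
def always_first(prompt: str, first: str, second: str) -> Verdict:
    return "first"  # a maximally position-biased judge


def longer_wins(prompt: str, first: str, second: str) -> Verdict:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With `always_first`, the two passes disagree about which response won, so the result falls back to `"tie"`. With `longer_wins`, the verdict follows the content rather than the position, so both passes agree. Treating disagreement as a tie is a conservative design choice; logging the disagreement rate is another option.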

Use cases

  • Building automated evaluation systems for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Designing A/B tests for prompt or model changes

Non-goals

  • Performing actual LLM generation
  • Evaluating non-textual outputs
  • Providing a generic prompt engineering skill

Trust

  • Issues (attention): In the last 90 days, 6 issues were opened and 2 were closed, a closure rate of 33%. Maintainer engagement is slow but present, and the number of open issues is relatively small.

Installation

Add the marketplace first, then install the plugin:

/plugin marketplace add muratcankoylan/Agent-Skills-for-Context-Engineering
/plugin install Agent-Skills-for-Context-Engineering@context-engineering-marketplace

Quality score

Verified
96/100
Analyzed 1 day ago

Trust signals

Last commit: about 1 month ago
Stars: 15.6k
License: MIT

Similar extensions

Evaluation

98

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.

Skill
muratcankoylan

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

LinkedIn Humanizer

100

Scrub AI tells from any text draft, or audit a finished post against the 2026 algorithm heuristic checklist. Tier-based rewriter (forensic / strict / aesthetic / all) plus `--mode audit` for a detection-only pass/fail review covering length, hook, CTA, format penalties, and AI vocabulary. Sub-tools: emoji-pattern detector, multi-detector spread tester (GPTZero, Originality.ai, ZeroGPT, Sapling, Copyleaks), rule explainer. Triggers on "humanize", "de-AI", "review this draft", "audit before posting", "is this ready".

Skill
sergebulaev

Convert Resume to Markdown

100

Convert a resume PDF to clean markdown for LLM parsing or candidate pipelines.

Skill
iterationlayer

Sentiment Analyzer

100

Analyze sentiment in text using ML models. Use when analyzing customer reviews, processing NPS feedback, monitoring brand mentions, evaluating campaign responses, or categorizing support tickets.

Skill
guia-matthieu

LangSmith Observability

99

LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.

Skill
Orchestra-Research