
LLM Evaluation

Skill · Verified · Active

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Purpose

To empower users to systematically evaluate LLM applications using a comprehensive suite of automated metrics, human feedback mechanisms, and benchmarking tools, ensuring high quality and reliability.

Features

  • Implement automated metrics (BLEU, ROUGE, BERTScore, Accuracy, Precision/Recall/F1, MRR, NDCG)
  • Facilitate human evaluation with defined dimensions (Accuracy, Coherence, Relevance, Fluency, Safety)
  • Support LLM-as-judge patterns for response comparison and scoring
  • Provide Python code examples for quick start and metric calculation
  • Enable benchmarking and regression detection for LLM applications
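As an illustrative sketch (not the skill's actual code), the classification and ranking metrics listed above can be computed with plain Python; `precision_recall_f1` treats predictions and references as label sets, and `mean_reciprocal_rank` averages the reciprocal rank of the first relevant item:

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for predicted vs. gold labels."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # true positives: labels present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: average of 1/rank of the first relevant item in each ranking."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)
```

BLEU, ROUGE, and BERTScore involve more machinery (n-gram overlap, longest common subsequence, contextual embeddings) and are usually taken from an established library rather than reimplemented.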

Use Cases

  • Measuring LLM application performance systematically
  • Comparing different models or prompts for specific tasks
  • Detecting performance regressions before deploying LLM applications
  • Validating improvements from prompt engineering or model fine-tuning

Non-Goals

  • Performing LLM fine-tuning or model training
  • Deploying LLM applications to production environments
  • General-purpose code development or debugging unrelated to LLM evaluation

Code Execution

  • Validation: While the code demonstrates structured inputs and outputs for its functions, explicit schema-validation libraries such as Zod or Pydantic were not evident in the provided snippets.
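If schema validation were desired, a lightweight standard-library approach might look like the following; the `EvalRecord` shape is hypothetical, and a library such as Pydantic would be the more common choice in practice:

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """Hypothetical record for one evaluation result, with basic schema checks."""
    prompt: str
    response: str
    score: float  # expected to lie in [0, 1]

    def __post_init__(self):
        # Validate field types and ranges at construction time
        if not isinstance(self.prompt, str) or not self.prompt:
            raise ValueError("prompt must be a non-empty string")
        if not isinstance(self.response, str):
            raise ValueError("response must be a string")
        if not isinstance(self.score, (int, float)) or not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be a number in [0, 1]")
```

Pydantic would additionally provide coercion, JSON (de)serialization, and richer error reporting for nested structures.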

Installation

Add the marketplace first:

/plugin marketplace add wshobson/agents
/plugin install llm-application-dev@claude-code-workflows

Quality Score

Verified
96/100
Analyzed 1 day ago

Trust Signals

Last commit: 3 days ago
Stars: 35.3k
License: MIT

Similar Extensions

TradeMemory Protocol

100

Domain knowledge for the Evolution Engine — enables LLMs to autonomously discover strategies from raw OHLCV data. Covers the generate-backtest-select-evolve loop, vectorized backtesting, out-of-sample validation, and strategy gradients. Use when discovering trading patterns, running backtests, evolving strategies, or reviewing evolution logs. Triggered by "evolve", "discover patterns", "backtest", "evolution", "strategy generation", "candidate strategy".

Skill
mnemox-ai

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Evaluating Llms Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

Nemo Evaluator Sdk

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

Lm Evaluation Harness

98

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research