
LLM Evaluation

Skill · Verified · Active

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Purpose

To empower users to systematically evaluate LLM applications using a comprehensive suite of automated metrics, human feedback mechanisms, and benchmarking tools, ensuring high quality and reliability.

Features

  • Implement automated metrics (BLEU, ROUGE, BERTScore, Accuracy, Precision/Recall/F1, MRR, NDCG)
  • Facilitate human evaluation with defined dimensions (Accuracy, Coherence, Relevance, Fluency, Safety)
  • Support LLM-as-judge patterns for response comparison and scoring
  • Provide Python code examples for quick start and metric calculation (an illustrative sketch follows this list)
  • Enable benchmarking and regression detection for LLM applications
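The skill advertises Python quick-start examples, but none are reproduced on this page. As a minimal illustrative sketch, the snippet below computes two of the metric families named above: Precision/Recall/F1 via scikit-learn and MRR by hand. The mean_reciprocal_rank helper and the toy data are assumptions for illustration, not code shipped with the skill.

```python
# Illustrative only: two of the automated metrics listed above.
# mean_reciprocal_rank is a hypothetical helper, not part of the skill.
from sklearn.metrics import precision_recall_fscore_support


def mean_reciprocal_rank(ranked_results, relevant_items):
    """MRR: average of 1 / rank of the first relevant hit per query."""
    reciprocal_ranks = []
    for results, relevant in zip(ranked_results, relevant_items):
        rr = 0.0
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Classification-style scoring, e.g. labelled pass/fail judgements
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")

# Retrieval-style scoring
mrr = mean_reciprocal_rank(
    ranked_results=[["doc3", "doc1"], ["doc2", "doc5"]],
    relevant_items=[{"doc1"}, {"doc2"}],
)
print(f"MRR={mrr:.2f}")  # (0.5 + 1.0) / 2 = 0.75
```

BLEU, ROUGE, and BERTScore follow the same pattern through their respective libraries (e.g. nltk, rouge-score, bert-score).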

Use Cases

  • Measuring LLM application performance systematically
  • Comparing different models or prompts for specific tasks
  • Detecting performance regressions before deploying LLM applications (see the baseline-comparison sketch after this list)
  • Validating improvements from prompt engineering or model fine-tuning
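To make the regression-detection use case concrete, here is a minimal sketch assuming scores are kept as a JSON baseline on disk. The file name eval_baseline.json, the 0.02 tolerance, and the check_regression helper are illustrative assumptions, not conventions defined by the skill.

```python
# Illustrative regression gate: compare fresh evaluation scores against a
# stored baseline and fail the run if any metric drops beyond a tolerance.
import json
from pathlib import Path

BASELINE_FILE = Path("eval_baseline.json")  # hypothetical location
TOLERANCE = 0.02  # hypothetical: allow at most a 0.02 drop per metric


def check_regression(current_scores):
    """Return the metrics that regressed beyond TOLERANCE."""
    if not BASELINE_FILE.exists():
        # First run: persist the scores as the baseline and pass.
        BASELINE_FILE.write_text(json.dumps(current_scores, indent=2))
        return []
    baseline = json.loads(BASELINE_FILE.read_text())
    return [
        metric
        for metric, score in current_scores.items()
        if metric in baseline and baseline[metric] - score > TOLERANCE
    ]


regressions = check_regression({"rougeL": 0.41, "f1": 0.78})
if regressions:
    raise SystemExit(f"Regression detected in: {', '.join(regressions)}")
```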

Non-Goals

  • Performing LLM fine-tuning or model training
  • Deploying LLM applications to production environments
  • General-purpose code development or debugging unrelated to LLM evaluation

Code Execution

  • Validation: While the code demonstrates structured inputs and outputs for its functions, explicit schema-validation libraries such as Zod or Pydantic were not evident in the provided snippets.
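For readers who want the explicit schema validation noted above, a minimal sketch with Pydantic v2 might look like the following. The EvalInput and EvalResult models and their fields are assumptions for illustration; the skill's snippets do not define these schemas.

```python
# Hypothetical Pydantic v2 models for a single evaluation record.
# Field names are assumptions, not schemas defined by the skill.
from pydantic import BaseModel, Field


class EvalInput(BaseModel):
    prompt: str
    reference: str


class EvalResult(BaseModel):
    input: EvalInput
    prediction: str
    scores: dict[str, float] = Field(default_factory=dict)


record = EvalResult(
    input=EvalInput(prompt="Summarize the article...", reference="Gold summary."),
    prediction="Model-generated summary.",
    scores={"rougeL": 0.42},
)
print(record.model_dump_json(indent=2))  # validated and serialized
```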

Installation

First, add the marketplace:

/plugin marketplace add wshobson/agents
/plugin install llm-application-dev@claude-code-workflows

Quality Score

Verified
96 / 100
Analyzed 1 day ago

Trust Signals

Last commit: 3 days ago
Stars: 35.3k
License: MIT
Status
View source code

Similar Extensions

TradeMemory Protocol

100

Domain knowledge for the Evolution Engine: LLM-assisted autonomous strategy discovery from raw OHLCV data. Covers the generate-backtest-select-evolve loop, vectorized backtesting, out-of-sample validation, and strategy graduation. Use when discovering trading patterns, running backtests, evolving strategies, or reviewing evolution logs. Triggers on "evolve", "discover patterns", "backtest", "evolution", "strategy generation", "candidate strategy".

Skill
mnemox-ai

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Evaluating LLMs Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

NeMo Evaluator SDK

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

LM Evaluation Harness

98

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research