
LM Evaluation Harness

Skill · Verified · Active

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Purpose

To provide a standardized, robust, and user-friendly tool for evaluating and comparing LLM performance across a broad range of academic benchmarks.

Features

  • Evaluates LLMs across 60+ academic benchmarks
  • Supports HuggingFace, vLLM, and API-based models
  • Offers detailed documentation for custom tasks and distributed evaluation
  • Provides examples for common workflows like model comparison and training progress tracking
  • Industry standard used by major AI labs

Use Cases

  • Benchmarking LLM quality for research papers
  • Comparing performance between different LLMs
  • Tracking LLM training progress over time
  • Validating model outputs against standardized metrics
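The model-comparison use case boils down to diffing per-task scores from two evaluation runs. The sketch below is a minimal, hypothetical illustration: the task names are real benchmark names, but the scores are made-up placeholders, and the `compare` helper is not part of the harness itself.

```python
# Hypothetical sketch of the "comparing models" use case: given two
# per-task score dicts (as you might extract from each model's results
# file), compute the score delta per shared task. Scores below are
# illustrative placeholders, not real benchmark numbers.
def compare(scores_a: dict, scores_b: dict) -> dict:
    """Return score_b - score_a for every task both models were run on."""
    shared = scores_a.keys() & scores_b.keys()
    return {task: round(scores_b[task] - scores_a[task], 4) for task in shared}

model_a = {"mmlu": 0.452, "gsm8k": 0.310, "hellaswag": 0.620}
model_b = {"mmlu": 0.491, "gsm8k": 0.405, "hellaswag": 0.615}

deltas = compare(model_a, model_b)
for task, delta in sorted(deltas.items()):
    print(f"{task:10s} {delta:+.4f}")
```

Restricting the diff to shared tasks avoids misleading comparisons when the two runs were configured with different task lists.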

Non-Goals

  • Evaluating LLMs on proprietary, non-academic tasks
  • Providing a platform for training or fine-tuning LLMs
  • Offering a judgment on model 'intelligence' beyond benchmark scores

Workflow

  1. Configure model and tasks
  2. Run the evaluation with the lm_eval command
  3. Analyze results from output file
  4. Troubleshoot common issues based on documentation
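Step 3 of the workflow (analyzing the output file) can be sketched as follows. This is a stdlib-only illustration: the sample JSON mirrors the general shape of the harness's output (a top-level "results" mapping of task name to metric values), but the exact metric keys, such as "acc,none", vary by task and harness version, and the numbers here are placeholders.

```python
import json
from pathlib import Path

# Write a sample results file mirroring the harness's output shape.
# Metric key names and values below are illustrative assumptions.
sample = {
    "results": {
        "hellaswag": {"acc,none": 0.48, "acc_norm,none": 0.62},
        "gsm8k": {"exact_match,strict-match": 0.31},
    }
}
path = Path("results.json")
path.write_text(json.dumps(sample))

# Analyze: load the output file and report each task's metrics.
results = json.loads(path.read_text())["results"]
for task, metrics in sorted(results.items()):
    for metric, value in sorted(metrics.items()):
        print(f"{task:12s} {metric:25s} {value:.3f}")
```

In a real run you would point `path` at the file produced by the lm_eval command's output option instead of writing a sample.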

Prerequisites

  • Python 3.8+
  • pip package manager
  • CUDA-enabled GPU (recommended for speed)

Installation

First, add the marketplace:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar Extensions

Evaluating Llms Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Scholar Evaluation

98

Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and writing with quantitative scoring and actionable feedback.

Skill
K-Dense-AI

Nemo Evaluator Sdk

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research

Literature Review

100

Conduct comprehensive, systematic literature reviews using multiple academic databases (PubMed, arXiv, bioRxiv, Semantic Scholar, etc.). This skill should be used when conducting systematic literature reviews, meta-analyses, research synthesis, or comprehensive literature searches across biomedical, scientific, and technical domains. Creates professionally formatted markdown documents and PDFs with verified citations in multiple citation styles (APA, Nature, Vancouver, etc.).

Skill
K-Dense-AI