LM Evaluation Harness
Skill · Verified · Active
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API-based models.
The goal is to provide a standardized, robust, and user-friendly tool for evaluating and comparing LLM performance across a broad range of academic benchmarks.
Features
- Evaluates LLMs across 60+ academic benchmarks
- Supports HuggingFace, vLLM, and API-based models (see the backend sketch after this list)
- Offers detailed documentation for custom tasks and distributed evaluation
- Provides examples for common workflows like model comparison and training progress tracking
- Industry standard used by major AI labs
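As a rough sketch of the multi-backend support, the same task can be pointed at different backends through the --model flag. The checkpoint name below (EleutherAI/pythia-1.4b) is chosen purely for illustration:

# Evaluate a local HuggingFace Transformers model
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-1.4b --tasks hellaswag --device cuda:0 --batch_size 8

# Same task, served through the vLLM backend for faster batched inference
lm_eval --model vllm --model_args pretrained=EleutherAI/pythia-1.4b --tasks hellaswag

# API-based models are selected the same way via --model; the exact backend
# names (e.g. for OpenAI-compatible endpoints) depend on the installed version.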
Use Cases
- Benchmarking LLM quality for research papers
- Comparing performance between different LLMs (see the comparison sketch after this list)
- Tracking LLM training progress over time
- Validating model outputs against standardized metrics
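A minimal comparison sketch, with checkpoint names and the results/ output directory chosen purely for illustration: run the same task set against each model, then compare the resulting JSON files.

# Model A
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-1.4b --tasks hellaswag,arc_easy --batch_size 8 --output_path results/pythia-1.4b

# Model B (or a later checkpoint of the same model, for training-progress tracking)
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-2.8b --tasks hellaswag,arc_easy --batch_size 8 --output_path results/pythia-2.8b

Because the tasks, few-shot settings, and prompts are held fixed across runs, score differences between the two output files can be attributed to the models themselves.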
Non-Goals
- Evaluating LLMs on proprietary, non-academic tasks
- Providing a platform for training or fine-tuning LLMs
- Offering a judgment on model 'intelligence' beyond benchmark scores
Workflow
- Configure model and tasks
- Run the evaluation with the lm_eval command (see the example after this list)
- Analyze the results written to the output file
- Troubleshoot common issues based on documentation
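A minimal end-to-end sketch of that workflow, assuming a CUDA GPU; the model, tasks, and output path are illustrative choices, not defaults:

# 1. Configure the model and tasks, then run the evaluation
lm_eval --model hf --model_args pretrained=EleutherAI/gpt-j-6B --tasks mmlu,gsm8k --num_fewshot 5 --device cuda:0 --batch_size 8 --output_path results/gpt-j-6b

# 2. Analyze the results: per-task metrics are written as JSON under the output path
find results/gpt-j-6b -name "*.json"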
Prerequisites
- Python 3.8+
- pip package manager
- CUDA-enabled GPU (recommended for speed)
Installation
Add the marketplace first
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills
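The plugin wraps EleutherAI's lm-evaluation-harness; the harness itself is typically installed from PyPI or from source. A minimal sketch (the vllm extra is assumed to be needed only when the vLLM backend is used):

# Install the released package
pip install lm-eval

# Optional extras for specific backends, e.g. vLLM
pip install "lm-eval[vllm]"

# Or install from source for the latest tasks
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .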
Similar Extensions
Evaluating Llms Harness (score 99)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Context Compression (score 100)
This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.
Scholar Evaluation (score 98)
Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and writing with quantitative scoring and actionable feedback.
Nemo Evaluator Sdk (score 98)
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
BigCode Evaluation Harness (score 98)
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project used by HuggingFace leaderboards.
Literature Review (score 100)
Conduct comprehensive, systematic literature reviews using multiple academic databases (PubMed, arXiv, bioRxiv, Semantic Scholar, etc.). This skill should be used when conducting systematic literature reviews, meta-analyses, research synthesis, or comprehensive literature searches across biomedical, scientific, and technical domains. Creates professionally formatted markdown documents and PDFs with verified citations in multiple citation styles (APA, Nature, Vancouver, etc.).