
LM Evaluation Harness

Skill · Verified · Active

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Purpose

To provide a standardized, robust, and user-friendly tool for evaluating and comparing LLM performance across a broad range of academic benchmarks.

Features

  • Evaluates LLMs across 60+ academic benchmarks
  • Supports HuggingFace, vLLM, and API-based models
  • Offers detailed documentation for custom tasks and distributed evaluation
  • Provides examples for common workflows like model comparison and training progress tracking
  • Industry standard used by major AI labs

Use Cases

  • Benchmarking LLM quality for research papers
  • Comparing performance between different LLMs
  • Tracking LLM training progress over time
  • Validating model outputs against standardized metrics

Non-Goals

  • Evaluating LLMs on proprietary, non-academic tasks
  • Providing a platform for training or fine-tuning LLMs
  • Offering a judgment on model 'intelligence' beyond benchmark scores

Workflow

  1. Configure model and tasks
  2. Run the evaluation with the lm_eval command
  3. Analyze results from output file
  4. Troubleshoot common issues based on documentation
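The workflow above can be sketched as a single CLI invocation. This is an illustrative example, not output from this skill: the model name, task list, and output path are placeholders to adapt to your setup.

```shell
# Evaluate a HuggingFace model on two benchmarks.
# Model name, tasks, and output path are placeholders; adjust to your setup.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B \
    --tasks mmlu,gsm8k \
    --batch_size 8 \
    --output_path results/llama-3.1-8b.json
```

The resulting JSON in `--output_path` contains per-task metrics (e.g. accuracy), which you can then analyze in step 3.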

Prerequisites

  • Python 3.8+
  • pip package manager
  • CUDA-enabled GPU (recommended for speed)
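With the prerequisites in place, the harness itself can be installed from PyPI (package name `lm-eval`) or from source; the commands below are a typical setup sketch.

```shell
# Install the harness from PyPI.
pip install lm-eval

# Or install the latest development version from source:
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
```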

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status
View source

Similar Extensions

Evaluating Llms Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Scholar Evaluation

98

Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and writing with quantitative scoring and actionable feedback.

Skill
K-Dense-AI

Nemo Evaluator Sdk

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research

Literature Review

100

Conduct comprehensive, systematic literature reviews using multiple academic databases (PubMed, arXiv, bioRxiv, Semantic Scholar, etc.). This skill should be used when conducting systematic literature reviews, meta-analyses, research synthesis, or comprehensive literature searches across biomedical, scientific, and technical domains. Creates professionally formatted markdown documents and PDFs with verified citations in multiple citation styles (APA, Nature, Vancouver, etc.).

Skill
K-Dense-AI