
NeMo Evaluator SDK

Skill · Verified · Active

Part of: Agent Native Research Artifact (ARA) Tooling

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use it when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. It is NVIDIA's enterprise-grade platform, with a container-first architecture for reproducible benchmarking.

Purpose

To enable users to run scalable, reproducible, enterprise-grade LLM evaluations across a wide range of benchmarks and execution backends.

Features

Evaluate LLMs across 100+ benchmarks
Support for 18+ evaluation harnesses
Multi-backend execution (Docker, Slurm, Cloud)
Reproducible containerized evaluations
Export results to MLflow, W&B, or local JSON

Use cases

Benchmarking LLMs on standard academic tasks (MMLU, HumanEval, GSM8K)
Evaluating model performance on Slurm HPC clusters for large-scale experiments
Running reproducible LLM evaluations in a local Docker environment
Comparing multiple LLMs on the same set of tasks
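
As an illustration of the last use case, here is a minimal Python sketch of a side-by-side comparison. It does not use the NeMo Evaluator SDK API; it assumes per-task scores have already been exported (for example to local JSON) and loaded into plain dictionaries. The function name `compare_models`, the model names, and all scores are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical layout: {model_name: {task_name: score}}.
Scores = Dict[str, Dict[str, float]]


def compare_models(scores: Scores) -> List[dict]:
    """Return one row per task that every model has a score for."""
    if not scores:
        return []
    # Only tasks evaluated for all models can be compared fairly.
    shared = set.intersection(*(set(s) for s in scores.values()))
    rows = []
    for task in sorted(shared):
        per_model = {model: s[task] for model, s in scores.items()}
        # On a tie, max() keeps the first model in insertion order.
        best = max(per_model, key=per_model.get)
        rows.append({"task": task, "scores": per_model, "best": best})
    return rows


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real results.
    example = {
        "model-a": {"mmlu": 0.50, "gsm8k": 0.25, "humaneval": 0.75},
        "model-b": {"mmlu": 0.75, "gsm8k": 0.50},
    }
    for row in compare_models(example):
        print(row["task"], row["scores"], "best:", row["best"])
```

Restricting the comparison to tasks shared by every model keeps the table meaningful when one model was evaluated on a larger task set than another.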

Non-goals

Training or fine-tuning LLMs
Deploying LLMs for inference (though it can connect to deployed models)
Developing new LLM benchmarks or harnesses

Installation

First, add the marketplace

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified

98/100

Analyzed 1 day ago

Trust signals

Last commit: 17 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com

