NeMo Evaluator SDK
Skill · Verified · Active
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use it when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with a container-first architecture for reproducible benchmarking.
Purpose: to let users run scalable, reproducible, enterprise-grade evaluations of LLMs across a wide range of benchmarks and execution backends.
Features
- Evaluate LLMs across 100+ benchmarks
- Support for 18+ evaluation harnesses
- Multi-backend execution (Docker, Slurm, Cloud)
- Reproducible containerized evaluations
- Export results to MLflow, W&B, or local JSON
Use cases
- Benchmarking LLMs on standard academic tasks (MMLU, HumanEval, GSM8K)
- Evaluating model performance on Slurm HPC clusters for large-scale experiments
- Running reproducible LLM evaluations in a local Docker environment
- Comparing multiple LLMs on the same set of tasks
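The last use case, comparing several models on the same tasks, can be sketched in a few lines of Python once results have been exported as local JSON. This is a minimal illustration, not the SDK's API: the file layout (`{"model": ..., "scores": {task: value}}`) and the helper names `load_scores`, `compare`, and `best_per_task` are assumptions made for this example, and the real export format may differ.

```python
import json
from pathlib import Path


def load_scores(path):
    """Read one results file.

    Assumed (hypothetical) layout: {"model": str, "scores": {task: float}}.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["model"], data["scores"]


def compare(results):
    """Build {task: {model: score}}, keeping only tasks that every model ran.

    `results` maps model name -> {task: score}.
    """
    shared = set.intersection(*(set(scores) for scores in results.values()))
    return {
        task: {model: scores[task] for model, scores in results.items()}
        for task in sorted(shared)
    }


def best_per_task(table):
    """Return {task: name of the model with the highest score on that task}."""
    return {task: max(row, key=row.get) for task, row in table.items()}


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the data.
    results = {
        "model-a": {"mmlu": 0.61, "gsm8k": 0.48},
        "model-b": {"mmlu": 0.66, "gsm8k": 0.45, "humaneval": 0.30},
    }
    table = compare(results)
    for task, row in table.items():
        print(task, row)
    print(best_per_task(table))
```

Restricting the table to tasks shared by all models keeps the comparison like-for-like: a benchmark that only one model ran is dropped instead of being reported with a missing value.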
Non-goals
- Training or fine-tuning LLMs
- Deploying LLMs for inference (though it can connect to deployed models)
- Developing new LLM benchmarks or harnesses
Installation
First, add the marketplace, then install the plugin:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills
Similar extensions
NeMo Evaluator SDK
Quality score 98. Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use it when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with a container-first architecture for reproducible benchmarking.
Context Compression
Quality score 100. This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.
Evaluating LLMs Harness
Quality score 99. Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
LM Evaluation Harness
Quality score 98. Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
BigCode Evaluation Harness
Quality score 98. Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.
Benchmark
Quality score 100. Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".