NeMo Evaluator SDK
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
The SDK provides a scalable, reproducible platform for evaluating LLMs against a wide range of benchmarks, meeting enterprise requirements for benchmarking across diverse computing infrastructures.
Features
- Evaluate LLMs across 100+ benchmarks
- Supports 18+ evaluation harnesses spanning academic (MMLU), code (HumanEval), safety, and VLM benchmarks
- Multi-backend execution (Docker, Slurm, Cloud)
- Reproducible containerized evaluation
- Enterprise-grade platform with result export (MLflow, W&B); see the config sketch after this list
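The sketch below shows the shape such a configuration might take: one execution backend, one model endpoint, a task list with per-task overrides, and export destinations. The field names mirror a Hydra-style YAML layout but are illustrative assumptions, not the SDK's verbatim schema; the model id and endpoint URL are placeholders.

```python
# Illustrative evaluation config covering the features above. All field
# names are assumptions for illustration, not the SDK's actual schema.
import json

config = {
    "execution": {"backend": "local"},  # or "slurm" / a cloud backend
    "target": {
        "api_endpoint": {
            "model_id": "meta/llama-3.1-8b-instruct",  # placeholder model
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "api_key_name": "NGC_API_KEY",  # env var that holds the key
        }
    },
    "evaluation": {
        "tasks": [
            {"name": "mmlu"},
            # Per-task parameter override, as described in the workflow.
            {"name": "gsm8k", "overrides": {"num_fewshot": 8}},
        ]
    },
    # Result export targets named in the feature list above.
    "export": {"destinations": ["mlflow", "wandb"]},
}

print(json.dumps(config, indent=2))
```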
Use cases
- Running scalable LLM evaluations on local Docker instances
- Benchmarking LLMs on Slurm HPC clusters (see the backend-swap sketch after this list)
- Comparing multiple models on standard academic and industry benchmarks
- Ensuring reproducible LLM evaluations through containerization
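To make the multi-backend idea concrete, this snippet retargets the same evaluation from local Docker to a Slurm cluster by swapping only the execution section. Every key here (backend, hostname, account, partition) is a hypothetical placeholder in the spirit of Hydra overrides, not the SDK's schema.

```python
# Hypothetical executor settings; only the execution section changes
# between a local Docker run and a Slurm run (all keys illustrative).
local_exec = {"backend": "local"}
slurm_exec = {
    "backend": "slurm",
    "hostname": "login.cluster.example.com",  # SSH target for the cluster
    "account": "my_account",
    "partition": "gpu",
}

def with_backend(base_config: dict, execution: dict) -> dict:
    """Return a copy of the config with the execution section replaced."""
    return {**base_config, "execution": execution}

base = {"evaluation": {"tasks": [{"name": "mmlu"}]}}
print(with_backend(base, local_exec))
print(with_backend(base, slurm_exec))
```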
Non-goals
- Training or fine-tuning LLMs
- Providing raw model APIs
- General-purpose code generation or analysis beyond benchmark tasks
Workflow
- Configure evaluation parameters (execution backend, model endpoint, tasks)
- Select benchmarks and optionally override parameters per task
- Launch evaluation via CLI or Python API (a launch sketch follows this list)
- Monitor job status and retrieve results
- Export results for comparison and analysis
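A minimal sketch of the launch-and-monitor steps above, driving the launcher CLI from Python. The executable name "nemo-evaluator-launcher" and its "run"/"status" subcommands are assumptions about the installed CLI; substitute the commands your installation actually exposes.

```python
# Launch-and-monitor loop for the workflow above. CLI name and
# subcommands are assumed, not confirmed against the SDK docs.
import subprocess

def launch_eval(config_dir: str, config_name: str) -> str:
    """Start an evaluation and return whatever the launcher prints
    (assumed to include a job id)."""
    result = subprocess.run(
        ["nemo-evaluator-launcher", "run",
         "--config-dir", config_dir, "--config-name", config_name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def check_status(job_id: str) -> str:
    """Poll the status of a previously launched job."""
    result = subprocess.run(
        ["nemo-evaluator-launcher", "status", job_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    job_id = launch_eval("configs", "local_mmlu")  # hypothetical config name
    print(check_status(job_id))
```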
Practices
- Benchmarking
- LLM Evaluation
- Reproducible Computing
- Distributed Systems
Prerequisites
- Docker installed and running (for local execution)
- SSH access to Slurm cluster (for Slurm execution)
- NGC API Key (for container pulls and NVIDIA services)
- HF_TOKEN (required by some benchmarks); a pre-flight check sketch follows this list
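A small pre-flight check matching the prerequisites above. The environment variable names follow this list, assuming the NGC key is exported as NGC_API_KEY; probing the daemon with "docker info" is a common pattern, not an SDK requirement.

```python
# Pre-flight check for the prerequisites listed above. NGC_API_KEY is an
# assumed env var name for the NGC API key; HF_TOKEN comes from the list.
import os
import shutil
import subprocess

def check_prereqs(need_docker: bool = True) -> list[str]:
    problems = []
    if need_docker:
        if shutil.which("docker") is None:
            problems.append("docker CLI not found on PATH")
        elif subprocess.run(["docker", "info"],
                            capture_output=True).returncode != 0:
            # "docker info" fails when the daemon is not running.
            problems.append("docker daemon not reachable")
    for var in ("NGC_API_KEY", "HF_TOKEN"):
        if not os.environ.get(var):
            problems.append(f"{var} is not set")
    return problems

if __name__ == "__main__":
    for problem in check_prereqs():
        print("WARNING:", problem)
```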
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repository follows the agentskills.io format.
Similar extensions
Azure Container Registry SDK for Python (quality score 100)
Azure Container Registry SDK for Python. Use for managing container images, artifacts, and repositories. Triggers: "azure-containerregistry", "ContainerRegistryClient", "container images", "docker registry", "ACR".
Context Compression (quality score 100)
This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.
Evaluating Llms Harness (quality score 99)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Lm Evaluation Harness (quality score 98)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
BigCode Evaluation Harness (quality score 98)
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.