
NeMo Evaluator SDK

Skill · Verified · Active

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use it when you need scalable evaluation on local Docker, Slurm HPC clusters, or cloud platforms. NVIDIA's enterprise-grade platform uses a container-first architecture for reproducible benchmarking.

Purpose

To provide a scalable, reproducible platform for evaluating LLMs against a wide range of benchmarks, meeting enterprise benchmarking needs across local, HPC, and cloud infrastructures.

Features

  • Evaluate LLMs across 100+ benchmarks
  • Supports 18+ evaluation harnesses spanning academic (MMLU, GSM8K), code (HumanEval), safety, and VLM benchmarks
  • Multi-backend execution (Docker, Slurm, Cloud)
  • Reproducible containerized evaluation
  • Enterprise-grade platform with result export to MLflow and W&B (see the sketch below)
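
The SDK ships its own MLflow and W&B exporters; the sketch below is not that exporter, only a minimal illustration of what publishing a finished run's scores to MLflow looks like, assuming the results have already been collected into a plain dict (the model name and metric values are placeholders).

    # Illustration only: push one evaluation run's metrics to MLflow.
    # Assumes `results` was parsed from the evaluator's output; all values are placeholders.
    import mlflow

    results = {"mmlu_accuracy": 0.71, "gsm8k_accuracy": 0.84}   # placeholder numbers
    mlflow.set_experiment("llm-benchmarks")
    with mlflow.start_run(run_name="my-model"):                 # placeholder model name
        mlflow.log_params({"model": "my-model", "backend": "local"})
        mlflow.log_metrics(results)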

Use Cases

  • Running scalable LLM evaluations on local Docker instances
  • Benchmarking LLMs on Slurm HPC clusters
  • Comparing multiple models on standard academic and industry benchmarks
  • Ensuring reproducible LLM evaluations through containerization

Non-Goals

  • Training or fine-tuning LLMs
  • Providing raw model APIs
  • General-purpose code generation or analysis beyond benchmark tasks

Workflow

  1. Configure evaluation parameters (execution backend, model endpoint, tasks)
  2. Select benchmarks and optionally override parameters per task
  3. Launch the evaluation via the CLI or Python API (see the sketch after this list)
  4. Monitor job status and retrieve results
  5. Export results for comparison and analysis
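
The exact configuration schema, CLI flags, and Python entry points are defined by the NeMo Evaluator documentation and are not reproduced here. The sketch below only illustrates steps 1 and 2 under stated assumptions: the model under test sits behind an OpenAI-compatible endpoint (the URL is a placeholder), and the config layout (backend, model, tasks with per-task overrides) is a hypothetical shape rather than the SDK's real schema. Launching, monitoring, and export (steps 3 to 5) then go through the launcher CLI or Python API as the docs describe.

    # Hypothetical config shape for steps 1-2; the real schema lives in the SDK docs.
    import json
    import urllib.request

    config = {
        "backend": "local",                                      # Docker on this machine
        "model": {"name": "my-model",
                  "endpoint": "http://localhost:8000/v1"},       # placeholder URL
        "tasks": [
            {"name": "mmlu"},
            {"name": "gsm8k", "overrides": {"num_fewshot": 5}},  # per-task override
        ],
    }

    # Sanity-check the OpenAI-compatible endpoint before queuing a long job.
    url = config["model"]["endpoint"] + "/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        print("endpoint reachable:", resp.status)

    # Write the config to disk; step 3 hands it to the launcher.
    with open("eval_config.json", "w") as f:
        json.dump(config, f, indent=2)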

Practices

  • Benchmarking
  • LLM Evaluation
  • Reproducible Computing
  • Distributed Systems

Prerequisites

  • Docker installed and running (for local execution)
  • SSH access to Slurm cluster (for Slurm execution)
  • NGC API Key (for container pulls and NVIDIA services)
  • HF_TOKEN (required by some benchmarks; see the preflight check below)
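
A short preflight check catches missing credentials before a long job is queued. HF_TOKEN is the standard Hugging Face variable; NGC_API_KEY is a commonly used name for the NGC key but is an assumption here, so confirm the exact names your deployment expects.

    # Preflight: fail fast if credentials or Docker are missing.
    # NGC_API_KEY is an assumed variable name; HF_TOKEN is only needed by some benchmarks.
    import os
    import shutil
    import sys

    if not os.environ.get("NGC_API_KEY"):
        sys.exit("Missing required environment variable: NGC_API_KEY")

    if not os.environ.get("HF_TOKEN"):
        print("Warning: HF_TOKEN is not set; benchmarks that need it will fail.")

    if shutil.which("docker") is None:
        print("Warning: docker not found on PATH; local execution will not work.")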

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. It needs Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, …), and it assumes the repository follows the agentskills.io format.

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar Extensions

NeMo Evaluator SDK

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

Azure Container Registry SDK for Python

100

Azure Container Registry SDK for Python. Use for managing container images, artifacts, and repositories. Triggers: "azure-containerregistry", "ContainerRegistryClient", "container images", "docker registry", "ACR".

Skill
microsoft

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Evaluating LLMs Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

LM Evaluation Harness

98

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research
