
NeMo Evaluator SDK

Skill · Verified · Active

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Purpose

To provide a scalable and reproducible platform for evaluating LLMs against a wide range of benchmarks, supporting enterprise needs for benchmarking on various computing infrastructures.

Features

  • Evaluate LLMs across 100+ benchmarks
  • Supports 18+ evaluation harnesses (MMLU, HumanEval, VLM, safety)
  • Multi-backend execution (Docker, Slurm, Cloud)
  • Reproducible containerized evaluation
  • Enterprise-grade platform with result export (MLflow, W&B)

Use Cases

  • Running scalable LLM evaluations on local Docker instances
  • Benchmarking LLMs on Slurm HPC clusters
  • Comparing multiple models on standard academic and industry benchmarks
  • Ensuring reproducible LLM evaluations through containerization

Non-Goals

  • Training or fine-tuning LLMs
  • Providing raw model APIs
  • General-purpose code generation or analysis beyond benchmark tasks

Workflow

  1. Configure evaluation parameters (execution backend, model endpoint, tasks)
  2. Select benchmarks and optionally override parameters per task
  3. Launch evaluation via CLI or Python API
  4. Monitor job status and retrieve results
  5. Export results for comparison and analysis
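The five steps above can be sketched in Python. This is a minimal illustration of how a backend, model endpoint, and per-task overrides might fit together; the configuration keys and the `launch` helper are assumptions for illustration, not the SDK's actual API.

```python
# Hypothetical sketch of the workflow above. Key names ("execution",
# "target", "tasks") and the launch() helper are illustrative only;
# consult the NeMo Evaluator documentation for the real interface.
evaluation = {
    "execution": {"backend": "docker"},                   # step 1: execution backend
    "target": {"endpoint": "http://localhost:8000/v1"},   # step 1: model endpoint
    "tasks": [
        {"name": "mmlu"},                                 # step 2: selected benchmark
        {"name": "gsm8k", "params": {"num_fewshot": 5}},  # step 2: per-task override
    ],
}

def launch(config: dict) -> str:
    """Placeholder for the CLI / Python launch in step 3."""
    backend = config["execution"]["backend"]
    return f"submitted {len(config['tasks'])} task(s) to {backend}"

print(launch(evaluation))
```

After launch, steps 4 and 5 (monitoring and export to MLflow or W&B) would operate on the job handle the real launcher returns.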

Practices

  • Benchmarking
  • LLM Evaluation
  • Reproducible Computing
  • Distributed Systems

Prerequisites

  • Docker installed and running (for local execution)
  • SSH access to Slurm cluster (for Slurm execution)
  • NGC API Key (for container pulls and NVIDIA services)
  • HF_TOKEN (for some benchmarks)
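A quick pre-flight check for the credentials listed above can save a failed run. The sketch below assumes the variable names `NGC_API_KEY` and `HF_TOKEN` as given in the prerequisites; adjust them to match your deployment.

```python
import os

# Variable names taken from the prerequisites above.
REQUIRED = ["NGC_API_KEY"]   # container pulls and NVIDIA services
OPTIONAL = ["HF_TOKEN"]      # only some benchmarks need this

def check_env(required=REQUIRED, optional=OPTIONAL):
    """Return the list of required environment variables that are unset."""
    missing = [name for name in required if not os.environ.get(name)]
    for name in optional:
        if not os.environ.get(name):
            print(f"note: optional variable {name} is unset")
    return missing

missing = check_env()
if missing:
    print("missing required environment variables:", ", ".join(missing))
```

Run this before launching an evaluation; an empty return value means all required variables are set.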

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one skills-compatible agent installed (Claude Code, Cursor, Codex, …). Assumes the repository follows the agentskills.io format.

Quality Score

Verified
98/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT
Status
View source code

Similar Extensions

NeMo Evaluator SDK

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

Azure Container Registry SDK for Python

100

Azure Container Registry SDK for Python. Use for managing container images, artifacts, and repositories. Triggers: "azure-containerregistry", "ContainerRegistryClient", "container images", "docker registry", "ACR".

Skill
microsoft

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Evaluating LLMs Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

LM Evaluation Harness

98

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research