
Evaluating LLMs Harness

Skill · Verified · Active

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Purpose

To provide a standardized, reproducible, and comprehensive framework for evaluating the quality and capabilities of Large Language Models using established academic benchmarks.

Features

  • Evaluates LLMs across 60+ academic benchmarks
  • Supports HuggingFace, vLLM, and API-based models
  • Provides detailed CLI and workflow examples
  • Facilitates model comparison and training progress tracking
  • Includes guidance on distributed evaluation and cost management

Use Cases

  • Benchmarking model quality for research or deployment
  • Comparing the performance of different LLMs
  • Reporting standardized academic results
  • Tracking the progress of LLM training

Non-Goals

  • Fine-tuning LLMs
  • Deploying LLMs
  • General-purpose code analysis or debugging
  • Evaluating non-LLM AI models

Trust

  • Issues attention: in the last 90 days, 17 issues were opened and 4 were closed, indicating that maintainers are active but response times may be slow for some issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
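Once installed, the skill wraps EleutherAI's lm-evaluation-harness. A minimal usage sketch of its `lm_eval` CLI (the model name and task selection here are illustrative placeholders, not mandated by this skill):

```shell
# Evaluate a HuggingFace model on two common benchmarks.
# --model hf        : use the HuggingFace Transformers backend
# --model_args      : pass the checkpoint to load
# --tasks           : comma-separated benchmark names
# --batch_size auto : let the harness pick a batch size that fits memory
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag,gsm8k \
  --batch_size auto \
  --output_path results/
```

Results are written as JSON under `--output_path`, which makes it straightforward to diff runs when comparing models or tracking training progress.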

Quality Score

Verified
99/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT
Status
View source code

Similar Extensions

LM Evaluation Harness

98

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
Orchestra-Research

Context Compression

100

This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.

Skill
muratcankoylan

Scholar Evaluation

98

Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and writing with quantitative scoring and actionable feedback.

Skill
K-Dense-AI

NeMo Evaluator SDK

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

BigCode Evaluation Harness

98

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
Orchestra-Research

Literature Review

100

Conduct comprehensive, systematic literature reviews using multiple academic databases (PubMed, arXiv, bioRxiv, Semantic Scholar, etc.). This skill should be used when conducting systematic literature reviews, meta-analyses, research synthesis, or comprehensive literature searches across biomedical, scientific, and technical domains. Creates professionally formatted markdown documents and PDFs with verified citations in multiple citation styles (APA, Nature, Vancouver, etc.).

Skill
K-Dense-AI