Evaluating Llms Harness
技能 已验证 活跃Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
To provide a standardized, reproducible, and comprehensive framework for evaluating the quality and capabilities of Large Language Models using established academic benchmarks.
功能
- Evaluates LLMs across 60+ academic benchmarks
- Supports HuggingFace, vLLM, and API-based models
- Provides detailed CLI and workflow examples
- Facilitates model comparison and training progress tracking
- Includes guidance on distributed evaluation and cost management
使用场景
- Benchmarking model quality for research or deployment
- Comparing the performance of different LLMs
- Reporting standardized academic results
- Tracking the progress of LLM training
非目标
- Fine-tuning LLMs
- Deploying LLMs
- General-purpose code analysis or debugging
- Evaluating non-LLM AI models
Trust
- info:Issues AttentionIn the last 90 days, 17 issues were opened and 4 were closed, indicating maintainers are active but response times may be slow for some issues.
安装
npx skills add davila7/claude-code-templates通过 npx 运行 Vercel skills CLI(skills.sh)— 需要本地安装 Node.js,以及至少一个兼容 skills 的智能体(Claude Code、Cursor、Codex 等)。前提是仓库遵循 agentskills.io 格式。
质量评分
已验证类似扩展
Lm Evaluation Harness
98Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Context Compression
100This skill should be used when the user asks to "compress context", "summarize conversation history", "implement compaction", "reduce token usage", or mentions context compression, structured summarization, tokens-per-task optimization, or long-running agent sessions exceeding context limits.
Scholar Evaluation
98Systematically evaluate scholarly work using the ScholarEval framework, providing structured assessment across research quality dimensions including problem formulation, methodology, analysis, and writing with quantitative scoring and actionable feedback.
Nemo Evaluator Sdk
98Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
BigCode Evaluation Harness
98Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.
Literature Review
100Conduct comprehensive, systematic literature reviews using multiple academic databases (PubMed, arXiv, bioRxiv, Semantic Scholar, etc.). This skill should be used when conducting systematic literature reviews, meta-analyses, research synthesis, or comprehensive literature searches across biomedical, scientific, and technical domains. Creates professionally formatted markdown documents and PDFs with verified citations in multiple citation styles (APA, Nature, Vancouver, etc.).