Evaluating Code Models
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project, used by HuggingFace leaderboards.
This skill provides a standardized and reproducible way to evaluate and benchmark code generation models using industry-standard datasets and metrics.
Features
- Evaluates code generation models
- Supports HumanEval, MBPP, MultiPL-E, and 15+ benchmarks
- Measures pass@k metrics (see the estimator sketch after this list)
- Provides multi-language evaluation
- Includes instruction-tuned model evaluation
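pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval: given n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator (the function name and example counts below are illustrative, not part of the harness API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:
        return 1.0  # every possible draw of k samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass the tests
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185 (equals c / n when k=1)
print(round(pass_at_k(n=200, c=37, k=10), 3))  # noticeably higher than pass@1
```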
Use Cases
- Benchmarking code generation models
- Comparing the coding abilities of different models (see the aggregation sketch after this list)
- Testing multi-language code generation support
- Measuring the functional correctness and quality of generated code
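For the model-comparison use case, per-problem estimates are averaged over the benchmark to give one score per model. A hedged sketch of that aggregation; the per-problem counts below are invented for illustration, and the harness itself writes its metrics to JSON result files:

```python
from statistics import mean

# Hypothetical per-problem counts: (n_samples, n_correct) for each task.
results = {
    "model-a": [(20, 12), (20, 5), (20, 0)],
    "model-b": [(20, 15), (20, 9), (20, 2)],
}

for model, per_problem in results.items():
    # For k=1 the unbiased estimator reduces to the fraction of passing samples.
    score = mean(c / n for n, c in per_problem)
    print(f"{model}: pass@1 = {score:.3f}")
```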
Non-Goals
- General LLM benchmarking (e.g., MMLU, GSM8K)
- Real-world issue resolution (e.g., SWE-bench)
- Code understanding tasks (e.g., CodeXGLUE)
- Evaluating code efficiency independently of functional correctness
Trust
- Warning: with 17 issues opened and only 4 closed in the last 90 days, the closure rate is below 50%, indicating slower attention to issues.
Installation
npx skills add davila7/claude-code-templates
This runs the Vercel skills CLI (skills.sh) via npx; it needs Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, …). It assumes the repo follows the agentskills.io format.
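After installation, the skill drives the BigCode evaluation harness, which is normally launched with accelerate from a checkout of the harness repository. A minimal sketch of such a launch via Python's subprocess; the model id is only an example, and the flag names should be verified against the harness version actually installed:

```python
import subprocess

# Illustrative launch of the BigCode evaluation harness on HumanEval.
# Flags follow the harness README; double-check them locally.
cmd = [
    "accelerate", "launch", "main.py",
    "--model", "bigcode/starcoderbase-1b",  # any Hugging Face model id
    "--tasks", "humaneval",
    "--n_samples", "20",
    "--temperature", "0.2",
    "--allow_code_execution",               # generated code is executed against unit tests
]
subprocess.run(cmd, check=True)
```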
Similar Extensions
BigCode Evaluation Harness (score 98)
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project, used by HuggingFace leaderboards.
Evaluating LLMs Harness (score 99)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API backends.
NeMo Evaluator SDK (score 98)
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with a container-first architecture for reproducible benchmarking.
LM Evaluation Harness (score 98)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API backends.
Benchmark (score 100)
Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".
Social Media Analyzer (score 100)
Social media campaign analysis and performance tracking. Calculates engagement rates, ROI, and benchmarks across platforms. Use for analyzing social media performance, calculating engagement rate, measuring campaign ROI, comparing platform metrics, or benchmarking against industry standards.