Evaluating Code Models
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project, used by HuggingFace leaderboards.
This skill provides a standardized and reproducible way to evaluate and benchmark code generation models using industry-standard datasets and metrics.
Features
- Evaluates code generation models
- Supports HumanEval, MBPP, MultiPL-E, and 15+ benchmarks
- Measures pass@k metrics (see the estimator sketch after this list)
- Provides multi-language evaluation
- Includes instruction-tuned model evaluation
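pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval: given n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator (the function name and example counts below are illustrative, not part of the harness API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:
        return 1.0  # every possible draw of k samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass the tests
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185 (equals c / n when k=1)
print(round(pass_at_k(n=200, c=37, k=10), 3))  # noticeably higher than pass@1
```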
Use Cases
- Benchmarking code generation models
- Comparing the coding abilities of different models (see the aggregation sketch after this list)
- Testing multi-language code generation support
- Measuring the functional correctness and quality of generated code
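For the model-comparison use case, per-problem estimates are averaged over the benchmark to give one score per model. A hedged sketch of that aggregation; the per-problem counts below are invented for illustration, and the harness itself writes its metrics to JSON result files:

```python
from statistics import mean

# Hypothetical per-problem counts: (n_samples, n_correct) for each task.
results = {
    "model-a": [(20, 12), (20, 5), (20, 0)],
    "model-b": [(20, 15), (20, 9), (20, 2)],
}

for model, per_problem in results.items():
    # For k=1 the unbiased estimator reduces to the fraction of passing samples.
    score = mean(c / n for n, c in per_problem)
    print(f"{model}: pass@1 = {score:.3f}")
```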
Non-Goals
- General LLM benchmarking (e.g., MMLU, GSM8K)
- Real-world issue resolution (e.g., SWE-bench)
- Code understanding tasks (e.g., CodeXGLUE)
- Evaluating code efficiency independently of functional correctness
Trust
- Warning: with 17 issues opened and only 4 closed in the last 90 days, the closure rate is below 50%, indicating slower attention to issues.
Installation
npx skills add davila7/claude-code-templates
This runs the Vercel skills CLI (skills.sh) via npx; it needs Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, …). It assumes the repo follows the agentskills.io format.
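After installation, the skill drives the BigCode evaluation harness, which is normally launched with accelerate from a checkout of the harness repository. A minimal sketch of such a launch via Python's subprocess; the model id is only an example, and the flag names should be verified against the harness version actually installed:

```python
import subprocess

# Illustrative launch of the BigCode evaluation harness on HumanEval.
# Flags follow the harness README; double-check them locally.
cmd = [
    "accelerate", "launch", "main.py",
    "--model", "bigcode/starcoderbase-1b",  # any Hugging Face model id
    "--tasks", "humaneval",
    "--n_samples", "20",
    "--temperature", "0.2",
    "--allow_code_execution",               # generated code is executed against unit tests
]
subprocess.run(cmd, check=True)
```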
Similar Extensions
BigCode Evaluation Harness (score 98)
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project, used by HuggingFace leaderboards.
Evaluating LLMs Harness (score 99)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API backends.
NeMo Evaluator SDK (score 98)
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with a container-first architecture for reproducible benchmarking.
LM Evaluation Harness (score 98)
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API backends.
Benchmark (score 100)
Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".
Social Media Analyzer (score 100)
Social media campaign analysis and performance tracking. Calculates engagement rates, ROI, and benchmarks across platforms. Use for analyzing social media performance, calculating engagement rate, measuring campaign ROI, comparing platform metrics, or benchmarking against industry standards.