Evaluating Code Models

Skill: Active

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Purpose

To provide a standardized and reproducible method for evaluating and benchmarking code generation models using industry-standard datasets and metrics.

Features

Evaluates code generation models
Supports HumanEval, MBPP, MultiPL-E, and 15+ benchmarks
Measures pass@k metrics
Provides multi-language evaluation
Includes instruction-tuned model evaluation
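
The pass@k metric listed above is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The following is a minimal sketch of that estimator, not the skill's own code; the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` result format are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair each."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is about 0.3. Generating more samples than k (n > k) reduces the variance of the estimate.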

Use cases

Benchmarking code generation models
Comparing the coding abilities of different models
Testing multi-language code generation support
Measuring the functional correctness and quality of generated code
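
Functional correctness in benchmarks such as HumanEval and MBPP means that generated code passes the problem's unit tests when executed. The sketch below illustrates the idea only; `passes_tests` is a hypothetical helper, not part of any harness, and it runs code in the current process with no sandbox or timeout. Evaluation tools typically isolate execution, so treat this as a toy for trusted inputs.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the assert-based tests run cleanly against the candidate.

    Hypothetical helper for illustration: no sandboxing, no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the generated function(s)
        exec(test_src, namespace)       # run the tests against them
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True


# Three hypothetical model samples for one problem: correct, wrong, malformed.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b) return a + b\n",
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

outcomes = [passes_tests(src, tests) for src in candidates]
num_passed = sum(outcomes)  # the per-problem pass count used by pass@k
```

Here only the first candidate passes, so `num_passed` is 1 out of 3 samples.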

Non-goals

General LLM benchmarking (e.g., MMLU, GSM8K)
Real-world issue resolution (e.g., SWE-bench)
Code understanding tasks (e.g., CodeXGLUE)
Evaluating code efficiency independently of functional correctness

Trust

Warning: Issues attention. With 17 issues opened and 4 closed in the last 90 days, the closure rate is below 50%, indicating slower attention to issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality score

95/100

Analyzed 1 day ago

Trust signals

Last commit: 1 day ago

GitHub owner: davila7

Stars: 27.2k

Downloads: 23k

License: MIT

Website: aitmpl.com

Similar extensions

BigCode Evaluation Harness

Evaluating Llms Harness

Nemo Evaluator Sdk

Lm Evaluation Harness

Benchmark

Social Media Analyzer