
BigCode Evaluation Harness

Skill · Verified · Active

Part of: Agent Native Research Artifact (ARA) Tooling

Evaluates code generation models on HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks, reporting pass@k metrics. Use it when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. The harness comes from the BigCode Project, is widely treated as an industry standard, and is used by Hugging Face leaderboards.

Purpose

To provide a standardized and reproducible method for evaluating the code generation capabilities of AI models.

Features

Evaluates code generation models
Supports HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks
Measures performance using pass@k metrics
Includes multi-language evaluation (18 languages)
Provides examples for common workflows and model configurations
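
As an illustration of the pass@k metric listed above, here is a minimal sketch of the unbiased estimator commonly used with HumanEval-style benchmarks: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen for this example; the harness's own implementation may differ in detail.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the tests
    k: sampling budget being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging per-problem estimates, rather than pooling all samples, keeps each benchmark problem equally weighted regardless of how many samples were drawn for it.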

Use cases

Benchmarking new or existing code generation models
Comparing the coding abilities of different AI models
Testing multi-language support in code generation
Measuring the functional correctness and quality of AI-generated code
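
Comparing models usually comes down to lining up the same metric for the same task across several result files. The sketch below assumes each run produced a JSON file that maps a task name to a dictionary of metrics (for example a `pass@1` entry); that layout, the helper names `load_results` and `rank_models`, and any file paths passed in are assumptions made for this example and should be checked against the files the harness actually writes.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def load_results(path: str) -> dict:
    """Read one results file (assumed layout: {task: {metric: value}})."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def rank_models(
    results_by_model: Dict[str, dict],
    task: str,
    metric: str = "pass@1",
) -> List[Tuple[str, float]]:
    """Return (model, score) pairs for one task, best score first.

    Models whose results lack the task or metric are skipped rather
    than treated as scoring zero.
    """
    scores = []
    for model_name, results in results_by_model.items():
        value = results.get(task, {}).get(metric)
        if value is not None:
            scores.append((model_name, float(value)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A comparison is only meaningful when the runs share the same task, generation settings, and number of samples, so it helps to keep those settings next to each results file.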

Non-goals

Evaluating general LLM capabilities beyond code generation
Performing code analysis or linting
Executing arbitrary user code outside of defined benchmark tasks

Workflow

Choose benchmark suite (HumanEval, MBPP, MultiPL-E, etc.)
Configure model and generation parameters (e.g., temperature, n_samples)
Run evaluation using `accelerate launch main.py`
Analyze generated metrics (pass@k results) from output files
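
The workflow steps above can be sketched as a small Python helper that assembles the `accelerate launch main.py` command line. The flag names follow the harness's documented command-line options as best understood here and should be verified against the installed version; the helper `build_eval_command`, the default values, and the model name in the usage line are illustrative assumptions, not recommended settings.

```python
import shlex
from typing import List


def build_eval_command(
    model: str,
    tasks: str,
    n_samples: int = 10,
    temperature: float = 0.2,
    max_length_generation: int = 512,
    batch_size: int = 10,
    metric_output_path: str = "evaluation_results.json",
) -> List[str]:
    """Assemble the argument list for one evaluation run."""
    return [
        "accelerate", "launch", "main.py",
        "--model", model,
        "--tasks", tasks,
        "--max_length_generation", str(max_length_generation),
        "--temperature", str(temperature),
        "--do_sample", "True",
        "--n_samples", str(n_samples),
        "--batch_size", str(batch_size),
        # Benchmarks execute generated code, so this opt-in is required.
        "--allow_code_execution",
        "--save_generations",
        "--metric_output_path", metric_output_path,
    ]


if __name__ == "__main__":
    # Placeholder model name; substitute the checkpoint under test.
    command = build_eval_command("bigcode/starcoder", "humaneval")
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(command))
```

Building the command as a list rather than a string makes it safe to hand to `subprocess.run` later without shell quoting issues.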

Prerequisites

Python 3.7.10 (for DS-1000)
Docker (for MultiPL-E evaluation)
CUDA-enabled GPU (recommended for model inference)
PyTorch (specific versions may be required)
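
Before a long run it can help to confirm the prerequisites listed above. The following is a minimal sketch using only standard-library checks plus an optional PyTorch import; the function name `check_prerequisites` and the report keys are hypothetical, and the script only reports what it finds rather than enforcing specific versions.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which of the listed prerequisites are present."""
    report = {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # Docker is only needed for MultiPL-E evaluation.
        "docker_on_path": shutil.which("docker") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated as "no CUDA" here.
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```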

Installation

Add the marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs

/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified

98/100

Analyzed 1 day ago

Trust signals

Last commit: 17 days ago

GitHub owner: Orchestra-Research

Stars: 8.3k

Downloads: 0

License: MIT

Website: orchestra-research.com


Similar extensions

Evaluating Llms Harness

Nemo Evaluator Sdk

Lm Evaluation Harness

Evaluating Code Models

Benchmark

Social Media Analyzer