BigCode Evaluation Harness

Skill Verified Active

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. A widely used harness from the BigCode Project, also used by Hugging Face leaderboards.

Purpose

Provides a standardized, reproducible method for evaluating the code generation capabilities of AI models.

Features

  • Evaluates code generation models
  • Supports HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks
  • Measures performance using pass@k metrics
  • Includes multi-language evaluation (18 languages)
  • Provides examples for common workflows and model configurations
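
The pass@k metric mentioned above can be illustrated with a short sketch. It assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. Whether the harness computes the metric in exactly this form is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative numbers only: three problems, 10 samples each.
    print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, which is why pass@1 is often read as a plain success rate.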

Use cases

  • Benchmarking new or existing code generation models
  • Comparing the coding abilities of different AI models
  • Testing multi-language support in code generation
  • Measuring the functional correctness and quality of AI-generated code

Non-goals

  • Evaluating general LLM capabilities beyond code generation
  • Performing code analysis or linting
  • Executing arbitrary user code outside of defined benchmark tasks

Workflow

  1. Choose benchmark suite (HumanEval, MBPP, MultiPL-E, etc.)
  2. Configure model and generation parameters (e.g., temperature, n_samples)
  3. Run evaluation using `accelerate launch main.py`
  4. Analyze generated metrics (pass@k results) from output files
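
Steps 1 to 3 can be sketched as a small helper that assembles the launch command without running it. The flag names (`--model`, `--tasks`, `--temperature`, `--n_samples`, `--allow_code_execution`) reflect the upstream harness as best understood here and should be checked against `main.py --help`. The model identifier and the helper name `build_eval_command` are hypothetical.

```python
import shlex
from typing import List


def build_eval_command(
    model: str,
    tasks: str,
    temperature: float = 0.2,
    n_samples: int = 1,
    allow_code_execution: bool = False,
) -> List[str]:
    """Assemble an argv list for `accelerate launch main.py` (flags assumed)."""
    cmd = [
        "accelerate", "launch", "main.py",
        "--model", model,
        "--tasks", tasks,
        "--temperature", str(temperature),
        "--n_samples", str(n_samples),
    ]
    if allow_code_execution:
        # Benchmarks that run generated code need explicit opt-in (assumed flag).
        cmd.append("--allow_code_execution")
    return cmd


if __name__ == "__main__":
    # "your-org/your-code-model" is a placeholder, not a real checkpoint.
    argv = build_eval_command(
        "your-org/your-code-model", "humaneval",
        temperature=0.2, n_samples=20, allow_code_execution=True,
    )
    # Print a shell-safe string for review; pass argv to subprocess.run to execute.
    print(" ".join(shlex.quote(part) for part in argv))
```

Building the command as a list rather than a single string avoids quoting problems when a model path contains spaces, and makes it straightforward to log the exact configuration next to the resulting metrics.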

Prerequisites

  • Python 3.7.10 (for DS-1000)
  • Docker (for MultiPL-E evaluation)
  • CUDA-enabled GPU (recommended for model inference)
  • PyTorch (specific versions may be required)
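
A quick local check of the prerequisites listed above might look like the following sketch. It only reports what it finds and does not enforce any version. The helper name `check_prerequisites` and the report keys are hypothetical, and the CUDA check assumes PyTorch is importable.

```python
import importlib.util
import shutil
import sys
from typing import Dict, Union


def check_prerequisites() -> Dict[str, Union[str, bool]]:
    """Report the Python version and whether Docker, PyTorch, and CUDA are usable."""
    report: Dict[str, Union[str, bool]] = {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Docker is only needed for MultiPL-E evaluation.
        "docker_on_path": shutil.which("docker") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is reported as "no CUDA" rather than crashing.
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```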

Installation

Add the marketplace first

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality score

Verified
98/100
Analyzed 1 day ago

Trust signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT

Similar extensions

Evaluating Llms Harness

99

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
davila7

Nemo Evaluator Sdk

98

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill
Orchestra-Research

Lm Evaluation Harness

98

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill
Orchestra-Research

Evaluating Code Models

95

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Skill
davila7

Benchmark

100

Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".

Skill
garrytan

Social Media Analyzer

100

Social media campaign analysis and performance tracking. Calculates engagement rates, ROI, and benchmarks across platforms. Use for analyzing social media performance, calculating engagement rate, measuring campaign ROI, comparing platform metrics, or benchmarking against industry standards.

Skill
alirezarezvani