Evaluating Code Models

Skill (active)

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

Purpose

To provide a standardized and reproducible method for evaluating and benchmarking code generation models using industry-standard datasets and metrics.

Features

Evaluates code generation models
Supports HumanEval, MBPP, MultiPL-E, and 15+ benchmarks
Measures pass@k metrics
Provides multi-language evaluation
Includes instruction-tuned model evaluation
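
The pass@k metric listed above is commonly computed with the unbiased estimator introduced alongside HumanEval: with n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over all problems. The sketch below only illustrates that formula; the function names are hypothetical and the code is not taken from this skill's source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 10 samples and 3 passing gives a pass@1 estimate of 0.3, which equals the plain pass rate c / n. Larger k requires generating at least k samples per problem.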

Use cases

Benchmarking code generation models
Comparing the coding abilities of different models
Testing multi-language code generation support
Measuring the functional correctness and quality of generated code
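
Functional correctness, as used in the list above, means that a generated solution is executed against unit tests rather than compared to a reference text. The sketch below shows the idea in its simplest form; `passes_tests` is a hypothetical helper, not part of this skill, and it assumes trusted input. Real harnesses run untrusted model output in a sandbox with timeouts, which this sketch does not do.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and all test assertions hold.

    WARNING: exec() runs arbitrary code in this process. This is an
    illustration only; it has no sandboxing and no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the generated function(s)
        exec(test_src, namespace)       # run assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failure.
        return False
    return True


# Two hypothetical model completions for the same prompt.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(src, tests) for src in (good, bad)]
```

The number of True entries in `results` is the count of passing samples that feeds into a pass@k calculation.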

Non-goals

General LLM benchmarking (e.g., MMLU, GSM8K)
Real-world issue resolution (e.g., SWE-bench)
Code understanding tasks (e.g., CodeXGLUE)
Evaluating code efficiency without regard to functional correctness

Trust

Warning: Issues Attention. With 17 issues opened and 4 closed in the last 90 days, the closure rate is below 50%, indicating slower attention to issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.

Quality score

95/100

Analyzed 1 day ago

Trust signals

Last commit: 1 day ago

GitHub owner: davila7

Stars: 27.2k

Downloads: 23k

License: MIT

Website: aitmpl.com

Similar extensions

BigCode Evaluation Harness

Skill

Orchestra-Research

Evaluating Llms Harness

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

Skill

davila7

Nemo Evaluator Sdk

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Skill

Orchestra-Research

Lm Evaluation Harness

Skill

Orchestra-Research

Benchmark

100

Performance regression detection using the browse daemon. Establishes baselines for page load times, Core Web Vitals, and resource sizes. Compares before/after on every PR. Tracks performance trends over time. Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time". (gstack) Voice triggers (speech-to-text aliases): "speed test", "check performance".

Skill

garrytan

Social Media Analyzer

100

Social media campaign analysis and performance tracking. Calculates engagement rates, ROI, and benchmarks across platforms. Use for analyzing social media performance, calculating engagement rate, measuring campaign ROI, comparing platform metrics, or benchmarking against industry standards.

Skill

alirezarezvani