evaluating-code-models Guide

Name: evaluating-code-models
Author: zechenzhangagi

Evaluates code generation models with pass@k metrics on HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks. Use it when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. It builds on the BigCode Project's evaluation harness, an industry standard used by Hugging Face leaderboards.

8,991 stars, by zechenzhangagi
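pass@k is the probability that at least one of k sampled completions for a problem passes its unit tests. The sketch below implements the unbiased estimator from the HumanEval paper (Chen et al., 2021): sample n >= k completions per problem, count the c that pass, compute 1 - C(n-c, k)/C(n, k), and average over problems. It illustrates the metric and is not the harness's own code; `pass_at_k`, `mean_pass_at_k`, and the (n, c) counts in the usage line are hypothetical.

```python
from math import prod
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: completions sampled, c: completions that pass the tests, k: sample budget.
    The ratio is computed as a running product so no large binomials are formed.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Any k samples must include at least one passing completion.
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def mean_pass_at_k(counts: Sequence[Tuple[int, int]], k: int) -> float:
    """Benchmark score: mean of per-problem pass@k over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical counts for three problems, 10 samples each.
print(round(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1), 4))
```

The product form avoids computing huge binomial coefficients when n is large (for example, 200 samples per problem).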

When to use evaluating-code-models

How to use evaluating-code-models

evaluating-code-models is a Claude skill in the SKILL.md format. Add it to your Claude environment from the source repository below; once installed, it is available as a user-invocable skill and is used when your task matches its description.
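A SKILL.md file opens with a YAML frontmatter block delimited by `---` lines, holding fields such as `name` and `description`; the description is what a task is matched against. The sketch below reads those fields using only the standard library. It assumes simple single-line `key: value` pairs; `read_frontmatter` is a hypothetical helper, and the sample text is illustrative rather than the skill's actual file. For multi-line values or lists, use a real YAML parser instead.

```python
def read_frontmatter(text: str) -> dict:
    """Return simple `key: value` pairs from a leading `---`-delimited block.

    Sketch only: multi-line values, lists, and other YAML features are not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter at the top of the file
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Split on the first colon only, so descriptions may contain colons.
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # unterminated block: treat as no frontmatter


# Illustrative input; the real SKILL.md may contain more fields.
sample = "---\nname: evaluating-code-models\ndescription: Evaluates code generation models.\n---\n# Body\n"
print(read_frontmatter(sample)["description"])
```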

Skill source

https://raw.githubusercontent.com/zechenzhangagi/ai-research-skills/main/11-evaluation/bigcode-evaluation-harness/SKILL.md
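One way to add the skill by hand is to download that file into a local skills directory. The sketch below assumes Claude Code's personal skills location, `~/.claude/skills/<skill-name>/SKILL.md`; other Claude environments install skills differently, so treat the path as an assumption. `skill_destination` and `install_skill` are hypothetical helpers. Only SKILL.md is fetched, so if the skill ships supporting files, clone the repository instead.

```python
from pathlib import Path
from typing import Optional
from urllib.request import urlopen

# Raw URL of the skill file, as given in the listing.
SKILL_URL = (
    "https://raw.githubusercontent.com/zechenzhangagi/ai-research-skills/"
    "main/11-evaluation/bigcode-evaluation-harness/SKILL.md"
)


def skill_destination(name: str, root: Optional[Path] = None) -> Path:
    """Return the target path for SKILL.md.

    Assumption: personal skills live in ~/.claude/skills/<name>/ (Claude Code layout).
    """
    base = root if root is not None else Path.home() / ".claude" / "skills"
    return base / name / "SKILL.md"


def install_skill(url: str, name: str, root: Optional[Path] = None) -> Path:
    """Download SKILL.md from `url` and write it into the skill directory."""
    dest = skill_destination(name, root)
    dest.parent.mkdir(parents=True, exist_ok=True)  # create <root>/<name>/ if missing
    with urlopen(url, timeout=30) as resp:
        dest.write_bytes(resp.read())
    return dest


# Usage (performs a network request):
# install_skill(SKILL_URL, "evaluating-code-models")
```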

Details

Platform: Claude
Category: AI & ML
Invocation: user-invocable
Model: any
Maintainer: zechenzhangagi
License: MIT
