evaluating-llms-harness Guide

Name: evaluating-llms-harness
Author: zechenzhangagi

Evaluates LLMs across 60+ academic benchmarks, including MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag. Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. It is an industry standard used by EleutherAI, HuggingFace, and major labs, and it supports HuggingFace, vLLM, and API-based model backends.

8,991 stars, by zechenzhangagi

When to use evaluating-llms-harness

How to use evaluating-llms-harness

evaluating-llms-harness is a Claude skill in the SKILL.md format. Add it to your Claude environment from the source repository below. Once added, it is available as a user-invocable skill and activates when your task matches its description.
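As a rough illustration of the kind of evaluation run this skill is meant to help with, the sketch below assembles a command line for the `lm_eval` CLI from lm-evaluation-harness. It is not taken from the skill itself. The helper name `build_lm_eval_command`, the model `EleutherAI/pythia-160m`, the task selection, and the `results` output directory are all assumptions chosen for the example. The sketch only builds and prints the command, so it runs without the harness installed.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, num_fewshot=None,
                          batch_size=8, output_path=None):
    """Build an argv list for the lm-evaluation-harness CLI (`lm_eval`).

    Hypothetical helper for illustration; it only assembles arguments
    and does not run anything.
    """
    if not tasks:
        raise ValueError("at least one task is required")

    cmd = [
        "lm_eval",
        "--model", "hf",                               # HuggingFace backend
        "--model_args", f"pretrained={pretrained}",    # model to load
        "--tasks", ",".join(tasks),                    # comma-separated task names
        "--batch_size", str(batch_size),
    ]
    # Optional settings are appended only when explicitly requested.
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


# Assumed example values: a small public model and two benchmark tasks.
command = build_lm_eval_command(
    "EleutherAI/pythia-160m",
    ["hellaswag", "gsm8k"],
    num_fewshot=5,
    output_path="results",
)
print(shlex.join(command))
```

The printed line can be pasted into a shell where lm-evaluation-harness is installed, or the list can be passed to `subprocess.run(command, check=True)`. Keeping the arguments as a list avoids shell-quoting problems when model paths contain unusual characters.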

Skill source

https://raw.githubusercontent.com/zechenzhangagi/ai-research-skills/main/11-evaluation/lm-evaluation-harness/SKILL.md

Details

Platform: Claude
Category: AI & ML
Invocation: user-invocable
Model: any
Maintainer: zechenzhangagi
License: MIT
