evaluating-llms-harness

Community

Evaluates LLMs across 60+ academic benchmarks, including MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag. Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. An industry standard used by EleutherAI, HuggingFace, and other major labs. Supports HuggingFace, vLLM, and API-based model backends.
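As a rough illustration of what a benchmark run might look like, the sketch below assembles an `lm_eval` command line for the underlying lm-evaluation-harness. It only builds and prints the command; it does not run an evaluation. The helper name `build_lm_eval_command`, the example model `EleutherAI/pythia-160m`, and the chosen tasks are assumptions made for illustration and do not come from this listing.

```python
import shlex


def build_lm_eval_command(model_backend, pretrained, tasks, batch_size=8, output_path=None):
    """Hypothetical helper: assemble an lm_eval CLI invocation as an argument list."""
    cmd = [
        "lm_eval",
        "--model", model_backend,                    # e.g. "hf" or "vllm"
        "--model_args", f"pretrained={pretrained}",  # checkpoint to evaluate
        "--tasks", ",".join(tasks),                  # comma-separated task names
        "--batch_size", str(batch_size),
    ]
    if output_path is not None:
        # Optional: directory or file where the harness writes its results.
        cmd += ["--output_path", output_path]
    return cmd


# Assumed example values; swap in the model and benchmarks you actually need.
command = build_lm_eval_command("hf", "EleutherAI/pythia-160m", ["hellaswag", "gsm8k"])
print(shlex.join(command))
```

The printed string can be pasted into a shell where the harness is installed, or the list can be passed to `subprocess.run` directly.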

Claude

8,991 stars. Updated 1 month ago.

Allowed Tools

This skill does not declare a tool allowlist. The agent host applies whatever default tools are available at runtime.

Source

SKILL.md / Manifest

https://raw.githubusercontent.com/zechenzhangagi/ai-research-skills/main/11-evaluation/lm-evaluation-harness/SKILL.md

Registry

github (via claudemarketplaces.com)

Trust Score

53 (Fair)

Verification: 10/30
