Claude
Code & Development
Trust: 55/100 (Fair)agent-evaluation Guide
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks
38,911 starsby sickn33
When to use agent-evaluation
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks
How to use agent-evaluation
agent-evaluation is a Claude skill in the SKILL.md format. Add it to your Claude environment from the source repository below, then it activates as a user-invocable skill when your task matches its description.
Details
PlatformClaude
CategoryCode & Development
Invocationuser-invocable
Modelany
Maintainersickn33
LicenseMIT