Is Claude Dumb Today?

Daily HumanEvalPlus-CC164 benchmark for Claude Code (Opus 4.8)

Score

—

Model

—

Cost

—

Runtime

—

Score History (last 90 runs)

Where the models disagree

Tasks where Opus 4.8, 4.7, and 4.6 have different pass rates over recent paired runs. Green = always passes, red = always fails. Spread is the gap between the highest and lowest per-model pass rate on that task; a high spread indicates a real tradeoff between models rather than run-to-run noise. Historical divergences include HumanEval/97 (Python signed-modulo quirk) and HumanEval/141 (Unicode .isalpha() vs. a literal a–z range).
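
As a rough illustration of the spread metric and the two Python behaviors named above, here is a minimal sketch. The pass counts are invented for the example, and the helper names (pass_rate, spread) are hypothetical; the dashboard's own computation, and whether it reports spread as a fraction or in percentage points, is not shown on this page.

```python
import re


def pass_rate(passes: int, runs: int) -> float:
    """Fraction of runs that passed; 0.0 when there are no runs."""
    return passes / runs if runs else 0.0


def spread(rates: dict[str, float]) -> float:
    """Gap between the best and worst pass rate on a single task."""
    return max(rates.values()) - min(rates.values())


# Illustrative numbers only, not taken from the dashboard.
example = {
    "Opus 4.8": pass_rate(10, 10),  # always passes
    "Opus 4.7": pass_rate(5, 10),   # passes half the time
    "Opus 4.6": pass_rate(0, 10),   # always fails
}
example_spread = spread(example)  # 1.0 - 0.0 = 1.0, the maximum possible

# Signed modulo: Python's % takes the sign of the divisor, so the
# result for a negative operand is not the number's last digit.
last_digit_naive = -14 % 10     # 6
last_digit_abs = abs(-14) % 10  # 4

# str.isalpha() accepts any Unicode letter, while an explicit
# character class only accepts the ASCII letters it lists.
unicode_is_alpha = "é".isalpha()              # True
ascii_match = re.fullmatch(r"[a-zA-Z]", "é")  # None
```

Under this definition, a task that one model always passes and another always fails has the largest possible spread, while a task where all models pass at the same rate has a spread of zero and would not appear as a disagreement.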

Task	Opus 4.8	Opus 4.7	Opus 4.6	Spread
Loading…

Per-Task Results (latest run)

Task	Function	Result	Base	EvalPlus	Attempts	Turns	Cost	Error
Loading…