What’s the opposite of benchmark maxing?
I’ve been looking at a pattern that kept showing up when I dug into benchmark failures during training: the model’s reasoning often looked better to me in conversation, but the benchmark scores were improving only slightly or even declining.
So I started adding short reasoning prompts to the benchmark questions. What I found is that a model can be scored as wrong while still demonstrating the kind of reasoning you’d actually want in the real world.
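To make this concrete, here is a minimal sketch of the kind of prompt wrapper I mean. The exact wording and the `Answer:` convention are illustrative assumptions, not the precise prompt I used:

```python
REASONING_SUFFIX = (
    "\n\nBefore giving your final answer, briefly explain your reasoning, "
    "then state the answer on its own line as 'Answer: <choice>'."
)

def with_reasoning_prompt(question: str) -> str:
    """Wrap a benchmark question so the model shows its work first.

    Any short elicitation works; the point is to surface the chain of
    reasoning alongside the final choice so the two can be scored separately.
    """
    return question.rstrip() + REASONING_SUFFIX
```

The win is that the transcript now contains both the reasoning and the answer, so a "wrong" final choice can still be inspected for whether the reasoning behind it was sound.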
This post summarizes an analysis across several common benchmarks where the model’s final answer disagreed with the expected one, but the reasoning was still coherent and often plausible even when it didn’t match the gold label.
What I analyzed
Analysis date: January 3, 2026
Training step: 50,000
Focus: “Wrong” answers where the reasoning still looks valid or meaningfully grounded
How reasoning quality is scored
I didn’t treat this as a “scientific” metric. It’s a simple filter to separate usable reasoning from junk.
I counted an item as good reasoning when it met all of the following:
Relevant: the reasoning stays on the topic of the question (often with some keyword overlap).
Coherent: it has recognizable structure (not random tokens) and is at least ~20 characters.
Not overly repetitive: repeated-word loops are flagged and treated as a negative signal.
Enough substance: longer explanations are generally better, but only if they aren’t repetitive.
Threshold used in this analysis: I counted reasoning as “good” when it cleared a simple quality threshold (> 0.5 on my internal heuristic score).
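The four checks above can be sketched as a single scoring function. Everything here is an illustrative reconstruction of the heuristic described in the list, not my exact implementation; the weights, the 30% repetition cutoff, and the ~200-character saturation point are assumptions:

```python
import re
from collections import Counter

def reasoning_quality(question: str, reasoning: str) -> float:
    """Heuristic score in [0, 1]; higher means more usable reasoning.

    Combines the four checks from the post: relevance (keyword overlap
    with the question), coherence (minimum length), repetition (one
    token dominating the text is penalized), and substance (longer is
    better, saturating). Weights here are illustrative.
    """
    text = reasoning.strip()
    if len(text) < 20:  # coherence floor: too short to be real reasoning
        return 0.0

    words = re.findall(r"[a-z']+", text.lower())
    q_words = set(re.findall(r"[a-z']+", question.lower()))

    # Relevance: fraction of question keywords echoed in the reasoning.
    overlap = len(q_words & set(words)) / max(len(q_words), 1)

    # Repetition: flag repeated-word loops (one token dominating the text).
    counts = Counter(words)
    top_frac = counts.most_common(1)[0][1] / len(words) if words else 1.0
    repetitive = top_frac > 0.3

    # Substance: longer explanations score higher, saturating at ~200 chars.
    substance = min(len(text) / 200.0, 1.0)

    score = 0.5 * overlap + 0.5 * substance
    if repetitive:
        score *= 0.3  # heavy penalty for repeated-word loops
    return score

def is_good_reasoning(question: str, reasoning: str) -> bool:
    # The > 0.5 threshold matches the cutoff used in this analysis.
    return reasoning_quality(question, reasoning) > 0.5
```

A degenerate loop like "the the the the..." clears the length floor but gets crushed by the repetition penalty, while an on-topic multi-sentence explanation clears 0.5 comfortably; that separation is all this filter is for.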
The headline result
66% of “wrong” answers had good reasoning.
A simple rule of thumb I used while reviewing: if you can look at the prompt and the model’s chosen option and immediately understand why it picked it, I treated that as an interpretation mismatch (or a valid alternative approach), not a reasoning failure.
That number matters because it points to a framing issue: many benchmark questions (especially commonsense and reading comprehension) quietly contain multiple plausible interpretations. When a benchmark expects a single continuation or a single “best” framing, the model can be penalized for being reasonable in a slightly different direction.