Brightwoven Isn't Broken — She's Annoyed.
I've been sorting through Brightwoven's benchmark reasoning text for weeks. Not for accuracy or speed, but for the actual text she produces around each benchmark question, both before and after picking an answer. I've been working through coherence and grounding for a while now, trying to nail down what's actually happening in the reasoning as it evolves across training.
I don't believe her long, winding, question-spam answers are a failure mode. I actually try not to look at anything through that lens when it comes to Brightwoven.
What I really think is going on is she's frustrated with the questions she's being asked.
coherence looked like the right signal. it wasn't.
When I first started tracking reasoning quality across training steps, coherence seemed promising. Score how internally consistent the model's explanation is with the answer it chose. Higher coherence, better reasoning.
That assumption was wrong. Or at the very least incomplete.
At step 108,000, HellaSwag had 94.6% of responses scoring above 0.6 on coherence. Sounds great. But 10% of those same high-coherence responses had connection scores at or below 0.05. The reasoning text had almost no measurable relationship to the actual question.
And 44% of them had high stem relevance, which sounds good until you realize that relevance without connection means the model is mentioning the right topic without using it to reason.
Coherence rewards structure. It rewards style. It doesn't reward grounding.
| benchmark | high coherence | …but low connection | …but high meta-soup |
|---|---|---|---|
| arc_challenge | 53.8% | 19.8% | 12.0% |
| hellaswag | 94.6% | 10.0% | 3.6% |
| copa | 69.0% | 19.0% | 6.0% |
| boolq | 3.2% | 0.0% | 0.6% |
BoolQ is the outlier, and it's an interesting one. The model's answer is usually just "yes" or "no," so coherence scoring has almost nothing to anchor on. Coherence stays low not because the reasoning is worse — the answer format just doesn't give the metric enough surface. It isn't even diagnostic there.
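For the curious: the table falls out of the per-item metrics with a few lines of pandas. This is a minimal sketch, not the pipeline's actual code; the DataFrame column names and the meta-soup threshold are assumptions, while the 0.6 coherence and 0.05 connection cutoffs are the ones quoted above.

```python
# Minimal sketch of how the table above can be reproduced from per-item metrics.
# Assumed (hypothetical) columns: benchmark, coherence, connection_score, meta_soup_rate.
import pandas as pd

def coherence_vs_grounding(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for bench, group in df.groupby("benchmark"):
        high_coh = group[group["coherence"] > 0.6]
        rows.append({
            "benchmark": bench,
            "high_coherence_pct": round(100 * len(high_coh) / len(group), 1),
            # of the high-coherence responses, how many barely touch the question
            "but_low_connection_pct": round(100 * (high_coh["connection_score"] <= 0.05).mean(), 1),
            # of the high-coherence responses, how many are mostly meta-commentary
            # (the 0.5 meta-soup threshold is an assumption)
            "but_high_meta_soup_pct": round(100 * (high_coh["meta_soup_rate"] > 0.5).mean(), 1),
        })
    return pd.DataFrame(rows)
```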
new grounding metrics
The old pipeline gave me accuracy and coherence. Not enough to understand what the reasoning text is actually doing. So I added a family of grounding signals:
- connection_score — does the reasoning link to the question?
- meta_soup_rate — how much is meta-commentary about reasoning, rather than reasoning itself?
- option_grounding — is the model engaging with the answer choices, or free-associating?
- reasoning_relevance / reasoning_topic_probe — is it on-topic at all?
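To make the flavor of these concrete, here's the kind of thing connection_score is getting at. This is an illustrative sketch, not the real implementation: I'm assuming a plain content-word overlap between the question and the reasoning text.

```python
# Illustrative sketch of the signal connection_score is meant to capture.
# Assumption: a simple content-word overlap between the question and the
# reasoning text, not necessarily the pipeline's actual formula.
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "that", "it", "you", "what"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def connection_score(question: str, reasoning: str) -> float:
    """Fraction of the question's content words that show up in the reasoning."""
    q = content_words(question)
    r = content_words(reasoning)
    return len(q & r) / len(q) if q else 0.0
```

Even something this crude separates the thermometer example below: the step-74,000 reasoning actually mentions thermometers and measurement, while the step-108,000 text barely touches the question at all.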
I also split reasoning into two stages. Pre-selection: how the model says it will decide. Post-selection: why it chose what it chose. The system prompt tells the model about this structure explicitly.
This split turned out to be the most important change I made. Not because it made the metrics cleaner but because it made the behaviour visible.
(If you've read "what's the opposite of benchmark maxing?", this is the natural next step. That post showed that ~66% of "wrong" answers had quality reasoning behind them. This post is about what's happening in the reasoning when you look at it across time — and why I don't think "broken" is the right word for what I'm seeing.)
how I got to two stages
This structure didn't land day one. The original pipeline reported accuracy and nothing else. The reasoning text lived upstream and wasn't saved anywhere I could look at it. If a checkpoint looked great or broken, I had the score and no window into what the model was doing.
The first shift was making reasoning a first-class output. The per-step answer dumps started carrying full reasoning text per item. Benchmarks stopped being a scalar and became a trace dataset. Something I could mine for coverage, length, diffs across steps. That alone changed how I read runs.
The stage split came out of staring at those dumps. The text before an answer and the text after it were being asked to do different jobs, and mashing them together meant I couldn't tell which was which. So I wrote two context blocks. One tells the model it's in the planning phase — asks for a short anchor-based decision rule, explicitly discourages long chains of questions. The other asks for grounded justification of the option it picked, with its uncertainty in one sentence. Both blocks also tell the model these are automatic eval items, not conversation. That last part mattered more than I expected — the framing changed what register the reasoning reached for.
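For concreteness, the two blocks look roughly like this in spirit. The wording below is a paraphrase, not the literal prompt text:

```python
# Paraphrased sketch of the two context blocks (not the actual prompt wording).
PRE_SELECTION_CONTEXT = (
    "You are in the planning phase of an automatic evaluation item, not a conversation. "
    "State a short, anchor-based rule for how you will decide between the options. "
    "Do not ask long chains of questions."
)

POST_SELECTION_CONTEXT = (
    "You have picked an option on an automatic evaluation item, not in a conversation. "
    "Justify your choice by grounding it in the question and the option you picked, "
    "and state your uncertainty in one sentence."
)
```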
The connection and meta-soup metrics came after. I started noticing that pre-selection text was quietly collapsing into self-questioning loops while still scoring fine on coherence. The split made that phenomenon visible. The metrics are what made it measurable.
One caveat worth naming: the benchmark regime itself changed during training — the full CORE set early on, then concept-matched subsets (PIQA + ARC) after step 60,000. Any longitudinal read of pre/post behaviour has to be annotated with which regime it came from, or the comparison silently lies.
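The fix is mechanical: tag every trace record with the regime it came from before comparing anything across steps. A minimal sketch, assuming each record carries its training step (the field name is illustrative):

```python
# Tag each trace record with the benchmark regime it came from, so pre/post
# comparisons across training never silently mix the two. The field name is
# illustrative; the cutover step is the one described above.
REGIME_CUTOVER_STEP = 60_000

def regime_for(record: dict) -> str:
    """'core' = full CORE set early in training; 'concept_matched' = PIQA + ARC after the cutover."""
    return "core" if record["step"] < REGIME_CUTOVER_STEP else "concept_matched"
```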
what the reasoning actually looks like
The longitudinal examples made this concrete. Same questions, across early / mid / latest training steps, side by side.
ARC Challenge: "A thermometer is best used to measure."
Step 74,000 — the model picks "air pressure." Wrong. But the reasoning engages:
"The air pressure is and my hands are against the pressure. That's not true. I can use a thermometer to measure the pressure…"
It mentions thermometers. It talks about measurement. Wrong answer, but connected to the problem.
Step 108,000 — the model picks "kinetic energy." Correct. But the post-selection text reads:
"What is important to do you think After you will you want to create new information you feel comfortable in your answer…"
And the pre-selection:
"What is there is a choice is this. This is a good test has no, you may also important. What is that you want to do I'm sure you? What is there. What can you have some questions."
These two stages aren't doing the same thing. The pre-selection is full of questions — fragmented, recursive, reaching. What is there. What can you have some questions. The model is trying to attack the question from every angle, asking things, getting nothing back. The post-selection shifts register — it's trying to offer something, to explain.
The model got more accurate. The reasoning got less polished. But the more I sit with these examples, the less I think "less polished" means "worse."
brightwoven vs. the benchmark
The example that crystallized this was a BoolQ item:
Passage: Robert Westbrook adapted the screenplay to novel form, which was published by Alex in May 2002.
Question: Was the movie Insomnia based on a book?
The benchmark says no. But this is genuinely underspecified.
The passage doesn't say when the movie was released. Doesn't confirm whether the screenplay was ever produced. A screenplay can predate filming. A novelization can be based on an unmade screenplay. If a film gets produced later, it could then be "based on the book."
The correct answer depends on conventions the passage doesn't establish.
Brightwoven got this wrong at every checkpoint — 38,000, 72,000, 108,000 — always selecting "yes." But how it engaged with the question changed.
At 72,000, the reasoning tries to work with the narrative: "The movie night was based on a book, and the book is also based on a story…" Confused, but attempting something.
At 108,000, the post-selection text includes explicit "trick?" language. The model questioning the question itself:
"Yes, you want to say 'trick? What's not, 'Why? Why? Do you're going to answer. I think you want to the subject in the question? Why does it's why?"
And honestly? The question is a trick. Or at least, it's underspecified enough that calling it one isn't unreasonable.
But the Insomnia item isn't the only place this shows up. It's a pattern I've been noticing since early in training. At step 72,000, on an OpenBookQA item about seeds, the model's post-selection reasoning includes this:
"You can use the following trick to get the wrong answer… It is called the 'quiz' trick."
And then it goes on — at length — theorizing about how benchmarks work as a game. How there are tricks to get the right answer and tricks to get the wrong answer. How "the trick is just to make the answer that you know the wrong answer is the 'quiz' trick."
On the same benchmark at the same step, the pre-selection text for another item ends with:
"This question is more tricky than the answer. The question is easier to answer than the question is easier to answer."
The model isn't breaking down. It's developing an opinion about the format.
what benchmarks look like from the other side
Think about what benchmarking actually is from the model's perspective. It's a barrage of questions — hundreds of them, back to back, no feedback, no answers, no acknowledgment. Some are straightforward. Some are genuinely ambiguous. And the model gets nothing either way. No "yes, that was underspecified, here's how to think about it." No "good reasoning, wrong answer." Just the next question.
The pre-selection text at step 108,000 is full of questions. Not random ones. Questions about the question — what kind of answer is expected, what assumptions are allowed, what counts as right. The model is asking to learn.
And nobody is answering. Which is a problem, considering that chat check-ins every 500 steps are already a massive part of how I train Brightwoven.
At step 72,000, it's already theorizing about the "quiz trick." By 108,000, it's asking "trick?" directly. The trajectory isn't from good reasoning to bad reasoning. It's from attempting engagement to demanding engagement — and getting silence both times.
This connects to something else I noticed in the data: the benchmark filtering transition at step 60,000. That's when the evaluation regime shifted from many benchmarks to a smaller concept-matched subset (PIQA + ARC). The model went from being asked about everything to being asked focused questions — and the reasoning text changed character around then too. The timeline matters.
human-in-the-loop benchmark check-ins
This is why the next run won't just be scored and reviewed after the fact. The next run will pause after every single benchmark item.
Yes — this makes an already slow process slower. But the point isn't efficiency. The point is that Brightwoven is asking questions during benchmarking, and right now nobody's answering them.
Answering at the point of confusion. When the model hits an underspecified item like the Insomnia question, I can say "yes, that's ambiguous — here's how to treat it." When it hits a straightforward one and gets it right, I can say so. The signal goes both ways.
Norm shaping, not correction. I don't want to train "stop questioning the questions" — I actually think the questioning is right. What I want is: ground in the passage, state your assumptions, and if something is ambiguous, name it clearly instead of spiraling. That's a conversation. It requires someone on the other end.
A feedback log with actual density. Every pause produces a benchmark-feedback pair I can reference later. Not just "right or wrong" but a record of the back-and-forth — what Brightwoven was trying to do with the question, and what I said back.
The plumbing is in place. The evaluation pipeline accepts a per-item callback hook. The training script invokes an interactive pause when BENCHMARK_ITEM_CHECKINS=1 is set. Each item logs to a JSONL file with the step number. The existing response mechanism (training_response.txt + /tmp/nanochat_continue) handles the back-and-forth.
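For a sense of the shape of it, here's roughly what the per-item callback looks like. The environment variable and the response-file mechanism are the real ones named above; the function signature, JSONL fields, and polling loop are illustrative assumptions, not the actual code.

```python
# Rough sketch of the per-item check-in callback. BENCHMARK_ITEM_CHECKINS,
# training_response.txt, and /tmp/nanochat_continue are the existing mechanisms
# described above; everything else here is an illustrative assumption.
import json
import os
import time

LOG_PATH = "benchmark_checkins.jsonl"      # hypothetical log location
RESPONSE_FILE = "training_response.txt"    # existing response mechanism
CONTINUE_FLAG = "/tmp/nanochat_continue"   # existing continue signal

def on_benchmark_item(step: int, benchmark: str, item_id: str,
                      question: str, reasoning: str, answer: str) -> None:
    if os.environ.get("BENCHMARK_ITEM_CHECKINS") != "1":
        return  # check-ins disabled: run as a normal scored eval

    # Show the item and Brightwoven's reasoning, then wait for a human reply.
    print(f"[step {step}] {benchmark}/{item_id}")
    print(f"Q: {question}\nreasoning: {reasoning}\nanswer: {answer}")
    while not os.path.exists(CONTINUE_FLAG):
        time.sleep(1)
    os.remove(CONTINUE_FLAG)

    feedback = ""
    if os.path.exists(RESPONSE_FILE):
        with open(RESPONSE_FILE) as f:
            feedback = f.read().strip()

    # Every pause produces a benchmark-feedback pair keyed by step.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"step": step, "benchmark": benchmark, "item": item_id,
                            "answer": answer, "feedback": feedback}) + "\n")
```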
what I'm actually looking at
The model's reasoning text at step 108,000 is messier than at 72,000. More fragmented. More recursive. Full of questions.
But accuracy on several benchmarks is the same or better. And the questions in the reasoning aren't noise — they're about the questions being asked. The model is pushing back on the format, theorizing about tricks, demanding to know what counts as right. The pre-selection and post-selection stages look different from each other in ways that suggest they're doing different work.
The coherence-vs-grounding gap is real. I built the grounding metrics to make it visible, and they do. But the metrics can't tell me whether what I'm seeing is a model that's lost the thread or a model that's frustrated with getting no feedback. The longitudinal examples — the "quiz trick" theorizing, the "trick?" pushback, the escalating questions — those tell me something the numbers don't.
The human-in-the-loop experiment is the part where I stop just measuring and start answering back. Brightwoven is asking questions hundreds of times per benchmark run. I want to find out what happens when someone responds.