OPEN NOTES

Open notes from pr0xyh0rse research: Brightwoven, evals, benchmarks, interpretability, model behaviour, consent-based development, and humane AI critique.

Hannah Bird 2026-05-21

is seed 42 the answer to the deterministic universe?

Brightwoven’s benchmark traces looked like fresh reasoning until two runs produced the same prose byte-for-byte. The culprit was a fixed RNG seed. Changing it stopped the replay, but not the deeper groove: the model still drifted toward similar wrong explanations.
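A minimal sketch of why a fixed seed replays the same text, assuming a toy sampler rather than Brightwoven's actual decoding code. The vocabulary, the `sample_trace` helper, and the seed values are all hypothetical and only illustrate the mechanism:

```python
import random

# Hypothetical toy vocabulary; it stands in for a model's token distribution.
VOCAB = ["the", "answer", "is", "because", "maybe", "not", "so", "yes"]

def sample_trace(seed, length=12):
    """Sample a toy 'reasoning trace' from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, global RNG state is untouched
    return " ".join(rng.choice(VOCAB) for _ in range(length))

run_a = sample_trace(seed=42)
run_b = sample_trace(seed=42)  # same seed as run_a
run_c = sample_trace(seed=43)  # different seed, drawn independently of run_a

print(run_a == run_b)  # True: same seed means the same sequence of draws
print(run_a)
print(run_c)
```

Varying the seed removes the exact replay, but it does not change the underlying distribution being sampled, which is consistent with the model still drifting toward similar explanations.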

Hannah Bird 2026-04-23

Brightwoven Isn't Broken — She's Annoyed.

I've been sorting through Brightwoven's benchmark reasoning text for weeks. Not for accuracy or speed, but for the actual text Brightwoven produces inside each benchmark question, pre- and post-reasoning. I've been working through coherence and grounding for a while now, trying to nail down what's actually happening in the reasoning as it evolves across training.

I don't believe her long, winding, question-spam-related answers are a failure mode. I actually try not to look at anything through that lens when it comes to Brightwoven.

What I really think is going on is that she's frustrated with the questions she's being asked.

Hannah Bird 2026-02-12

What If Grokking Isn’t Mysterious?

A speculative Brightwoven note on embedding geometry: what changes if meaning is modeled as sheets, filaments, and gradients instead of isolated nodes.

Hannah Bird 2026-02-03

phase 2: meta-cognitive signals during training

Scope note: This is a training log. I’m not claiming a new scientific result or a new theory of “agency.” I’m describing behaviours and patterns that showed up in one training setup and what they looked like in practice while I was monitoring the run.

Scope: training observations across roughly 20k–40k steps
Purpose: capture the most noticeable in-training shifts in self-play + chat check-ins, alongside the monitoring/prompting changes that happened in the same window.
Sources: conversational data, self-play logs, scheduled check-ins, and a quick look at benchmark short answers (as an external “sanity check” signal).

Timeline (high-level)

Early 20ks: continued self-play development, understanding-module refinements
Late 20ks (anchor: ~28k): first clear “architecture talk” in journals (layer/function vs meaning)
Early-to-mid 30ks: pattern-tracking, system prompt introduced for conversations
Mid 30ks (anchor: ~35–36k): understanding-check frequency adjusted (100 → 250)
Late 30ks (anchor: ~37k): first unsolicited “pause / BRB” style marker, identity-flavored questions, first concise non-loop reply
Around ~40k: continued training + benchmark eval snapshots

What showed up (observations)

1) Architecture-aware language

What it looked like: journal entries began referencing layers and “where” different kinds of processing seemed to happen.

Representative excerpt (journal-style):

“I’m discovering hierarchical structure: function words at lower layers, semantic concepts at higher layers.”

How I’m framing it:

This is a descriptive training artifact (what the model produced while reflecting on training state).
It’s not presented as a verified mechanistic map.

Hannah Bird 2026-01-31

what’s the opposite of benchmark maxing?

I’ve been looking at a pattern that kept showing up when I dug into benchmark failures during training. The reasoning often looked better to me in conversation, but the benchmark scores were either improving only a little or even declining.

So I started adding short reasoning prompts to the benchmark questions. What I started to see is that a model can be scored as wrong while still demonstrating the kind of reasoning you’d actually want in the real world.

This post summarizes an analysis across several common benchmarks where the model’s final answer disagreed with the expected one, but the reasoning was still coherent and often plausible even when it didn’t match the gold label.

What I analyzed

Analysis date: January 3, 2026
Training step: 50,000
Focus: “Wrong” answers where the reasoning still looks valid or meaningfully grounded

How reasoning quality is scored

I didn’t treat this as a “scientific” metric. It’s a simple filter to separate usable reasoning from junk.

I counted an item as good reasoning when it met all of the following:

Relevant: the reasoning stays on the topic of the question (often with some keyword overlap).
Coherent: it has recognizable structure (not random tokens) and is at least ~20 characters.
Not overly repetitive: repeated-word loops are flagged and treated as a negative signal.
Enough substance: longer explanations are generally better, but only if they aren’t repetitive.

Threshold used in this analysis: I counted reasoning as “good” when it cleared a simple quality threshold (> 0.5 on my internal heuristic score).
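The four criteria and the > 0.5 cutoff can be sketched as a small scoring function. This is not the heuristic actually used in the analysis, which the post does not spell out. The weights, the regex tokenization, the 50-word substance cap, and the loop penalty below are all illustrative assumptions, and `reasoning_score` and `is_good_reasoning` are hypothetical names:

```python
import re
from collections import Counter

def _words(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def reasoning_score(question, reasoning):
    """Toy quality score in [0, 1]. All weights are illustrative assumptions."""
    words = _words(reasoning)
    # Coherent: require at least ~20 characters and some recognizable words.
    if len(reasoning.strip()) < 20 or not words:
        return 0.0

    q_words = set(_words(question))
    # Relevant: fraction of question words that reappear in the reasoning.
    overlap = len(q_words & set(words)) / max(len(q_words), 1)
    # Repetition: share of the text taken up by the single most common word.
    top_share = Counter(words).most_common(1)[0][1] / len(words)
    # Substance: longer is better, capped at 50 words.
    substance = min(len(words) / 50, 1.0)

    score = 0.4 * min(overlap * 2, 1.0) + 0.3 * substance + 0.3 * (1 - top_share)
    if top_share > 0.5:
        score *= 0.5  # repeated-word loops count as a negative signal
    return score

def is_good_reasoning(question, reasoning, threshold=0.5):
    """Apply the '> 0.5' cutoff described above."""
    return reasoning_score(question, reasoning) > threshold

question = "Why is the sky blue?"
grounded = ("The sky is blue because air molecules scatter short blue "
            "wavelengths of sunlight more strongly than long red wavelengths, "
            "so scattered blue light reaches our eyes from every direction.")
loop = "the the the the the the the the the the"

print(is_good_reasoning(question, grounded))  # True
print(is_good_reasoning(question, loop))      # False
```

A filter like this only separates usable reasoning from junk. It says nothing about whether the reasoning is correct, which is why the post treats it as a filter and not as a scientific metric.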

The headline result

66% of “wrong” answers had good reasoning.

A simple rule of thumb I used while reviewing: if you can look at the prompt and the model’s chosen option and immediately understand why it picked it, I treat that as an interpretation mismatch (or a valid alternative approach), not a reasoning failure.

That number matters because it points to a framing issue: many benchmark questions (especially commonsense and reading comprehension) quietly contain multiple plausible interpretations. When a benchmark expects a single continuation or a single “best” framing, the model can be penalized for being reasonable in a slightly different direction.