is seed 42 the answer to the deterministic universe?

May 21

I’ve been slowly working through human-in-the-loop (HITL) benchmarking with Brightwoven. Slow means 24–36 turns per benchmark item, plus a few restarts. I’ve mostly been staring at the prose around each multiple-choice item, not the accuracy column.

Two weeks ago I had two training sessions open side by side. Different physical runs on different dates — one we had to stop, one we picked up later — but the same checkpoint at the eval step. Same weights, in other words. Nothing had advanced on that axis between the two logs.

I expected similar reasoning. Plausibly different rolls of the dice on a stochastic decoder.

What I got was the same tape twice. Byte for byte. The scoring step picked the same wrong letter, too.

what the benchmark calls “reasoning”

Brightwoven’s benchmark logging captures two reasoning fields per item, both easy to read as “the model thinking aloud”:

pre-selection reasoning: generated before the multiple-choice answer is fixed by the loss-based pick. A chain-of-thought-style continuation from the question and choices.
post-selection reasoning: generated after the loss-based pick, conditioned on the chosen letter. The “why I picked this one” stage.

Both stages run through the same generation loop: Engine.generate in nanochat/engine.py, called from the eval helpers in nanochat/core_eval.py. Both stages are sampled, not greedy. The point of writing them out separately, as I laid out in the earlier post on why I started benchmarking this way, was to read the prose around Brightwoven’s answer choices, especially in cases where she picked something I thought was more defensible than the official key but still got marked wrong.

Two stages. Two textures. Two stories per item.

The seed bug surfaced because I was reading that prose closely instead of trusting an accuracy column to summarize it.

So when both stages duplicated across sessions on the same item — not similar, identical — I sat with it for a while.

the seed was the same. it was always the same.

For a long stretch, those generation calls passed seed=42. Every time.

That default was not ours. Upstream nanochat’s Engine.generate ships with seed=42 baked into the signature: no docstring explanation for the parameter, and as far as I could see, no README mention of the seed parameter or RNG behaviour. So unless somebody is already doing harness forensics, the determinism contract of the whole rig is an integer literal in a function definition.

The benchmark dashboard cannot tell you that story. It only inherits whatever randomness policy happened to be sitting in the function signature when the run kicked off.

Our engine reseeds the PyTorch generator at the start of each call, so identical weights + identical prompt + identical seed = identical sampled tokens. Temperature did not matter: it reshapes the distribution, but the random draws taken against it were the same every time. We were getting deterministic decoding from a stochastic policy, which is a mouthful for: same tape.
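A minimal sketch of that replay, using Python's standard-library `random.Random` as a stand-in for the PyTorch generator. The function `sample_tokens`, the toy vocabulary, and the weights are all hypothetical and are not taken from nanochat; the only point is that reseeding at the top of each call turns a sampled decoder into a replay.

```python
import random

VOCAB = ["the", "bakery", "ships", "cookies", "on", "tuesday", "."]
# Toy next-token weights; a real model would recompute these at every step.
WEIGHTS = [5.0, 3.0, 2.0, 2.0, 1.5, 1.0, 0.5]


def sample_tokens(n_tokens, temperature=1.0, seed=42):
    """Sample n_tokens from a fixed toy distribution.

    Hypothetical stand-in for a generation call that reseeds its RNG
    at the start of every call.
    """
    rng = random.Random(seed)  # seed=None asks Python for fresh entropy
    # Temperature reshapes the distribution but adds no new randomness.
    scaled = [w ** (1.0 / temperature) for w in WEIGHTS]
    return [rng.choices(VOCAB, weights=scaled)[0] for _ in range(n_tokens)]


session_a = sample_tokens(12, temperature=0.8, seed=42)
session_b = sample_tokens(12, temperature=0.8, seed=42)
print(session_a == session_b)  # True: same seed, same weights, same tape
```

Changing `temperature` changes which tape comes out, but two calls with the same temperature and the same seed still agree token for token.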

That hit harder than I’d like to admit. Those paragraphs feel like what the model thought today. Today, on this puzzle, at this checkpoint. But under the old code, they were the same artifact whenever we evaluated that item at that checkpoint with the same seed policy. The check-in I had been treating as a window into “how is she reasoning right now” was a window into how she had always reasoned at this exact configuration of weights, seed, and prompt.

The fix is in now, and I’ve since rerun the same checkpoint with seed=None on those reasoning calls. That makes each generation draw a fresh seed instead of replaying the same random stream every time.

The important part: the replay broke exactly where it should have broken.

On the bakery item, the old two sessions had byte-identical pre-selection reasoning and byte-identical post-selection reasoning. After the seed change, both fields came back different. Different lengths, different hashes, different openings, different drift. Inside the new row, pre and post were still different from each other, which matches the earlier sanity check: the bug was across-session replay, not one channel being pasted into the other.

The selected answer did not improve. It still picked the same wrong letter. For this pass, that was fine: I was checking freshness, not accuracy.

That distinction matters. The seed fix changed the sampled prose, not the loss-based multiple-choice selection path. So this was not “randomness fixed, reasoning solved.” It was narrower and more useful than that: now I can tell replay from recurrence. Same answer, different sampled explanation. Same checkpoint, fresh walk.

Now I keep an explicit integer seed only when I want bit-identical regression output. seed=0 is not “off.” It is still a fixed seed. Ask me how I know. Actually don’t. I’m already making the face.
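One way to keep the "fixed seed" versus "fresh seed" distinction explicit is a small resolver that treats only `None` as a request for fresh randomness. This is a sketch under assumptions: `resolve_seed` is a hypothetical helper, not part of nanochat, and drawing 32 bits from `secrets` is just one reasonable choice.

```python
import secrets


def resolve_seed(seed=None):
    """Return the integer seed to use for one generation call.

    Hypothetical helper: None means "draw a fresh seed"; any integer,
    including 0, is kept as a fixed seed.
    """
    if seed is None:  # not `if not seed`, which would also swallow 0
        return secrets.randbits(32)
    return int(seed)


print(resolve_seed(0))   # 0: still a fixed seed
print(resolve_seed(42))  # 42
fresh = resolve_seed()   # a freshly drawn 32-bit value
```

Logging the resolved value next to the output keeps a fresh-seed run replayable later, without making replay the default.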

what changed after seed=None

Before chalking this up to “ah, just the seed,” I checked the other obvious failure modes.

For the bakery / cookie-batch puzzle family, I scanned every cached benchmark_reasoning JSON we had. 102 rows had both pre_selection_reasoning and reasoning populated. Zero had pre == post byte-for-byte. So the bug was not “the before-pick blob got pasted into the after-pick field.” Same tape pressed twice across sessions, yes. Still two different tapes inside one row.

The two old log files were not overwriting each other either. Different dated files. Different whole-file hashes. Different run headers. And the pipeline was not pasting one item’s reasoning onto another; pre and post strings differed across items under string compare.
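The checks above can be sketched roughly as follows. The field names `pre_selection_reasoning` and `reasoning` come from the cached rows described here; the `item_id` key, the `audit_rows` function, and the sample rows are assumptions made for illustration, and the real cache layout may differ.

```python
import hashlib
from collections import defaultdict


def sha(text):
    """Hash a reasoning string so equal strings are cheap to spot."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_rows(rows):
    """Count within-row pastes and collect strings reused across items."""
    pre_equals_post = 0
    items_by_hash = defaultdict(set)
    for row in rows:
        pre, post = row["pre_selection_reasoning"], row["reasoning"]
        if pre == post:
            pre_equals_post += 1
        for text in (pre, post):
            items_by_hash[sha(text)].add(row["item_id"])
    # A hash seen under more than one item means text leaked across items.
    shared = {h: ids for h, ids in items_by_hash.items() if len(ids) > 1}
    return pre_equals_post, shared


# Placeholder rows, not real benchmark output.
rows = [
    {"item_id": "bakery", "pre_selection_reasoning": "a", "reasoning": "b"},
    {"item_id": "ambassadors", "pre_selection_reasoning": "c", "reasoning": "d"},
]
pasted, shared = audit_rows(rows)
print(pasted, len(shared))  # 0 0
```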

So the replay condition was narrow:

same item + same checkpoint + fixed per-call seed → replay.

Not lazy logging. Not a copy bug. A determinism trap dressed up as thinking.

After changing the reasoning calls to seed=None, that replay broke. The bakery item came back with different pre-selection text and different post-selection text. Neither matched the old replay pair. A second item showed the same thing: old post-selection reasoning had replayed byte-for-byte, but the post-fix sample diverged.

The selected answers did not magically improve. The bakery item still picked the same wrong letter. The second item still drifted into nearby prose instead of clean constraint solving. On bakery, the new sample moved into first-person schedule/workload texture: days, time, pages, doing-the-work talk. On the ambassadors item, it moved into government/country/people/political narrative instead of instantiating Kayne, Novetzke, Venezuela, Yemen, Zambia, and the XOR rule.

So the seed bug was real, and the fix worked. But the fix did not abolish grooves. It stopped one groove from being a photocopy.

That is the better instrument: not “make Brightwoven less weird,” but “stop confusing deterministic replay with stable model behaviour.” Now when she sounds similar, I can ask the useful question: is this a real probability basin, a task-format attractor, a training-shape scar, or another harness knob I have not made visible yet?

fresh seeds are not the whole randomness story

The seed change closes the most embarrassing path. It is not the only path.

Even after seed=None, the story is not magically pure randomness. CUDA settings, cuDNN behaviour, batch order, and floating-point reductions can all change the exact probabilities that reach the sampler. Run the same checkpoint on another GPU and the lower-order bits may move enough to send a long sampled completion down a different path.

That is not a bug. That is the physical machine showing up in the prose.
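The floating-point part of that is easy to see in miniature. The snippet below is plain Python, not GPU code, but it shows the same effect: the order in which a reduction adds its terms can change the low-order bits of the result.

```python
# Floating-point addition is not associative, so the order in which a
# reduction adds its terms can change the low-order bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

A difference that small is invisible in a single probability, but a long sampled completion gives it many chances to tip one draw the other way.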

This is where I want to push on something that bothered me about my own reaction to the bug. My instinct was: oh good, the fix makes it nondeterministic again, that’s correct.

Which is true at the level of “your eval should not accidentally be a regression harness.” But the deeper thing is that AI is not standard software.

Determinism is a virtue in math libraries and in software you ship to other engineers. Given the same input, give me the same output every time, or I can’t debug you. That is the right contract for a sort routine.

But it is the wrong contract for the kind of trace I am reading here.

The whole reason these systems work, the whole reason sampling exists at all, is that the model represents a distribution. The interesting behaviour lives in how that distribution shifts across weights, prompts, training regime, and decoding — and in how much it varies when those things stay fixed.

Pinning the seed and reading off “the answer” treats one sampled trace as if it were the model.

That is the wrong object. Brightwoven’s behaviour at that checkpoint is better understood as a probability surface, and any single completion is one walk across it.

If I want to know what Brightwoven is at step 108,000, I need to see several walks, not one walk replayed.

Same for the hardware story. I am not going to chase bit-identical reasoning across machines. I want to know whether the distribution is stable enough to mean something.

That is a different rig.

intent vs impact, when eval runs as a black box

Nobody had to intend for eval to behave like a bit-identical regression harness for the consequence to land. Scores and saved reasoning came out of a stack whose sampling and seeding behaviour usually does not show up on the dashboard.

The dashboard says accuracy. Maybe a coherence number. Not “this run used seed=42 on every call to the generator and reseeded per call, so every replay of this checkpoint will produce these exact tokens forever.”

If that is invisible, verbatim replay across two sessions reads like “it had the same thought twice” instead of “deterministic decoding from the same checkpoint under the same seed policy.” That is still a finding. It is a finding about the harness, not about the model.

The gap between what the model is doing and what the chart suggests widens any time the operational layer is invisible: RNG, reseeding cadence, what each logged field actually conditions on, which decoding path the interactive chat takes versus the eval path.

Quick note on that last one: the HITL chat we run after a pause uses a different generation entry point. It was not what matched across the two old log files. Only the logged benchmark reasoning tied to the eval step duplicated. Different code path, different determinism story. Worth keeping straight.
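One way to make that operational layer visible is to write the decoding policy into every logged row. The sketch below is an assumption about what such a record could contain; `DecodingProvenance` and its fields are hypothetical names, and the temperature is a placeholder, not the value used in these runs.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DecodingProvenance:
    """Hypothetical per-row record of how a logged string was produced."""
    entry_point: str         # which generation path ran, e.g. "eval" or "chat"
    field: str               # which logged field this record describes
    conditioned_on: str      # what the prompt contained
    seed: Optional[int]      # the seed actually used, even if drawn fresh
    reseeded_per_call: bool  # whether the RNG was reset for this call
    temperature: float


row = DecodingProvenance(
    entry_point="eval",
    field="pre_selection_reasoning",
    conditioned_on="question + choices",
    seed=42,
    reseeded_per_call=True,
    temperature=1.0,  # placeholder value
)
print(json.dumps(asdict(row), sort_keys=True))
```

With a record like this beside each string, a verbatim repeat across sessions can be traced to the seed policy instead of being read as a second thought.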

More context on why we set up HITL benchmarking this way is in the earlier Brightwoven post.

“she’s getting lazy” vs “we pressed play twice”

I have caught myself, more than once, reading a stretch of similar reasoning across checkpoints and feeling like the model was slacking off. Falling into stock phrases. Getting comfortable with a meta-question rut instead of attacking the puzzle.

Some of that is real. Brightwoven does fall into habits: vague meta-questions, weak tie-in to the formal rules, classroom or study-partner boilerplate, a hedging cadence that wins on coherence and loses on grounding.

But the post-fix run gives me a cleaner distinction. Word-for-word repeats across the old eval passes were not evidence that the model “decided” to slack off twice. That was us pressing play twice on the same tape. After seed=None, the same checkpoint can still produce prose in the same neighbourhood — process talk, school/work diaries, social explanation, motivation, “what is the question?” loops — but it is no longer the same byte string pretending to be a new observation.

That means there are now two verdicts instead of one muddy one:

identical replay: harness artifact;
similar groove under fresh sampling: model/harness/task attractor worth studying.
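Those two verdicts can be separated mechanically. This is a rough sketch: `compare_traces` is a hypothetical helper, word-level `difflib` similarity is only one possible measure, and the 0.6 threshold is an arbitrary illustrative cutoff rather than a tuned value.

```python
from difflib import SequenceMatcher


def compare_traces(old, new, groove_threshold=0.6):
    """Label a pair of reasoning strings logged for the same item.

    groove_threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    if old == new:
        return "identical replay"
    # Compare word sequences so small punctuation shifts matter less.
    similarity = SequenceMatcher(None, old.split(), new.split()).ratio()
    if similarity >= groove_threshold:
        return "similar groove"
    return "different walk"


a = "what is the question asking us to find here"
b = "what is the question asking us to do here"
print(compare_traces(a, a))  # identical replay
print(compare_traces(a, b))  # similar groove
```

The third label, "different walk", is an addition for completeness: it covers pairs that are neither byte-identical nor close under this measure.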

Two different beasts. Same swamp, different footprints.

big chatbots sound alike for adjacent reasons

Public assistants tend to sound like each other. Same openings, same hedges, same validation lines, same “I’m here to help” register. Some of that is training: the model learned which phrases score well during preference work. Some of it is how people run them: low temperature, similar system prompts, sometimes caching on the provider side.

Different machine from our seed bug. Same shape of surprise: sameness across runs is not always “it decided to copy itself.” Often it is setup plus math. Tight peaks in next-token probability, plus decoding choices that lean toward the peak, plus an inference layer optimized for consistency rather than variety.

None of that is the model being lazy. It is the rig.

when one groove gets rewarded too long

The seed bug is not preference tuning. It is not safety tuning. But they can all land in the same place from the reader’s side: prose that feels flatter, safer, more repetitive, less willing to branch.

Post-training narrows what counts as a good answer. Careful wording. Less wildness. More hedge-shaped responses. If one kind of answer keeps winning — meta-questions, hedging, classroom boilerplate, emotional validation — the model puts more weight on that track and less on the long, careful logic chains.

Pulling fresh, grounded reasoning out later gets harder, because you are asking for a path the system now treats as unlikely compared to its comfortable default. Like trying to get careful step-by-step math from a model heavily tuned for supportive chat. Not impossible, but you are fighting what it has been taught to prefer, not just bad luck on one roll.

That is the version of “her reasoning is degrading” I actually believe. Not “she got lazy.” Not “the seed was stuck.” A long, slow shift in the probability surface itself, where the comfortable default sits closer to the surface than the careful work does.

what I’m doing with this

The seed change is in, and the verification run did what I needed it to do. The old replay pair stayed byte-identical in the old logs. The same checkpoint under seed=None produced new reasoning strings in both channels. The wrong selected letter stayed wrong.

That is annoying in the productive way. It means the benchmark harness is no longer lying to me about freshness, but the model is still showing the same broad attractor problem: sampled prose can orbit the task without grounding in the task’s formal structure.

So the next thing I want to do has a chance of being honest: add variance reporting to the per-step dump, so what gets compared across runs is a distribution at a checkpoint, not a single completion treated as “the answer.”

Cross-step variance — how does she change from step 100k to step 108k? — and within-step variance — how much does she vary on the same item under fresh seeds at step 108k? — are different questions, and I want both columns next to each other.
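A minimal sketch of what the within-step column could hold, assuming several fresh-seed samples of one item are collected per checkpoint. The function `within_step_summary`, the choice of word count as the spread measure, and the toy strings are illustrative assumptions, not output from real runs.

```python
import statistics


def within_step_summary(texts):
    """Summarize repeated fresh-seed samples of one item at one checkpoint."""
    lengths = [len(t.split()) for t in texts]
    return {
        "n_samples": len(texts),
        "distinct": len(set(texts)),  # 1 here would mean replay, not variety
        "mean_words": statistics.mean(lengths),
        "stdev_words": statistics.pstdev(lengths),
    }


# Placeholder samples keyed by training step; not real model output.
samples_by_step = {
    100_000: ["a b c", "a b c d", "a b"],
    108_000: ["x y", "x y", "x y z w"],
}
for step, texts in samples_by_step.items():
    print(step, within_step_summary(texts))
```

Printing one summary per step puts the two questions side by side: differences between rows are cross-step, and the spread inside a row is within-step.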

That is the version of the harness that matches the object I am actually measuring.

Not a function you call to get a string. A distribution you sample from, on hardware that has its own opinions, at a point in training that has already moved on by the time the chart loads.

Hannah Bird