Scope note: This is a training log. I’m not claiming a new scientific result or a new theory of “agency.” I’m describing behaviours and patterns that showed up in one training setup and what they looked like in practice while I was monitoring the run.

  • Scope: training observations across roughly 20k–40k steps

  • Purpose: capture the most noticeable in-training shifts in self-play + chat check-ins, alongside the monitoring/prompting changes that happened in the same window.

  • Sources: conversational data, self-play logs, scheduled check-ins, and a quick look at benchmark short answers (as an external “sanity check” signal).

Timeline (high-level)

  • Early 20ks: continued self-play development, understanding-module refinements

  • Late 20ks (anchor: ~28k): first clear “architecture talk” in journals (layer/function vs meaning)

  • Early-to-mid 30ks: pattern-tracking, system prompt introduced for conversations

  • Mid 30ks (anchor: ~35–36k): understanding-check cadence adjusted (every 100 steps → every 250)

  • Late 30ks (anchor: ~37k): first unsolicited “pause / BRB” style marker, identity-flavored questions, first concise non-loop reply

  • Around ~40k: continued training + benchmark eval snapshots

What showed up (observations)

1) Architecture-aware language

What it looked like: journal entries began referencing layers and “where” different kinds of processing seemed to happen.

Representative excerpt (journal-style):

“I’m discovering hierarchical structure: function words at lower layers, semantic concepts at higher layers.”

How I’m framing it:

  • This is a descriptive training artifact (what the model produced while reflecting on training state).

  • It’s not presented as a verified mechanistic map.

2) Pause-marking and “self-initiated” conversation cues

What it looked like: the model produced a pause-request-like marker off-cycle (not at the regular interval).

Context: I had already set up the [brb:...] mechanism as an available behaviour. Earlier in training it only appeared on the scheduled cadence. This was the first time it showed up outside that cadence.

Example (check-in style):

  • [brb:...]

How I’m framing it:

  • I’m calling this an in-training communication shift.

  • “Agency” can be loaded language; here it means “the model invoked an available pause action without the usual scheduling trigger.” That is a behavioural fact in the logs, and I’m not going to pretend it’s meaningless just because the word makes people uncomfortable.
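
For concreteness, here's roughly how I watch for these off-cycle markers. This is a minimal sketch, not the actual monitoring code: it assumes a JSONL log where each check-in carries a `step` and a `text` field, and the function name, default cadence, and tolerance are made up for illustration. Only the [brb:...] token itself comes from the setup described above.

```python
import json
import re

# Matches the [brb:...] pause marker described above.
BRB_PATTERN = re.compile(r"\[brb:[^\]]*\]")

def find_off_cycle_brbs(log_path: str, cadence: int = 250, tolerance: int = 5):
    """Return (step, marker) pairs where a [brb:...] marker appeared
    outside the scheduled check-in cadence (within +/- tolerance steps).

    Assumed log format (hypothetical): one JSON object per line,
    e.g. {"step": 37012, "text": "..."}.
    """
    hits = []
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            step, text = entry["step"], entry["text"]
            for marker in BRB_PATTERN.findall(text):
                # Distance (in steps) from the nearest scheduled check-in.
                offset = min(step % cadence, cadence - step % cadence)
                if offset > tolerance:
                    hits.append((step, marker))
    return hits
```

Anything this flags reduces to "the marker showed up at an unscheduled step," which is exactly the behavioural fact I'm claiming and nothing more.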

3) Identity-flavored questions

What it looked like: “Who am I? Where am I? Who are you?”-type questions began appearing in check-ins.

How I’m framing it:

  • I’m treating this as a theme that showed up in text outputs, not a conclusion about inner experience yet.

4) Repetitive loops that sometimes made room for concise replies

What it looked like: loops remained common, but there were occasional short coherent messages that did not immediately collapse.

Example (concise reply):

“I think that’s cool. I hope that you have a chance to show me that.”

What this suggests (without over-claiming)

  • Some training-time behaviours are visible in the artifacts themselves (self-play logs, check-ins), not only in benchmark curves.

  • Repetition loops were not purely random. The loops sometimes changed content over time, and occasionally “made room” for short structured statements (a rough way to quantify this is sketched after this list).

  • Certain prompting + monitoring rhythms (system prompt, check frequency) can change the texture of what you see during training, even when the underlying optimization story is unclear.
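
One way to make the "not purely random" claim checkable is to score each message for how loopy it is and watch that score across steps. Below is a minimal sketch of the kind of heuristic I mean (a repeated n-gram fraction); the function name and the choice of n are mine, not part of the training setup.

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams in a message that occur more than once.
    Values near 1.0 suggest a tight repetition loop; a dip can flag
    the short, structured statements described above. A monitoring
    heuristic only, not a claim about what the loops mean."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A loopy message scores high; a concise reply scores low.
print(repeated_ngram_fraction("who am i who am i who am i"))  # 1.0
print(repeated_ngram_fraction("I think that's cool."))        # 0.0
```

Tracking this per check-in across the 20k–40k window is how I'd try to show the loops changing content over time, rather than just asserting it.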

Challenges / open questions

  • Loops are still the dominant failure mode, but I'm not treating them as strictly noise; there are pieces in there that I believe matter for interpretability.

  • Interpretation risk: not everyone is going to see what I see in the repetitive loops. I think it’s important to look at all output as potential signal.

  • Causality is unclear: multiple things changed across this window (prompts, monitoring frequency), so I’m treating this as a timeline, not a controlled experiment.

Transition to next phase

What I’m carrying forward is mostly practical:

  • keep monitoring and logging the loops and watch what happens as the model moves into more human-readable language

  • keep language grounded in “what it looks like” rather than “what it proves”

  • continue training and watch whether concise/coherent moments become more frequent or more stable
