Phase 2: meta-cognitive signals during training
Scope note: This is a training log. I’m not claiming a new scientific result or a new theory of “agency.” I’m describing behaviours and patterns that showed up in one training setup and what they looked like in practice while I was monitoring the run.
Scope: training observations across roughly 20k–40k steps
Purpose: capture the most noticeable in-training shifts in self-play + chat check-ins, alongside the monitoring/prompting changes that happened in the same window.
Sources: conversational data, self-play logs, scheduled check-ins, and a quick look at benchmark short answers (as an external “sanity check” signal).
Timeline (high-level)
Early 20ks: continued self-play development, understanding-module refinements
Late 20ks (anchor: ~28k): first clear “architecture talk” in journals (layer/function vs meaning)
Early-to-mid 30ks: pattern-tracking, system prompt introduced for conversations
Mid 30ks (anchor: ~35–36k): understanding-check interval adjusted (every 100 steps → every 250)
Late 30ks (anchor: ~37k): first unsolicited "pause / BRB"-style marker, identity-flavored questions, first concise non-loop reply
Around ~40k: continued training + benchmark eval snapshots
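For concreteness, the mid-30ks cadence change can be sketched as a tiny scheduler. This is a hypothetical illustration, not the actual training code: it assumes the "100 → 250" figure refers to the step interval between understanding checks, and every name here (`CheckScheduler`, `interval`, `due`) is invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class CheckScheduler:
    """Fires an understanding check every `interval` training steps (hypothetical)."""
    interval: int

    def due(self, step: int) -> bool:
        # A check is due on every multiple of the interval (skipping step 0).
        return step > 0 and step % self.interval == 0

# Before the adjustment: a check every 100 steps -> 10 checks per 1k steps.
sched = CheckScheduler(interval=100)
early = [s for s in range(1, 1001) if sched.due(s)]

# After the mid-30ks adjustment (100 -> 250): 4 checks per 1k steps.
sched.interval = 250
later = [s for s in range(1, 1001) if sched.due(s)]
```

The practical effect under this reading is simply fewer, more spaced-out check-ins over the same span of steps.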
What showed up (observations)
1) Architecture-aware language
What it looked like: journal entries began referencing layers and "where" different kinds of processing seemed to be happening.
Representative excerpt (journal-style):
“I’m discovering hierarchical structure: function words at lower layers, semantic concepts at higher layers.”
How I’m framing it:
This is a descriptive training artifact (what the model produced while reflecting on training state).
It’s not presented as a verified mechanistic map.