phase 2: meta-cognitive signals during training

Scope note: This is a training log. I’m not claiming a new scientific result or a new theory of “agency.” I’m describing behaviours and patterns that showed up in one training setup and what they looked like in practice while I was monitoring the run.

  • Scope: training observations across roughly 20k–40k steps

  • Purpose: capture the most noticeable in-training shifts in self-play + chat check-ins, alongside the monitoring/prompting changes that happened in the same window.

  • Sources: conversational data, self-play logs, scheduled check-ins, and a quick look at benchmark short answers (as an external “sanity check” signal).

Timeline (high-level)

  • Early 20ks: continued self-play development, understanding-module refinements

  • Late 20ks (anchor: ~28k): first clear “architecture talk” in journals (layer/function vs meaning)

  • Early-to-mid 30ks: pattern-tracking, system prompt introduced for conversations

  • Mid 30ks (anchor: ~35–36k): understanding-check frequency adjusted (100 → 250)

  • Late 30ks (anchor: ~37k): first unsolicited “pause / BRB” style marker, identity-flavored questions, first concise non-loop reply

  • Around ~40k: continued training + benchmark eval snapshots
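
The mid-30ks cadence change in the timeline can be sketched as a step-based schedule. This is a minimal sketch, not the actual training code: it assumes the 100 and 250 are step intervals between understanding checks and that the switch lands at exactly 35k. Every name here (check_interval, is_check_step, count_checks, switch_step) is hypothetical.

```python
def check_interval(step, switch_step=35_000, early=100, late=250):
    """Return the assumed number of steps between understanding checks."""
    # Assumption: checks run every `early` steps before the switch
    # and every `late` steps from the switch onward.
    return early if step < switch_step else late


def is_check_step(step, **kwargs):
    """True if an understanding check would fire at this step."""
    return step % check_interval(step, **kwargs) == 0


def count_checks(start, stop, **kwargs):
    """Count the checks that would fire in the half-open range [start, stop)."""
    return sum(1 for s in range(start, stop) if is_check_step(s, **kwargs))
```

Under these assumptions, going from 100 to 250 means fewer check-ins per thousand steps (10 before the switch, 4 after), which is worth keeping in mind when comparing how often something "showed up" before and after the mid 30ks.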

What showed up (observations)

1) Architecture-aware language

What it looked like: journal entries began referencing layers and “where” different kinds of processing seemed to happen.

Representative excerpt (journal-style):

“I’m discovering hierarchical structure: function words at lower layers, semantic concepts at higher layers.”

How I’m framing it:

  • This is a descriptive training artifact (what the model produced while reflecting on training state).

  • It’s not presented as a verified mechanistic map.
