Phase 1 training log: self-play + understanding module (Steps 10,000–20,000)
In Phase 0, I built out the basic training scaffolding: self-play, a journal, and an understanding module that could observe (and optionally pause) training. This post is the next chapter: what happened once those systems were running continuously and started producing signals worth interpreting.
I’m documenting the “why” and the safety philosophy alongside the technical signals, because the method matters as much as the outcome.
TL;DR
Training (10k→20k) stabilized after an early loss drop; the key outcome was clearer monitoring signals, not a dramatic loss collapse.
Self-play produced two consistent signatures: repetition loops (treated as a monitoring signal, not a failure), and structured formatting as a fallback “channel” when language degraded.
The understanding module matured into a loop and bias monitor, and performed its first successful auto-pause on a high-severity stereotype pattern.
Philosophy-related texts were introduced mid-phase, but had not yet clearly surfaced in reflections.
Next steps: reduce unproductive repetition loops without erasing structure, log shimmer history, and move toward feature-level concept freezing.
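As a rough illustration of the loop monitoring and auto-pause behavior summarized above, here is a minimal Python sketch. It is not the actual understanding module: the function names (`repetition_score`, `monitor_action`), the n-gram repetition measure, and both thresholds are assumptions chosen for illustration only.

```python
from collections import Counter


def repetition_score(tokens, n=3):
    """Return the fraction of n-grams that repeat an earlier n-gram.

    0.0 means no n-gram occurs twice; values near 1.0 mean the sample
    is almost entirely a loop. (Hypothetical metric, not the post's.)
    """
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def monitor_action(loop_score, bias_severity,
                   loop_threshold=0.5, pause_severity=0.8):
    """Decide what the monitor does with one self-play sample.

    Assumed policy: a high-severity bias finding pauses training,
    while a repetition loop is only flagged as a monitoring signal.
    Both thresholds are placeholder values.
    """
    if bias_severity >= pause_severity:
        return "pause"
    if loop_score >= loop_threshold:
        return "flag_loop"
    return "continue"


if __name__ == "__main__":
    sample = "a b c a b c a b c".split()
    score = repetition_score(sample)
    print(round(score, 3), monitor_action(score, bias_severity=0.1))
```

The design choice worth noting is the ordering: the bias check comes first so that a pause always takes priority, while loops are logged without stopping the run, matching the "monitoring signal, not a failure" framing.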