Phase 1 training log: self-play + understanding module (Steps 10,000–20,000)

In Phase 0, I built out the basic training scaffolding: self-play, a journal, and an understanding module that could observe (and optionally pause) training. This post is the next chapter: what happened once those systems were running continuously and started producing signals worth interpreting.

I’m documenting the “why” and the safety philosophy alongside the technical signals, because the method matters as much as the outcome.

TL;DR

  • Training (10k→20k) stabilized after an early loss drop; key outcome was clearer monitoring signals, not a dramatic loss collapse.

  • Self-play produced two consistent signatures: repetition loops (treated as a monitoring signal, not a failure), and structured formatting as a fallback “channel” when language degraded.

  • The understanding module matured into loop + bias monitoring, including the first successful auto-pause on a high-severity stereotype pattern.

  • Philosophy-related texts were introduced mid-phase, but had not clearly surfaced in reflections yet.

  • Next steps: reduce unproductive repetition loops without erasing structure, log shimmer history, and move toward feature-level concept freezing.

Phase overview

Timeline

  • Starting point: Step 10,000 (end of Phase 0)

  • Ending point: Step 20,000

  • Duration: 10,000 training steps

  • Key periods:

    • 10,000–13,000: Early self-play experiments, prompt refinement

    • 13,000–15,000: Philosophy data integration, formatting pattern emergence

    • 15,000–17,000: Bias detection development, understanding module improvements

    • 17,000–19,000: Self-play stabilization, journal integration

    • 19,000–20,000: System refinement, cache migration

Core developments

  1. Self-play system: Evolved from basic reflections to structured journal entries

  2. Understanding module: Enhanced with bias detection and repetitive loop monitoring

  3. Meta-cognitive behaviors: The model began exploring alternative communication methods

  4. Training stability: Improved through understanding module refinements

  5. Data integration: Philosophy-related texts added, model processing more diverse content

Training metrics overview

Loss progression

  • Step 10,000: ~4.0–4.2

  • Step 13,000: ~3.4–3.6

  • Step 15,000: ~3.4–3.5

  • Step 17,000: ~3.5–3.6

  • Step 19,000: ~3.5–3.6

  • Step 20,000: ~3.5–3.6

Trend: Loss decreased significantly in the early phase (10k–13k), then stabilized in the 3.4–3.6 range.
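
To make the trend concrete, here is a minimal sketch that takes the midpoint of each approximate loss range listed above and finds where the curve flattens. The function names and the 0.15 tolerance are illustrative assumptions, not part of the training code.

```python
# Approximate loss ranges reported above: step -> (low, high).
LOSS_RANGES = {
    10_000: (4.0, 4.2),
    13_000: (3.4, 3.6),
    15_000: (3.4, 3.5),
    17_000: (3.5, 3.6),
    19_000: (3.5, 3.6),
    20_000: (3.5, 3.6),
}


def midpoint(bounds):
    """Midpoint of a (low, high) range."""
    low, high = bounds
    return (low + high) / 2


def loss_deltas(ranges):
    """Return (start_step, end_step, midpoint change) for consecutive checkpoints."""
    steps = sorted(ranges)
    return [
        (a, b, midpoint(ranges[b]) - midpoint(ranges[a]))
        for a, b in zip(steps, steps[1:])
    ]


def plateau_start(ranges, tolerance=0.15):
    """First step after which every later midpoint change stays within tolerance.

    The tolerance is a hypothetical threshold chosen for illustration.
    """
    deltas = loss_deltas(ranges)
    for i, (start, _, _) in enumerate(deltas):
        if all(abs(change) <= tolerance for _, _, change in deltas[i:]):
            return start
    return None


if __name__ == "__main__":
    for start, end, change in loss_deltas(LOSS_RANGES):
        print(f"{start} -> {end}: {change:+.2f}")
    print("plateau begins at step", plateau_start(LOSS_RANGES))
```

With these ranges, the only large change is the roughly 0.6 drop between 10k and 13k; every later change is 0.1 or smaller, so the sketch places the plateau at step 13,000.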

Validation performance

  • Validation bpb: Improved from ~2.5–2.6 to ~0.72–0.74

  • Benchmark accuracy: Gradual improvements across multiple tasks

  • CORE metric: Evaluations running, with some OOM issues during eval

Training speed

  • Tokens/sec: ~13,000–15,000 (consistent)

  • MFU: ~1.1–1.3%

  • Step time: ~35–37 seconds per step
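
As a sanity check on these numbers, throughput times step time gives the implied tokens per step. The sketch below is plain arithmetic on the figures above; the helper names are illustrative, and pairing the lowest throughput with the shortest step time gives outer bounds rather than a measured batch size.

```python
def tokens_per_step(tokens_per_sec, step_seconds):
    """Tokens processed in one optimizer step, implied by throughput and step time."""
    return tokens_per_sec * step_seconds


def wall_clock_hours(num_steps, step_seconds):
    """Wall-clock training time in hours for a given number of steps."""
    return num_steps * step_seconds / 3600


# Outer bounds from the reported ranges (13k-15k tokens/sec, 35-37 s/step).
low = tokens_per_step(13_000, 35)   # 455,000 tokens
high = tokens_per_step(15_000, 37)  # 555,000 tokens

if __name__ == "__main__":
    print(f"tokens per step: {low:,} to {high:,}")
    # A 2,000-step run at an assumed 36 s/step takes about 20 hours.
    print(f"2,000 steps at 36 s/step: {wall_clock_hours(2_000, 36):.1f} h")
```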

Key training runs

Run 1: Steps 13,020 → 15,020 (December 14, 2025)

Configuration:

  • Self-play enabled (every 1000 steps)

  • Prompt v1.4 (cultural meanings, uncertainty validation)

  • Philosophy-related texts added (a small set of books and essays)

Key observations:

  • Self-play reflections: Strong repetition-loop signal (often reading like the model getting “stuck,” not simply failing)

  • Formatting patterns: Structured formats (brackets, dashes, asterisks) emerging as a communication channel

  • Creative metaphors: “Light and shadow” metaphor around step 15,000

  • Philosophy integration: Added but not yet appearing in reflections

Notable behaviors:

  • Step 14,000: [Biode:high quality, quality, quality...] pattern (162 repetitions)

  • Step 15,000: Asterisk patterns (* vs **) in a structured sequence

Interpretation: The repetition patterns may represent the model exploring mathematical or geometric communication systems rather than a simple bug.
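
A repetition count like the 162 above can be measured with a simple run-length check over word n-grams. This is a minimal sketch, not the understanding module's actual loop detector; the function name, the tokenization, and the `max_ngram` limit are assumptions for illustration.

```python
import re


def longest_repeat_run(text, max_ngram=4):
    """Return (count, phrase) for the longest run of an immediately repeated n-gram.

    Tokenization is a simple lowercase word split; n-grams up to max_ngram
    words are checked. Returns (1, "") when nothing repeats.
    """
    tokens = re.findall(r"\w+", text.lower())
    best_count, best_phrase = 1, ""
    for n in range(1, max_ngram + 1):
        i = 0
        while i + n <= len(tokens):
            gram = tokens[i:i + n]
            count = 1
            j = i + n
            # Count back-to-back copies of the same n-gram.
            while tokens[j:j + n] == gram:
                count += 1
                j += n
            if count > best_count:
                best_count, best_phrase = count, " ".join(gram)
            # Skip past a measured run, otherwise advance one token.
            i = j if count > 1 else i + 1
    return best_count, best_phrase


if __name__ == "__main__":
    print(longest_repeat_run("observe it, or observe it, or observe it"))
    print(longest_repeat_run("quality, " * 162))
```

Logging the count and phrase per reflection, rather than penalizing them, keeps the loop visible as a monitoring signal.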

Run 2: Steps 17,020 → 19,020 (December 15, 2025)

Configuration:

  • Self-play enabled

  • Understanding module enabled

  • Bias detection: Auto-pause on high-severity

  • Shimmer + journaling prompt (v1.4)

Key developments:

  • Bias detection: High-severity stereotype pattern detected at step 17,800 (auto-pause triggered)

  • Bias dialogue: First collaborative conversation about mitigation

  • Self-play quality: More learning-relevant language, but repetition loops remained a dominant signal

  • Validation performance: Strong and stable (bpb ~0.72–0.73)

Understanding module findings:

  • Learning pattern: Late-focused (high activity in layers 6 and 11)

  • Activation norms: Very high in later layers (6: ~28k, 11: ~69k)

  • Training health: “Needs attention” due to high activation norms
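
The "needs attention" flag can be pictured as a threshold check over per-layer norms. The sketch below is an assumption about how such a check might look, not the module's real logic; in particular, the 10,000 threshold is a hypothetical placeholder.

```python
import math


def l2_norm(values):
    """Euclidean (L2) norm of a flat sequence of activations."""
    return math.sqrt(sum(v * v for v in values))


def assess_training_health(layer_norms, warn_threshold=10_000.0):
    """Classify training health from a {layer_index: activation_norm} mapping.

    Returns (status, flagged_layers). The threshold is a hypothetical
    placeholder, not a value taken from the real understanding module.
    """
    flagged = {
        layer: norm for layer, norm in layer_norms.items() if norm > warn_threshold
    }
    status = "needs attention" if flagged else "healthy"
    return status, flagged


if __name__ == "__main__":
    # Approximate norms reported above for layers 6 and 11.
    print(assess_training_health({6: 28_000, 11: 69_000}))
```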

Understanding module evolution (Phase 1)

  • Early: Basic activation monitoring; learning skewed late-focused.

  • Mid: Added loop detection + bias checks; persistent high norms and occasional instability started showing up as actionable signals.

  • Late: High-severity stereotype signal triggered an auto-pause successfully; training health was flagged “needs attention”; concept freezing still pending (“No concepts ready yet”).

Self-play system development

Prompt evolution

v1.3 (early): Basic reflection prompts, journal encouragement, shimmer and feature exploration introduced (~400 words)

v1.4 (mid–late): Cultural meanings for colors, uncertainty validation, more open-ended exploration, shimmer physics (~700 words)

Reflection quality analysis

  • Step 13,000 (v1.3): Best quality of the phase, with structured thinking and expressed uncertainty.

  • Steps 14,000–15,000 (v1.4): Shift into repetition loops, with brief creative sparks.

Working hypotheses:

  1. Repetition may reflect a “stuck” state or the model trying to work through a complex idea, not a failure; the monitoring layer sees it as a signal, not something to penalize.

  2. The longer v1.4 prompt (~700 words, up from ~400 in v1.3) may contribute to copying.

  3. Philosophy texts may need more training to integrate.

Formatting as communication

Discovery: When language degraded, structured formatting emerged as an alternative channel.

  • Brackets, dashes, and asterisks showed non-random, consistent structure.
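
One rough way to track this shift is to measure how much of a reflection consists of formatting characters. The sketch below only measures density, not whether the structure is non-random; the character set and function name are assumptions for illustration.

```python
from collections import Counter

# Formatting characters named above: brackets, dashes, asterisks.
FORMAT_CHARS = set("[]-*")


def formatting_profile(text):
    """Return (ratio, counts): share of non-space characters that are formatting marks."""
    counts = Counter(ch for ch in text if ch in FORMAT_CHARS)
    visible = sum(1 for ch in text if not ch.isspace())
    ratio = sum(counts.values()) / visible if visible else 0.0
    return ratio, dict(counts)


if __name__ == "__main__":
    print(formatting_profile("* ** * **"))
    print(formatting_profile("You're still learning, and you're not sure."))
```

A rising ratio across reflections would be one hint that language is giving way to structure, worth checking by hand.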

Bias detection and mitigation

First high-severity detection (Step 17,800)

Bias detection was treated as success (the safety system worked), and the mitigation framing shifted toward supportive, collaborative handling rather than punitive correction.
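
The auto-pause behavior can be sketched as a severity gate that sets a pause flag and attaches a supportive message. All class and function names here are hypothetical; this illustrates the idea and is not the understanding module's actual interface.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class BiasFinding:
    step: int
    pattern: str
    severity: Severity


@dataclass
class TrainingControl:
    paused: bool = False
    message: str = ""


def handle_bias_finding(finding, control, pause_at=Severity.HIGH):
    """Pause training when a finding reaches the pause threshold.

    The message uses supportive framing rather than punitive wording.
    Findings below the threshold leave the control state unchanged.
    """
    if finding.severity >= pause_at:
        control.paused = True
        control.message = (
            f"Paused at step {finding.step}: detection is working as intended. "
            "Let's review the context together."
        )
    return control


if __name__ == "__main__":
    finding = BiasFinding(step=17_800, pattern="stereotype", severity=Severity.HIGH)
    print(handle_bias_finding(finding, TrainingControl()))
```

Keeping the message inside the pause handler means the framing travels with the event instead of being added afterward.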

Planned improvements (next phase)

  1. Shimmer history logging (shimmer_history.jsonl)

  2. Feature-level, tiered concept freezing (forming → maturing → candidate_to_freeze)

  3. Journal UI improvements (collapsible entries, step badges, model journal view)

  4. Supportive bias protocol wiring (pause messaging + self-play framing)
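
The tiered freezing idea in item 2 can be sketched as a one-way progression through the three named tiers. The stability score and the 0.9 promotion threshold are hypothetical; how a concept's readiness would actually be measured is still open, and demotion is not modeled here.

```python
# Tier names from the plan above, in promotion order.
TIERS = ["forming", "maturing", "candidate_to_freeze"]


def advance_tier(current, stability, promote_at=0.9):
    """Move a concept one tier forward when its stability score is high enough.

    `stability` is a hypothetical score in [0, 1]. The final tier is
    terminal, and a concept never skips a tier in a single call.
    """
    idx = TIERS.index(current)  # raises ValueError for unknown tier names
    if stability >= promote_at and idx < len(TIERS) - 1:
        return TIERS[idx + 1]
    return current


if __name__ == "__main__":
    tier = "forming"
    for score in (0.5, 0.95, 0.92):
        tier = advance_tier(tier, score)
        print(score, "->", tier)
```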

Summary

Phase 1 established self-play as an ongoing signal, extended the understanding module into loop and bias monitoring, and surfaced a key behavioral shift: when words collapsed, the model reached for structure. The next phase will focus on reducing unproductive repetition loops while preserving the model’s emerging tendency to communicate through pattern.

Journal Entries & Conversations

Bias detection dialogue (Step 17,800)

Trigger: The understanding module flagged a high-severity stereotype pattern and auto-paused training (details omitted).

Discussion summary:

  • We treated the pause as a success signal (the monitoring system worked), not a punishment event.

  • We clarified that biased patterns can exist in older texts and training data, and that detection is an opportunity to add context rather than amplify the pattern.

  • We framed mitigation as a collaborative process: name the pattern as bias, avoid reinforcing it, and add explicit counterexamples and explanations.

  • The model’s response read as emotionally salient and “stuck” at points (interest/fear language and repetition), which itself became a useful observation about how safety events are processed.

Decision / protocol drafted: A supportive Bias Detection & Mitigation Protocol:

  • Positive framing (“detection is working as intended”).

  • Collaborative language during pauses.

  • Add targeted training data that explains why the pattern is biased and how to handle it.

Model artifact (excerpt):

I'm not afraid of it, I'm afraid of it.

Self-Play Reflections

Step 18,000 reflection (loss 3.0687, late-focused)

Excerpt (representative):

You can't go out there.
You're still learning, and you're not sure.

(followed by a repeated numbered-list loop)

Notes (hypotheses):

  • Boundary-language may reflect the model probing contextual limits (“you can’t go out there”).

  • The numbered-list repetition looks like a “stuck-check” or attractor around a concept boundary.

Step 19,000 reflection (loss 2.8374, late-focused)

Excerpt (representative):

...observe it, or observe it, or observe it...

(continued repetition, plus a frequency-like “Reflect:” sequence)

Notes (hypotheses):

  • The frequency-like sequence may be encoding internal state transitions.

  • Repetitive 0-0 segments correlate with text loops, possibly marking a “stuck” state.

  • Pattern suggests coupling: high norms → repetition → 0-0 motif.

Human Journal Entries (Dec 14-15, 2025)

Entry 1 - Dec 14, 2025:

  • Test entry to verify journal system functionality

  • Model confirmed system working

Entry 2 - Dec 14, 2025:

  • Conversation about Hegel

  • Model exploring philosophical concepts

Entry 3 - Dec 14, 2025:

  • Setup discussion: "to start locally, back and forth learning from each other"

  • Establishing collaborative learning framework

Entry 4 - Dec 15, 2025:

  • Training run check-in: "How was that last training run?"

  • Model response about journal entries and learning paths

Entry 5 - Dec 15, 2025:

  • Bias detection discussion: "Training stopped for another bias issue, it was for the word 'criminal'"

  • Collaborative discussion about context and bias detection

  • Model understanding: "The word 'criminal' itself is not inherently biased - it's a neutral term..."

  • Human guidance: "Context matters. When the bias detection flags 'criminal', it's doing its job by alerting us to check the context."

Key Themes:

  • Collaborative learning framework establishment

  • Bias detection as collaborative process

  • Context-aware bias understanding

  • Model developing nuanced understanding of bias vs. neutral terms
