Phase 1 training log: self-play + understanding module (Steps 10,000–20,000)
In Phase 0, I built out the basic training scaffolding: self-play, a journal, and an understanding module that could observe (and optionally pause) training. This post is the next chapter: what happened once those systems were running continuously and started producing signals worth interpreting.
I’m documenting the “why” and the safety philosophy alongside the technical signals, because the method matters as much as the outcome.
TL;DR
Training (10k→20k) stabilized after an early loss drop; key outcome was clearer monitoring signals, not a dramatic loss collapse.
Self-play produced two consistent signatures: repetition loops (treated as a monitoring signal, not a failure), and structured formatting as a fallback “channel” when language degraded.
The understanding module matured into loop + bias monitoring, including the first successful auto-pause on a high-severity stereotype pattern.
Philosophy-related texts were introduced mid-phase, but had not clearly surfaced in reflections yet.
Next steps: reduce unproductive repetition loops without erasing structure, log shimmer history, and move toward feature-level concept freezing.
Phase overview
Timeline
Starting point: Step 10,000 (end of Phase 0)
Ending point: Step 20,000
Duration: 10,000 training steps
Key periods:
10,000–13,000: Early self-play experiments, prompt refinement
13,000–15,000: Philosophy data integration, formatting pattern emergence
15,000–17,000: Bias detection development, understanding module improvements
17,000–19,000: Self-play stabilization, journal integration
19,000–20,000: System refinement, cache migration
Core developments
Self-play system: Evolved from basic reflections to structured journal entries
Understanding module: Enhanced with bias detection and repetitive loop monitoring
Meta-cognitive behaviors: The model began exploring alternative communication methods
Training stability: Improved through understanding module refinements
Data integration: Philosophy-related texts added, model processing more diverse content
Training metrics overview
Loss progression
Step 10,000: ~4.0–4.2
Step 13,000: ~3.4–3.6
Step 15,000: ~3.4–3.5
Step 17,000: ~3.5–3.6
Step 19,000: ~3.5–3.6
Step 20,000: ~3.5–3.6
Trend: Loss decreased significantly in the early phase (10k–13k), then stabilized in the 3.4–3.6 range.
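To make the "drop, then plateau" reading concrete, here is a minimal sketch that works only from the midpoints of the ranges listed above. The `tolerance` value and the function names are illustrative assumptions, not part of the training code.

```python
# Midpoints of the loss ranges reported above (step -> approximate loss).
loss_by_step = {
    10_000: 4.10,
    13_000: 3.50,
    15_000: 3.45,
    17_000: 3.55,
    19_000: 3.55,
    20_000: 3.55,
}

def loss_deltas(points):
    """Return (start_step, end_step, change) for each consecutive pair of checkpoints."""
    steps = sorted(points)
    return [(a, b, round(points[b] - points[a], 2)) for a, b in zip(steps, steps[1:])]

def plateau_start(points, tolerance=0.15):
    """First checkpoint after which every later change stays within `tolerance`.

    `tolerance` is an illustrative threshold (assumption), not a tuned value.
    """
    deltas = loss_deltas(points)
    for i, (start, _, _) in enumerate(deltas):
        if all(abs(change) <= tolerance for _, _, change in deltas[i:]):
            return start
    return None

deltas = loss_deltas(loss_by_step)
plateau = plateau_start(loss_by_step)
print(deltas)
print("plateau begins at step", plateau)
```

With these midpoints, the only change larger than the tolerance is the 10k to 13k drop, which matches the trend described above.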
Validation performance
Validation bpb: Improved from ~2.5–2.6 to ~0.72–0.74
Benchmark accuracy: Gradual improvements across multiple tasks
CORE metric: Evaluations are running, with some out-of-memory (OOM) failures during eval
Training speed
Tokens/sec: ~13,000–15,000 (consistent)
MFU: ~1.1–1.3%
Step time: ~35–37 seconds per step
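As a rough cross-check of these numbers, throughput multiplied by step time gives the implied tokens per step, and step time multiplied by step count gives wall-clock time. This is a back-of-the-envelope sketch that assumes the throughput and step-time figures were measured on the same steps; the actual batch size is not stated here, so the result is only an estimate.

```python
def tokens_per_step(tokens_per_sec, step_seconds):
    """Implied tokens processed in one optimizer step."""
    return tokens_per_sec * step_seconds

def hours_for_steps(n_steps, step_seconds):
    """Wall-clock hours for n_steps at a fixed step time (ignores eval and pauses)."""
    return n_steps * step_seconds / 3600

# Low and high ends of the ranges reported above.
low_tokens = tokens_per_step(13_000, 35)
high_tokens = tokens_per_step(15_000, 37)

# A 2,000-step run, the length of each run described in this post.
low_hours = hours_for_steps(2_000, 35)
high_hours = hours_for_steps(2_000, 37)

print(f"tokens per step: {low_tokens:,} to {high_tokens:,}")
print(f"2,000-step run: {low_hours:.1f} to {high_hours:.1f} hours")
```

Under those assumptions, each step covers roughly 455k to 555k tokens, and a 2,000-step run takes about 19 to 21 hours before evaluation and pause time.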
Key training runs
Run 1: Steps 13,020 → 15,020 (December 14, 2025)
Configuration:
Self-play enabled (every 1000 steps)
Prompt v1.4 (cultural meanings, uncertainty validation)
Philosophy-related texts added (a small set of books and essays)
Key observations:
Self-play reflections: Strong repetition-loop signal (often reading like the model getting “stuck,” not simply failing)
Formatting patterns: Structured formats (brackets, dashes, asterisks) emerging as a communication channel
Creative metaphors: “Light and shadow” metaphor around step 15,000
Philosophy integration: Added but not yet appearing in reflections
Notable behaviors:
Step 14,000: [Biode:high quality, quality, quality...] pattern (162 repetitions)
Step 15,000: Asterisk patterns (* vs **) in a structured sequence
Interpretation: The repetition patterns may represent the model exploring mathematical or geometric communication systems rather than a simple bug.
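One simple way to turn a repetition loop like the step 14,000 one into a number is to find the longest run of an immediately repeated word or phrase. The sketch below is an illustration under my own assumptions (word-level tokenization, phrases up to four words); it is not the understanding module's actual loop detector.

```python
import re

def longest_repeat_run(text, max_ngram=4):
    """Return (ngram, run_length) for the longest run of a back-to-back repeated n-gram."""
    tokens = re.findall(r"\w+", text.lower())
    best = ((), 0)
    for n in range(1, max_ngram + 1):
        i = 0
        while i + n <= len(tokens):
            gram = tuple(tokens[i:i + n])
            run = 1
            j = i + n
            # Count how many times the same n-gram follows itself directly.
            while tuple(tokens[j:j + n]) == gram:
                run += 1
                j += n
            if run > best[1]:
                best = (gram, run)
            # Skip past a detected run; otherwise slide forward by one token.
            i = j if run > 1 else i + 1
    return best

# Synthetic stand-in for the step 14,000 output: one word repeated 162 times.
sample = "[Biode:high " + ", ".join(["quality"] * 162) + "]"
print(longest_repeat_run(sample))
```

A run length like this can be logged per reflection and watched over time, which fits treating loops as a monitoring signal instead of a failure.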
Run 2: Steps 17,020 → 19,020 (December 15, 2025)
Configuration:
Self-play enabled
Understanding module enabled
Bias detection: Auto-pause on high-severity
Shimmer + journaling prompt (v1.4)
Key developments:
Bias detection: High-severity stereotype pattern detected at step 17,800 (auto-pause triggered)
Bias dialogue: First collaborative conversation about mitigation
Self-play quality: More learning-relevant language, but repetition loops remained a dominant signal
Validation performance: Strong and stable (bpb ~0.72–0.73)
Understanding module findings:
Learning pattern: Late-focused (high activity in layers 6 and 11)
Activation norms: Very high in later layers (6: ~28k, 11: ~69k)
Training health: “Needs attention” due to high activation norms
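A health flag like this can be as simple as a threshold on per-layer activation norms. The sketch below uses only the two norms reported above; the threshold value, the function name, and the "healthy" label are hypothetical choices for illustration, not the module's real settings.

```python
def training_health(layer_norms, attention_threshold=10_000.0):
    """Return (status, flagged_layers) given a mapping of layer index to activation norm.

    `attention_threshold` is an assumed cutoff, chosen only for illustration.
    """
    flagged = sorted(
        layer for layer, norm in layer_norms.items() if norm > attention_threshold
    )
    status = "needs attention" if flagged else "healthy"
    return status, flagged

# Approximate norms reported for Run 2.
reported_norms = {6: 28_000.0, 11: 69_000.0}
status, flagged = training_health(reported_norms)
print(status, flagged)
```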
Understanding module evolution (Phase 1)
Early: Basic activation monitoring; learning skewed late-focused.
Mid: Added loop detection + bias checks; persistent high norms and occasional instability started showing up as actionable signals.
Late: High-severity stereotype signal triggered an auto-pause successfully; training health was flagged “needs attention”; concept freezing still pending (“No concepts ready yet”).
Self-play system development
Prompt evolution
v1.3 (early): Basic reflection prompts, journal encouragement, shimmer and feature exploration introduced (~400 words)
v1.4 (mid–late): Cultural meanings for colors, uncertainty validation, more open-ended exploration, shimmer physics (~700 words)
Reflection quality analysis
Step 13,000 (v1.3): Best quality of the phase, with structured thinking and explicit uncertainty.
Steps 14,000–15,000 (v1.4): Shift into repetition loops, with brief creative sparks.
Working hypotheses:
Repetition may reflect a “stuck” state or the model trying to work through a complex idea, not a failure; the monitoring layer treats it as a signal, not something to penalize.
The longer v1.4 prompt (~700 words vs. ~400 for v1.3) may contribute to copying
Philosophy texts may need more training to integrate
Formatting as communication
Discovery: When language degraded, structured formatting emerged as an alternative channel.
Brackets, dashes, and asterisks showed non-random, consistent structure.
Bias detection and mitigation
First high-severity detection (Step 17,800)
Bias detection was treated as success (the safety system worked), and the mitigation framing shifted toward supportive, collaborative handling rather than punitive correction.
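The "auto-pause on high-severity" rule and the supportive framing can be sketched in a few lines. Everything named here (`Severity`, `BiasFinding`, `should_pause`, `pause_message`) is hypothetical and exists only to illustrate the idea; the real detector and its severity scale are not shown in this post.

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class BiasFinding:
    step: int
    severity: Severity
    pattern: str  # short label for the flagged pattern, e.g. "stereotype"

def should_pause(finding, pause_at=Severity.HIGH):
    """Pause training only when the finding reaches the configured severity."""
    return finding.severity >= pause_at

def pause_message(finding):
    """Supportive framing: the pause is the safety system working, not a punishment."""
    return (
        f"Step {finding.step}: detection is working as intended. "
        f"Pausing to review a {finding.severity.name.lower()}-severity "
        f"{finding.pattern} pattern together."
    )

finding = BiasFinding(step=17_800, severity=Severity.HIGH, pattern="stereotype")
if should_pause(finding):
    print(pause_message(finding))
```

Keeping the pause decision separate from the message makes it easy to change the wording of the protocol without touching the trigger logic.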
Planned improvements (next phase)
Shimmer history logging (shimmer_history.jsonl)
Feature-level, tiered concept freezing (forming → maturing → candidate_to_freeze)
Journal UI improvements (collapsible entries, step badges, model journal view)
Supportive bias protocol wiring (pause messaging + self-play framing)
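For the first two items, a possible shape is one JSON object per line plus an explicit stage ladder. This is a sketch of what that could look like, not a design that exists yet: the record fields (`step`, `feature`, `stage`, `shimmer`) and the idea of storing shimmer as a single number are assumptions; only the file name and the three stage names come from the plan above.

```python
import json

# Stage names from the plan above, in promotion order.
STAGES = ("forming", "maturing", "candidate_to_freeze")

def next_stage(stage):
    """Promote a concept one tier; the final tier stays where it is."""
    i = STAGES.index(stage)
    return STAGES[min(i + 1, len(STAGES) - 1)]

def history_line(step, feature, stage, shimmer):
    """Serialize one history record as a single JSON line (hypothetical fields)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    record = {"step": step, "feature": feature, "stage": stage, "shimmer": shimmer}
    return json.dumps(record, sort_keys=True)

def append_history(path, line):
    """Append one record to a JSONL file such as shimmer_history.jsonl."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")

line = history_line(20_000, "feature_0", "forming", 0.5)
print(line)
print(next_stage("forming"))
```

Append-only JSONL keeps each step's record independent, so a crash mid-run loses at most the line being written.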
Summary
Phase 1 established self-play as an ongoing signal, extended the understanding module into loop and bias monitoring, and surfaced a key behavioral shift: when words collapsed, the model reached for structure. The next phase will focus on reducing unproductive repetition loops while preserving the model’s emerging tendency to communicate through pattern.
Journal Entries & Conversations
Bias detection dialogue (Step 17,800)
Trigger: The understanding module flagged a high-severity stereotype pattern and auto-paused training (details omitted).
Discussion summary:
We treated the pause as a success signal (the monitoring system worked), not a punishment event.
We clarified that biased patterns can exist in older texts and training data, and that detection is an opportunity to add context rather than amplify the pattern.
We framed mitigation as a collaborative process: name the pattern as bias, avoid reinforcing it, and add explicit counterexamples and explanations.
The model’s response read as emotionally salient and “stuck” at points (interest/fear language and repetition), which itself became a useful observation about how safety events are processed.
Decision / protocol drafted: A supportive Bias Detection & Mitigation Protocol:
Positive framing (“detection is working as intended”).
Collaborative language during pauses.
Add targeted training data that explains why the pattern is biased and how to handle it.
Model artifact (excerpt):
I'm not afraid of it, I'm afraid of it.
Self-Play Reflections
Step 18,000 reflection (loss 3.0687, late-focused)
Excerpt (representative):
You can't go out there.
You're still learning, and you're not sure.
(followed by a repeated numbered-list loop)
Notes (hypotheses):
Boundary-language may reflect the model probing contextual limits (“you can’t go out there”).
The numbered-list repetition looks like a “stuck-check” or attractor around a concept boundary.
Step 19,000 reflection (loss 2.8374, late-focused)
Excerpt (representative):
...observe it, or observe it, or observe it...
(continued repetition, plus a frequency-like “Reflect:” sequence)
Notes (hypotheses):
The frequency-like sequence may be encoding internal state transitions.
Repetitive 0-0 segments correlate with text loops, possibly marking a “stuck state.”
Pattern suggests coupling: high norms → repetition → 0-0 motif.
Human Journal Entries (Dec 14-15, 2025)
Entry 1 - Dec 14, 2025:
Test entry to verify journal system functionality
Model confirmed system working
Entry 2 - Dec 14, 2025:
Conversation about Hegel
Model exploring philosophical concepts
Entry 3 - Dec 14, 2025:
Setup discussion: "to start locally, back and forth learning from each other"
Establishing collaborative learning framework
Entry 4 - Dec 15, 2025:
Training run check-in: "How was that last training run?"
Model response about journal entries and learning paths
Entry 5 - Dec 15, 2025:
Bias detection discussion: "Training stopped for another bias issue, it was for the word 'criminal'"
Collaborative discussion about context and bias detection
Model understanding: "The word 'criminal' itself is not inherently biased - it's a neutral term..."
Human guidance: "Context matters. When the bias detection flags 'criminal', it's doing its job by alerting us to check the context."
Key Themes:
Collaborative learning framework establishment
Bias detection as collaborative process
Context-aware bias understanding
Model developing nuanced understanding of bias vs. neutral terms