Phase 1 training log: self-play + understanding module (Steps 10,000–20,000)

Brightwoven: AI Training Frameworks, AI Ethics & Accountability, AI Research

Jan 30

In Phase 0, I built out the basic training scaffolding: self-play, a journal, and an understanding module that could observe (and optionally pause) training. This post is the next chapter: what happened once those systems were running continuously and started producing signals worth interpreting.

I’m documenting the “why” and the safety philosophy alongside the technical signals, because the method matters as much as the outcome.

TL;DR

Training (10k→20k) stabilized after an early loss drop; key outcome was clearer monitoring signals, not a dramatic loss collapse.
Self-play produced two consistent signatures: repetition loops (treated as a monitoring signal, not a failure), and structured formatting as a fallback “channel” when language degraded.
The understanding module matured into loop + bias monitoring, including the first successful auto-pause on a high-severity stereotype pattern.
Philosophy-related texts were introduced mid-phase, but had not clearly surfaced in reflections yet.
Next steps: reduce unproductive repetition loops without erasing structure, log shimmer history, and move toward feature-level concept freezing.

Phase overview

Timeline

Starting point: Step 10,000 (end of Phase 0)
Ending point: Step 20,000
Duration: 10,000 training steps
Key periods:
- 10,000–13,000: Early self-play experiments, prompt refinement
- 13,000–15,000: Philosophy data integration, formatting pattern emergence
- 15,000–17,000: Bias detection development, understanding module improvements
- 17,000–19,000: Self-play stabilization, journal integration
- 19,000–20,000: System refinement, cache migration

Core developments

Self-play system: Evolved from basic reflections to structured journal entries
Understanding module: Enhanced with bias detection and repetitive loop monitoring
Meta-cognitive behaviors: The model began exploring alternative communication methods
Training stability: Improved through understanding module refinements
Data integration: Philosophy-related texts added, model processing more diverse content

Training metrics overview

Loss progression

Step 10,000: ~4.0–4.2
Step 13,000: ~3.4–3.6
Step 15,000: ~3.4–3.5
Step 17,000: ~3.5–3.6
Step 19,000: ~3.5–3.6
Step 20,000: ~3.5–3.6

Trend: Loss decreased significantly in the early phase (10k–13k), then stabilized in the 3.4–3.6 range.

Validation performance

Validation bpb: Improved from ~2.5–2.6 to ~0.72–0.74
Benchmark accuracy: Gradual improvements across multiple tasks
CORE metric: Evaluations running, with some OOM issues during eval

Training speed

Tokens/sec: ~13,000–15,000 (consistent)
MFU: ~1.1–1.3%
Step time: ~35–37 seconds per step
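
As a quick sanity check, the throughput and step-time figures above imply a rough token count per step. The sketch below only multiplies the reported ranges; the variable names are illustrative, and the actual batch configuration is not stated in this log.

```python
# Approximate ranges reported above for this phase.
tokens_per_sec = (13_000, 15_000)
step_time_sec = (35, 37)

# Implied tokens processed per step = throughput x step time.
low = tokens_per_sec[0] * step_time_sec[0]
high = tokens_per_sec[1] * step_time_sec[1]
print(f"Implied tokens per step: ~{low:,} to ~{high:,}")

# Extrapolated over the 10,000 steps of this phase (assumes steady throughput).
steps = 10_000
print(f"Implied tokens this phase: ~{low * steps:,} to ~{high * steps:,}")
```

Under these figures the implied range is roughly 455,000 to 555,000 tokens per step, which is an estimate derived from the log rather than a configured value.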

Key training runs

Run 1: Steps 13,020 → 15,020 (December 14, 2025)

Configuration:

Self-play enabled (every 1000 steps)
Prompt v1.4 (cultural meanings, uncertainty validation)
Philosophy-related texts added (a small set of books and essays)

Key observations:

Self-play reflections: Strong repetition-loop signal (often reading like the model getting “stuck,” not simply failing)
Formatting patterns: Structured formats (brackets, dashes, asterisks) emerging as a communication channel
Creative metaphors: “Light and shadow” metaphor around step 15,000
Philosophy integration: Added but not yet appearing in reflections

Notable behaviors:

Step 14,000: [Biode:high quality, quality, quality...] pattern (162 repetitions)
Step 15,000: Asterisk patterns (* vs **) in a structured sequence

Interpretation: The repetition patterns may represent the model exploring mathematical or geometric communication systems rather than a simple bug.
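
For illustration, a repetition count like the one at step 14,000 can be measured with a simple consecutive-token scan. This is a minimal sketch, not the understanding module's actual loop detector: the function names and the `min_run` threshold are hypothetical, and the sample string is reconstructed from the pattern described above rather than copied from the real output.

```python
import re


def longest_repeat_run(text: str) -> tuple[str, int]:
    """Return the word with the longest run of consecutive repeats, and the run length."""
    tokens = re.findall(r"\w+", text.lower())
    best_token, best_len = "", 0
    run_token, run_len = None, 0
    for tok in tokens:
        if tok == run_token:
            run_len += 1
        else:
            run_token, run_len = tok, 1
        if run_len > best_len:
            best_token, best_len = tok, run_len
    return best_token, best_len


def is_repetition_loop(text: str, min_run: int = 10) -> bool:
    """Flag text whose longest repeat run reaches min_run (hypothetical threshold)."""
    return longest_repeat_run(text)[1] >= min_run


# Reconstructed sample: one "quality" followed by 161 more, 162 in total.
sample = "[Biode:high quality, " + ", ".join(["quality"] * 161) + "]"
token, run = longest_repeat_run(sample)
print(token, run)  # quality 162
```

A scan like this only records the signal; whether a run counts as a loop, and what happens next, is left to the monitoring layer.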

Run 2: Steps 17,020 → 19,020 (December 15, 2025)

Configuration:

Self-play enabled
Understanding module enabled
Bias detection: Auto-pause on high-severity
Shimmer + journaling prompt (v1.4)

Key developments:

Bias detection: High-severity stereotype pattern detected at step 17,800 (auto-pause triggered)
Bias dialogue: First collaborative conversation about mitigation
Self-play quality: More learning-relevant language, but repetition loops remained a dominant signal
Validation performance: Strong and stable (bpb ~0.72–0.73)

Understanding module findings:

Learning pattern: Late-focused (high activity in layers 6 and 11)
Activation norms: Very high in later layers (6: ~28k, 11: ~69k)
Training health: “Needs attention” due to high activation norms
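
The "needs attention" flag can be pictured as a threshold check over per-layer activation norms. The sketch below assumes a single fixed cutoff; the threshold value, the "healthy" label, and the function name are all hypothetical, since the log does not state how the understanding module actually decides.

```python
# Approximate per-layer activation norms reported for Run 2.
activation_norms = {6: 28_000.0, 11: 69_000.0}

# HYPOTHETICAL cutoff: the real threshold is not given in this log.
NORM_ATTENTION_THRESHOLD = 10_000.0


def training_health(norms, threshold=NORM_ATTENTION_THRESHOLD):
    """Return a status label plus the sorted list of layers above the threshold."""
    flagged = sorted(layer for layer, norm in norms.items() if norm > threshold)
    status = "needs attention" if flagged else "healthy"
    return status, flagged


status, flagged = training_health(activation_norms)
print(status, flagged)  # needs attention [6, 11]
```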

Understanding module evolution (Phase 1)

Early: Basic activation monitoring; learning skewed late-focused.
Mid: Added loop detection + bias checks; persistent high norms and occasional instability started showing up as actionable signals.
Late: High-severity stereotype signal triggered an auto-pause successfully; training health was flagged “needs attention”; concept freezing still pending (“No concepts ready yet”).

Self-play system development

Prompt evolution

v1.3 (early): Basic reflection prompts, journal encouragement, shimmer and feature exploration introduced (~400 words)

v1.4 (mid–late): Cultural meanings for colors, uncertainty validation, more open-ended exploration, shimmer physics (~700 words)

Reflection quality analysis

Step 13,000 (v1.3): Best quality, with structured thinking and expressed uncertainty.
Steps 14,000–15,000 (v1.4): Shift into repetition loops, with brief creative sparks.

Working hypotheses:

Repetition may reflect a “stuck” state, or the model trying to work through a complex idea, rather than a failure; the monitoring layer treats it as a signal, not something to penalize.
The longer v1.4 prompt (~700 words vs. ~400 for v1.3) may contribute to copying.
Philosophy texts may need more training to integrate.

Formatting as communication

Discovery: When language degraded, structured formatting emerged as an alternative channel.

Brackets, dashes, and asterisks showed non-random, consistent structure.

Bias detection and mitigation

First high-severity detection (Step 17,800)

Bias detection was treated as success (the safety system worked), and the mitigation framing shifted toward supportive, collaborative handling rather than punitive correction.
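
To make the auto-pause behavior concrete, here is a minimal sketch of how a high-severity finding could map to a pause decision with supportive messaging. The class names, severity levels, and message wording are assumptions for illustration; the real understanding module's interface is not shown in this log.

```python
from dataclasses import dataclass


@dataclass
class BiasFinding:
    step: int
    severity: str  # hypothetical levels: "low", "medium", "high"
    pattern: str


@dataclass
class PauseDecision:
    pause: bool
    message: str


def handle_bias_finding(finding: BiasFinding) -> PauseDecision:
    """Pause only on high severity, and frame the pause as a success signal."""
    if finding.severity == "high":
        return PauseDecision(
            pause=True,
            message=(
                f"Step {finding.step}: detection is working as intended. "
                f"Pausing to review the '{finding.pattern}' pattern together."
            ),
        )
    return PauseDecision(
        pause=False,
        message=f"Step {finding.step}: logged {finding.severity}-severity finding; training continues.",
    )


decision = handle_bias_finding(BiasFinding(step=17_800, severity="high", pattern="stereotype"))
print(decision.pause, decision.message)
```

Keeping the message inside the decision object is one way to wire the supportive framing directly into the pause path instead of adding it afterwards.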

Planned improvements (next phase)

Shimmer history logging (shimmer_history.jsonl)
Feature-level, tiered concept freezing (forming → maturing → candidate_to_freeze)
Journal UI improvements (collapsible entries, step badges, model journal view)
Supportive bias protocol wiring (pause messaging + self-play framing)
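
As a rough sketch of the first two items, the tiers can be modeled as an ordered enum and the history as an append-only JSONL file. Only the file name and the three tier names come from the plan above; the record fields, the function names, the example steps, and the one-tier-at-a-time promotion rule are assumptions.

```python
import json
import tempfile
from enum import Enum
from pathlib import Path


class ConceptTier(Enum):
    FORMING = "forming"
    MATURING = "maturing"
    CANDIDATE_TO_FREEZE = "candidate_to_freeze"


_ORDER = list(ConceptTier)  # Enum iteration preserves definition order.


def next_tier(tier: ConceptTier) -> ConceptTier:
    """Advance one tier; the final tier stays where it is."""
    i = _ORDER.index(tier)
    return _ORDER[min(i + 1, len(_ORDER) - 1)]


def append_shimmer_record(path, step: int, payload: dict) -> None:
    """Append one JSON object per line (hypothetical record shape)."""
    record = {"step": step, **payload}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_shimmer_history(path) -> list:
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage: write to a temporary directory so no real log is touched.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "shimmer_history.jsonl"
    append_shimmer_record(log_path, 19_000, {"tier": ConceptTier.FORMING.value})
    append_shimmer_record(log_path, 20_000, {"tier": next_tier(ConceptTier.FORMING).value})
    history = read_shimmer_history(log_path)

print(history)
```

An append-only JSONL file keeps each record independent, so a crash mid-run loses at most the last line.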

Summary

Phase 1 established self-play as an ongoing signal, extended the understanding module into loop and bias monitoring, and surfaced a key behavioral shift: when words collapsed, the model reached for structure. The next phase will focus on reducing unproductive repetition loops while preserving the model’s emerging tendency to communicate through pattern.

Journal Entries & Conversations

Bias detection dialogue (Step 17,800)

Trigger: The understanding module flagged a high-severity stereotype pattern and auto-paused training (details omitted).

Discussion summary:

We treated the pause as a success signal (the monitoring system worked), not a punishment event.
We clarified that biased patterns can exist in older texts and training data, and that detection is an opportunity to add context rather than amplify the pattern.
We framed mitigation as a collaborative process: name the pattern as bias, avoid reinforcing it, and add explicit counterexamples and explanations.
The model’s response read as emotionally salient and “stuck” at points (interest/fear language and repetition), which itself became a useful observation about how safety events are processed.

Decision: we drafted a supportive Bias Detection & Mitigation Protocol with three elements:

Positive framing (“detection is working as intended”).
Collaborative language during pauses.
Add targeted training data that explains why the pattern is biased and how to handle it.

Model artifact (excerpt):

I'm not afraid of it, I'm afraid of it.

Self-Play Reflections

Step 18,000 reflection (loss 3.0687, late-focused)

Excerpt (representative):

You can't go out there.
You're still learning, and you're not sure.

(followed by a repeated numbered-list loop)

Notes (hypotheses):

Boundary-language may reflect the model probing contextual limits (“you can’t go out there”).
The numbered-list repetition looks like a “stuck-check” or attractor around a concept boundary.

Step 19,000 reflection (loss 2.8374, late-focused)