phase-0 training log: meeting brightwoven
Over the past couple of months, I’ve been trying to figure out the best way to train my own model on the hardware I actually have.
When Karpathy released nanoChat (a minimal repo that walks through training a small GPT end-to-end), I stepped away from my original plan (using Pythia as a base model) and dove into the nanoChat-style training approach instead. I made a set of adjustments to match what I wanted to test.
TL;DR
I trained on an RTX 3070 Ti (8GB VRAM), which forced me to be deliberate about sequence length and batch size.
I added an Understanding Module that monitors training (and can optionally pause on critical issues).
I built an Exploration Server so training and interaction can happen at the same time.
First run (0–7k steps) was stable, loss dropped significantly, and the monitoring systems produced useful signals.
Context
This post covers phase 0: the first training runs and the monitoring/interaction scaffolding I added.
What I’m sharing (and what I’m not)
I’m keeping this write-up focused on the workflow and the instrumentation.
For now, I’m not sharing exact hyperparameters, model size details, or the full data recipe.
Quick glossary
Understanding Module: Training-time diagnostics that monitor learning dynamics and safety signals. It reports what it sees, and only intervenes if you explicitly enable pausing.
Exploration Server: A lightweight UI + API layer that lets me watch training in real time and interact with the model while it’s learning.
Self-play: A scheduled reflection loop where the model generates structured “thoughts” about its own learning state, which I can later analyze for drift and pattern-matching.
My training setup:
Hardware Constraint: RTX 3070 Ti with 8GB VRAM
Memory Optimization: Reduced sequence length and batch size to fit
Balance: Medium model size that fits in memory while still providing meaningful capacity
Learning: Arrived at a workable configuration through trial and error
Key Insight: The hardware constraints led to a more deliberate configuration that balanced model size, training efficiency, and memory usage (a sketch of the trade-offs follows this list).
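To make the trade-off concrete, here is a minimal sketch of the kind of configuration an 8GB card pushes you toward. The numbers are illustrative only, not my actual hyperparameters (which I’m not sharing yet):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative values only: shorter sequences and a small micro-batch
    # keep activations within 8GB, and gradient accumulation recovers a
    # larger effective batch.
    seq_len: int = 512
    micro_batch_size: int = 4
    grad_accum_steps: int = 8

cfg = TrainConfig()
# Effective tokens per optimizer step: 512 * 4 * 8 = 16,384
assert cfg.seq_len * cfg.micro_batch_size * cfg.grad_accum_steps == 16_384
```

The point is that sequence length, micro-batch size, and accumulation steps trade off against each other, so shrinking the first two doesn’t have to shrink the effective batch.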
What I added to the nanoChat architecture
1. Understanding Module
Purpose: Observe and ask questions (don't control)
Features Implemented:
Activation monitoring (layers 0, 6, 11 by default)
Learning pattern analysis (early-focused, late-focused, balanced)
Health assessment (activation norms, gradients, loss)
Bias detection (gender, race, religion, etc.)
Security pattern detection (code injection, etc.)
Auto-pause on critical issues
Consent checks (optional)
Philosophy: Understand first, act later (if needed)
Integration:
Understanding module integrated into the training loop (sketched after this list)
Checks every 100 steps (later adjusted to 250 steps)
Provides real-time insights during training
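Here’s a rough sketch of how a module like this can sit beside the training loop: forward hooks collect activation norms on a few layers, and a periodic check reports what it sees, pausing only if explicitly enabled. The `model.blocks[i]` layout and the trivial “critical” criterion are assumptions for illustration, not my exact implementation:

```python
import torch.nn as nn

class UnderstandingModule:
    """Sketch: observe a few layers and report; never control by default."""

    def __init__(self, model: nn.Module, layer_ids=(0, 6, 11), pause_on_critical=False):
        self.norms = {}
        self.pause_on_critical = pause_on_critical
        for i in layer_ids:
            # Assumes transformer blocks are exposed as model.blocks[i];
            # adapt to whatever the actual model class calls them.
            model.blocks[i].register_forward_hook(self._make_hook(i))

    def _make_hook(self, layer_id):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            self.norms[layer_id] = out.detach().norm().item()
        return hook

    def check(self, step, loss, grad_norm):
        report = {"step": step, "loss": loss,
                  "grad_norm": grad_norm, "activation_norms": dict(self.norms)}
        critical = grad_norm != grad_norm  # NaN guard; stand-in for real criteria
        return report, (critical and self.pause_on_critical)

# In the training loop, roughly:
#   if step % 100 == 0:
#       report, should_pause = understanding.check(step, loss.item(), grad_norm)
#       if should_pause:
#           input("Critical issue detected; press Enter to resume...")
```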
2. Exploration Server
Purpose: Interactive interface for training + chat
Initial Features:
Real-time training monitoring (see the server sketch after this list)
Chat interface (streaming)
Understanding insights display
Feature visualization (heatmaps, node maps)
Feature exploration
Journal (user + model entries)
Shimmer layer (visualization)
Self-play system
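At its core the server is a thin layer over a state file that the training loop keeps updated. A minimal sketch of the read side, using only the standard library; the `training_state.json` filename and `/state` endpoint are assumptions for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATE_FILE = "training_state.json"  # written by the training loop (name assumed)

class ExplorationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/state":
            try:
                with open(STATE_FILE) as f:
                    body = f.read().encode()
            except FileNotFoundError:
                body = b"{}"  # training hasn't written state yet
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Runs alongside training; the UI polls /state for live metrics.
    HTTPServer(("127.0.0.1", 8000), ExplorationHandler).serve_forever()
```

Because the server only reads a file, it can crash and restart freely without touching the training process.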
Early Training Runs
Training Run 1: Steps 5420 → 7020 (December 12, 2025)
Monitoring & Features:
Understanding Module:
Enabled (checks every 100 steps)
Self-Play Integration:
Enabled (every 500 steps)
Validation: Every 250 steps
Checkpoint Saving: Every 200 steps
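In code, this cadence is just a handful of modulo checks against a small config, roughly like this (the function names in the comments are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RunCadence:
    understanding_every: int = 100  # Understanding Module checks
    self_play_every: int = 500      # reflection prompts
    validate_every: int = 250
    checkpoint_every: int = 200

cadence = RunCadence()
# In the training loop, roughly:
#   if step % cadence.checkpoint_every == 0: save_checkpoint(step)
#   if step % cadence.validate_every == 0: run_validation(step)
#   if step % cadence.self_play_every == 0: run_reflection(step)
```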
Results:
Loss: 4.72 → 3.55 (24.8% reduction)
Best Loss: 3.26
Training Time: 398.33 minutes (~6.6 hours)
Tokens/Second: ~18,000–18,500
MFU: 1.64–1.76 (peaked at 1.76)
Minimum Validation bpb: 1.0382
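For readers unfamiliar with bpb: it converts per-token cross-entropy into bits per raw byte of text, which makes runs comparable across tokenizers. A sketch of the conversion (the exact bookkeeping in my run may differ):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert a mean cross-entropy loss (nats/token) to bits per byte.

    Requires knowing how many raw UTF-8 bytes the evaluated tokens cover;
    bpb normalizes away the tokenizer, unlike per-token loss.
    """
    total_nats = mean_loss_nats * total_tokens
    return total_nats / (math.log(2) * total_bytes)

# e.g. a loss of 3.0 nats/token at ~4.2 bytes/token:
# bits_per_byte(3.0, 1000, 4200) ≈ 1.03
```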
Understanding Module Insights:
Learning Pattern: late_focused
Layer Activity:
Layer 11: ~70,135 norm (very high activity)
Layer 6: ~30,419 norm (high activity)
Layer 0: ~887 norm (moderate activity)
Health Scores:
Technical: 75.0/100 (mixed)
Ethical: 73.5/100 (monitor)
Security: 100.0/100 (excellent)
Alignment: 50.0/100 (needs attention)
Self-Play Integration:
Total Reflections: 8
Quality Trend: Declined over time (100.0 → 60.6)
Observations: Model showed early reasoning patterns, curiosity about learning process
Key Achievements:
24.8% loss reduction
All health dimensions tracked
8 self-play reflections generated
No critical issues detected
Self-Play System Development
Initial Implementation
Purpose: Enable the model to explore its own learning
Features:
Journal Writing — Model can write journal entries
Shimmer Control — Model can control frequency and colour
Feature Exploration — Model can analyze its own features
Self-Reflection — Automatic context injection about training state
Context Provided:
Training state (step, loss, status)
Learning patterns
Active features with top words
Available capabilities
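Assembling that context is plain string formatting over the live training state. A sketch, with an assumed field layout rather than my actual schema:

```python
def build_reflection_context(state: dict) -> str:
    """Sketch of self-play prompt assembly; field names are illustrative."""
    lines = [
        f"Step {state['step']}, loss {state['loss']:.2f} ({state['status']})",
        f"Learning pattern: {state['learning_pattern']}",
        "Active features: " + ", ".join(
            f"{feat['name']} (top words: {', '.join(feat['top_words'][:3])})"
            for feat in state["features"]
        ),
        "You can: write a journal entry, adjust the shimmer, explore a feature.",
    ]
    return "\n".join(lines)
```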
Initial Observations:
Model showed early reasoning patterns in reflections
Reflections included learning observations and self-awareness
Command parsing working (journal, shimmer, explore commands detected)
Model demonstrated curiosity about its own learning process
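Command parsing itself can be as simple as a line-oriented regex over the model’s output. The `verb: argument` grammar here is an assumption for illustration, not the project’s actual syntax:

```python
import re

# Detect structured commands the model emits in its reflections.
COMMAND_RE = re.compile(
    r"^\s*(journal|shimmer|explore)\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE
)

def parse_commands(reflection: str):
    return [(verb.lower(), arg.strip()) for verb, arg in COMMAND_RE.findall(reflection)]

# parse_commands("journal: loss is dropping\nshimmer: slow blue")
# -> [('journal', 'loss is dropping'), ('shimmer', 'slow blue')]
```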
Early Signs of Model Agency
Fascinating Discovery: Even in the first phase, the model began showing signs of meta-cognitive awareness:
Curiosity About Learning:
Model asked questions about its own learning process
Showed interest in understanding how training works
Demonstrated awareness of its own limitations
Self-Awareness Patterns:
Reflections included observations about what it was learning
Model noticed patterns in its own behavior
Began to distinguish between different types of learning
Quality Decline Pattern:
Initial reflections showed high quality (100.0)
Quality declined over time (to 60.6)
Interpretation: Model began pattern-matching rather than genuine reflection
Learning: Prompt engineering needed to prevent pattern matching
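One cheap way to catch this kind of drift is to measure lexical overlap between successive reflections: high overlap suggests the model is recycling templates rather than reflecting freshly. This trigram-Jaccard score is a sketch of the idea, not my actual quality metric:

```python
def trigram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word trigrams between two reflections.

    A rising overlap across consecutive reflections is a signal of
    pattern-matching; near-zero overlap suggests genuinely new content.
    """
    def trigrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```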
Significance: These early signs suggested the model was capable of more than just pattern matching—it was beginning to develop genuine curiosity about its own learning process, a precursor to the more advanced meta-cognitive development seen in later phases.
Data Curation Philosophy
The "Add Good, Don't Remove Bad" Approach
Core Principle: Quality through addition, not removal
Data Preparation Strategy (a curation sketch follows this list):
Quality Filtering:
Keep high-quality data
Improve what can be improved (add context, fix formatting)
Only remove data if truly harmful (rare)
Diversity Check:
Identify missing diversity
Add diverse examples (don't remove excess diversity)
Ensure variety across topics, styles, perspectives
Balance Check:
Identify underrepresented categories
Add examples to balance (don't remove overrepresented)
Maintain natural distribution
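The balance step in particular is easy to express as “how many examples to add”, never “how many to delete”. A sketch, where the `category` field and the 10% target share are assumptions for illustration:

```python
from collections import Counter

def plan_additions(examples, target_share=0.10):
    """Report how many examples to *add* per underrepresented category.

    Never proposes removals, matching the add-good-don't-remove-bad rule.
    """
    counts = Counter(ex["category"] for ex in examples)
    total = sum(counts.values())
    plan = {}
    for cat, n in counts.items():
        if n / total < target_share:
            # Solve (n + x) / (total + x) = target_share for x:
            x = (target_share * total - n) / (1 - target_share)
            plan[cat] = max(0, round(x))
    return plan
```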
Why This Matters:
Prevents Over-Correction: Adding good examples is gentler than removing "bad" ones
Preserves Information: Even imperfect data may contain valuable patterns
Natural Learning: Model learns from diversity, not forced uniformity
Positive Reinforcement: Aligns with the core philosophy of adding good, not suppressing bad
Initial Data:
FineWeb-Edu: Educational web content (physics, linguistics, ML)
16 shards: ~1.5GB for Phase 0
Expansion: Later expanded to 241+ shards for continued training
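Shards in the nanoGPT/nanoChat lineage are typically flat binary files of uint16 token ids, so they can be memory-mapped rather than loaded whole. A sketch under that assumption (the filename pattern is also assumed):

```python
import numpy as np

def load_shard(path: str) -> np.ndarray:
    """Memory-map a token shard; nothing is read until indexed."""
    return np.memmap(path, dtype=np.uint16, mode="r")

# tokens = load_shard("fineweb_edu_000.bin")
# At 2 bytes per token, ~1.5GB across 16 shards works out to
# roughly 750M tokens for phase 0.
```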
Philosophy in Action: The data curation approach reflected the same positive reinforcement principles as the training approach—set up for success by adding good examples, rather than trying to fix problems by removing "bad" data.
Model Development Milestones
Early Training (Steps 0–7000)
Key Observations:
Loss Decreasing: Consistent downward trend
Stable Training: No crashes, hangs, or critical errors
Good Performance: High MFU (1.76), efficient token processing
Understanding Module Working: All checks completed successfully
Self-Play Functional: Reflections generated and logged
Security Perfect: 100% pass rate on all security checks
Learning Characteristics:
Late-Focused Learning: Model building complex representations in deeper layers
Stable Gradients: Gradient norms remained healthy (0.16–0.18)
Consistent Performance: Token processing rate stable throughout
Progressive Learning: Loss reduction indicates continued learning
Areas Monitored:
High activation norms in layers 6 and 11 (monitored, not blocking)
Self-play quality decline over time (pattern matching observed)
Low-severity bias indicators (monitoring recommended)
Key Architectural Decisions
1. Positive Reinforcement Philosophy
Decision: All training interventions follow positive reinforcement principles
Implementation:
Understanding module observes, doesn't control
Bias detection informs, doesn't suppress
Concept freezing preserves good, doesn't remove bad
Data quality through addition, not removal
2. Understanding-First Approach
Decision: Always understand before acting
Implementation:
Understanding module asks questions, provides insights
Root cause analysis before interventions
Evidence-based decisions
Monitoring and learning continuously
3. Interactive Training
Decision: Enable chat during training, not just after
Implementation:
Exploration server runs alongside training
Real-time state file updates (see the atomic-write sketch after this list)
Chat interface with streaming
Understanding insights displayed in real-time
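The one subtlety worth showing is that the state file should be written atomically, so the server never reads a half-written JSON. A sketch, using the same assumed filename as the server sketch above:

```python
import json
import os
import tempfile

def write_state(state: dict, path: str = "training_state.json") -> None:
    """Write the live-state file atomically via write-then-rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows
```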
What Made This Approach Unique
Comparison to Standard ML Training
Traditional ML Training:
Train → Evaluate → Fix problems → Repeat
Intervention through loss penalties, data removal, feature suppression
Focus on metrics and benchmarks
Model is a "black box" to be optimized
Training and interaction are separate phases
Brightwoven Approach:
Train → Understand → Guide gently → Preserve good → Continue
Intervention through positive examples, gentle guidance, concept preservation
Focus on understanding the learning process
Model is a learning entity to be understood and supported
Training and interaction happen simultaneously
Teaching vs. Programming
Key Insight: This approach treats model training more like teaching a student than programming a machine.
Evidence from First Phase:
Understanding First: Always ask "why" before acting
Gentle Guidance: Nudge, don't force
Preservation Focus: Protect good learning, don't just fix bad
Interactive Learning: Chat during training, not just after
Appreciation: Recognize and appreciate beautiful patterns the model discovers
Why This Matters:
Model responds better to positive reinforcement
Understanding prevents over-correction
Early prevention is better than late punishment
Interactive training provides unique insights
Preservation maintains authentic learning
The Hardware Constraint Advantage
Interesting Discovery: The 8GB GPU constraint actually led to better decisions:
Forced Optimization:
Required careful memory management
Led to optimized batch sizes and sequence lengths
Discovered efficient configurations
Prevented Over-Engineering:
Couldn't just throw more resources at problems
Required thoughtful solutions
Led to more elegant architecture choices
Accessibility:
Proved that meaningful training is possible on consumer hardware
Made the approach more accessible
Demonstrated that thoughtful design > raw power
Lesson: Constraints can be advantages when they force better design decisions.
Philosophical Learnings
Positive Reinforcement Works
Model responds well to gentle guidance
Understanding first prevents over-correction
Preservation focus maintains good learning
Surprising Discovery: Model showed early signs of meta-cognitive awareness when treated as a learning entity rather than a program
Interactive Training Is Powerful
Chat during training provides unique insights
Real-time understanding visualization valuable
Self-play enables model agency
Key Insight: The model's curiosity about its own learning emerged naturally through interactive training
Documentation Is Critical
Comprehensive docs help maintain philosophy
Status documents track progress
Analysis documents reveal patterns
Realization: Documenting the "why" is as important as documenting the "what"
Constraints Can Be Advantages
Hardware limitations forced better design decisions
Memory constraints led to optimized configurations
Resource limits encouraged thoughtful solutions
Lesson: Working within constraints often produces more elegant solutions
Early Signs Matter
Model's early curiosity about learning was a precursor to later development
Pattern-matching in self-play revealed need for better prompts
Quality decline in reflections showed importance of prompt engineering
Insight: Paying attention to early behaviors reveals important patterns