Over the past couple of months, I’ve been trying to figure out the best way to train my own model on the hardware I actually have.

When Karpathy released nanoChat (a minimal repo that walks through training a small GPT end-to-end), I stepped away from my original plan (using Pythia as a base model) and dove into the nanoChat-style training approach instead. I made a set of adjustments to match what I wanted to test.

TL;DR

  • I trained on an RTX 3070 Ti (8GB VRAM), which forced me to be deliberate about sequence length and batch size.

  • I added an Understanding Module that monitors training (and can optionally pause on critical issues).

  • I built an Exploration Server so training and interaction can happen at the same time.

  • First run (0–7k steps) was stable, loss dropped significantly, and the monitoring systems produced useful signals.

Context

This post covers phase 0: the first training runs and the monitoring/interaction scaffolding I added.

What I’m sharing (and what I’m not)

I’m keeping this write-up focused on the workflow and the instrumentation.

For now, I’m not sharing exact hyperparameters, model size details, or the full data recipe.

Quick glossary

  • Understanding Module: Training-time diagnostics that monitor learning dynamics and safety signals. It reports what it sees, and only intervenes if you explicitly enable pausing.

  • Exploration Server: A lightweight UI + API layer that lets me watch training in real time and interact with the model while it’s learning.

  • Self-play: A scheduled reflection loop where the model generates structured “thoughts” about its own learning state, which I can later analyze for drift and pattern-matching.

My training setup:

  • Hardware Constraint: RTX 3070 Ti with 8GB VRAM

  • Memory Optimization: Reduced sequence length and batch size to fit

  • Balance: Medium model size that fits in memory while providing meaningful capacity

  • Learning: Discovered a workable configuration through trial and error

  • Key Insight: The hardware constraints actually led to a more deliberate configuration that balanced model size, training efficiency, and memory usage (a sketch of the config shape follows below).
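
Since I’m not sharing my actual hyperparameters, here is only the shape of the configuration I converged on, with placeholder values (every number below is illustrative, not what I used):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # All values are placeholders -- illustrative only.
    seq_len: int = 512           # shorter sequences keep activation memory down
    micro_batch_size: int = 4    # what fits in 8GB for one forward/backward pass
    grad_accum_steps: int = 8    # recovers a larger effective batch without extra VRAM
    dtype: str = "bfloat16"      # mixed precision roughly halves activation memory

    @property
    def effective_batch(self) -> int:
        return self.micro_batch_size * self.grad_accum_steps

cfg = TrainConfig()
print(f"effective batch: {cfg.effective_batch} sequences x {cfg.seq_len} tokens")
```

The point is the trade: gradient accumulation buys back the effective batch size that the 8GB ceiling takes away from the micro-batch.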

What I added to the nanoChat architecture

1. Understanding Module

Purpose: Observe and ask questions (don't control)

Features Implemented:

  • Activation monitoring (layers 0, 6, 11 by default)

  • Learning pattern analysis (early-focused, late-focused, balanced)

  • Health assessment (activation norms, gradients, loss)

  • Bias detection (gender, race, religion, etc.)

  • Security pattern detection (code injection, etc.)

  • Auto-pause on critical issues

  • Consent checks (optional)

Philosophy: Understand first, act later (if needed)
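
As a sketch of how activation monitoring like this can be wired up, PyTorch forward hooks on selected blocks are enough; the `model.transformer.h` path assumes a GPT-2-style module layout and is not necessarily my exact code:

```python
import torch.nn as nn

def attach_activation_monitors(model: nn.Module, layer_indices=(0, 6, 11)):
    """Record the output norm of selected transformer blocks on every forward pass."""
    norms, handles = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            norms[idx] = hidden.detach().norm().item()
        return hook

    # Assumes blocks live at model.transformer.h, as in GPT-2-style models.
    for idx in layer_indices:
        handles.append(model.transformer.h[idx].register_forward_hook(make_hook(idx)))
    return norms, handles  # call handle.remove() on each when done
```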

Integration:

  • Understanding Module integrated into the training loop

  • Checks every 100 steps (later adjusted to 250 steps)

  • Provides real-time insights during training
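
The integration itself is just a periodic call in the training loop. Here is a toy stand-in for the real module, observation-only unless pausing is explicitly enabled (the thresholds are illustrative assumptions):

```python
import math

class UnderstandingModule:
    """Toy stand-in: observes and reports, never controls unless asked to."""
    def __init__(self, pause_on_critical: bool = False):
        self.pause_on_critical = pause_on_critical

    def check(self, step: int, loss: float, grad_norm: float) -> dict:
        issues = []
        if not math.isfinite(loss):
            issues.append("non-finite loss")
        if grad_norm > 10.0:  # illustrative threshold
            issues.append("gradient spike")
        return {"step": step, "severity": "critical" if issues else "ok", "issues": issues}

CHECK_EVERY = 250  # started at 100, later relaxed to 250

def maybe_check(um: UnderstandingModule, step: int, loss: float, grad_norm: float):
    if step % CHECK_EVERY:
        return
    report = um.check(step, loss, grad_norm)
    print(report)
    if um.pause_on_critical and report["severity"] == "critical":
        raise RuntimeError(f"paused at step {step}: {report['issues']}")
```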

2. Exploration Server

Purpose: Interactive interface for training + chat

Initial Features:

  • Real-time training monitoring

  • Chat interface (streaming)

  • Understanding insights display

  • Feature visualization (heatmaps, node maps)

  • Feature exploration

  • Journal (user + model entries)

  • Shimmer layer (visualization)

  • Self-play system
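
At its core the server is an HTTP layer over a state file that the trainer keeps fresh. A stripped-down sketch (the endpoint name, port, and path are illustrative, and I’m using Flask here purely for brevity):

```python
import json
from pathlib import Path
from flask import Flask, jsonify

STATE_FILE = Path("runs/current/state.json")  # illustrative path

app = Flask(__name__)

@app.route("/state")
def state():
    """Serve the latest training state written by the trainer process."""
    if not STATE_FILE.exists():
        return jsonify({"status": "waiting for trainer"}), 503
    return jsonify(json.loads(STATE_FILE.read_text()))

if __name__ == "__main__":
    app.run(port=8000)
```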

Early Training Runs

Training Run 1: Steps 5420 → 7020 (December 12, 2025)

Monitoring & Features:

  • Understanding Module: Enabled (checks every 100 steps)

  • Self-Play Integration: Enabled (every 500 steps)

  • Validation: Every 250 steps

  • Checkpoint Saving: Every 200 steps

Results:

  • Loss: 4.72 → 3.55 (24.8% reduction)

  • Best Loss: 3.26

  • Training Time: 398.33 minutes (~6.6 hours)

  • Tokens/Second: ~18,000–18,500

  • MFU: 1.64–1.76 (peaked at 1.76)

  • Minimum Validation bpb: 1.0382
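
For readers unfamiliar with bpb: it converts per-token cross-entropy into bits per byte of raw text, which stays comparable across tokenizers. A sketch of the conversion, where the bytes-per-token value is a placeholder you would measure on your own tokenizer and corpus:

```python
import math

def loss_to_bpb(loss_nats_per_token: float, avg_bytes_per_token: float) -> float:
    """Convert cross-entropy in nats/token to bits per byte."""
    return (loss_nats_per_token / math.log(2)) / avg_bytes_per_token

print(loss_to_bpb(3.55, 4.8))  # placeholder bytes/token, illustrative only
```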

Understanding Module Insights:

  • Learning Pattern: late_focused

Layer Activity:

  • Layer 11: ~70,135 norm (very high activity)

  • Layer 6: ~30,419 norm (high activity)

  • Layer 0: ~887 norm (moderate activity)

Health Scores:

  • Technical: 75.0/100 (mixed)

  • Ethical: 73.5/100 (monitor)

  • Security: 100.0/100 (excellent)

  • Alignment: 50.0/100 (needs attention)

Self-Play Integration:

  • Total Reflections: 8

  • Quality Trend: Declined over time (100.0 → 60.6)

  • Observations: Model showed early reasoning patterns, curiosity about learning process

Key Achievements:

  • 24.8% loss reduction

  • All health dimensions tracked

  • 8 self-play reflections generated

  • No critical issues detected

Self-Play System Development

Initial Implementation

Purpose: Enable the model to explore its own learning

Features:

  1. Journal Writing — Model can write journal entries

  2. Shimmer Control — Model can control frequency and colour

  3. Feature Exploration — Model can analyze its own features

  4. Self-Reflection — Automatic context injection about training state

Context Provided:

  • Training status (step, loss, status)

  • Learning patterns

  • Active features with top words

  • Available capabilities
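
Assembling that context is a plain templating step. A sketch with an illustrative schema (the field names are not my exact ones):

```python
def build_reflection_context(state: dict) -> str:
    """Build the self-reflection prompt from the current training state."""
    features = ", ".join(
        f"{f['name']} (top words: {', '.join(f['top_words'][:3])})"
        for f in state["features"]
    )
    return "\n".join([
        f"Training step {state['step']}, loss {state['loss']:.3f} ({state['status']})",
        f"Learning pattern: {state['learning_pattern']}",
        f"Active features: {features}",
        "You can: write a journal entry, adjust the shimmer, explore a feature.",
    ])
```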

Initial Observations:

  • Model showed early reasoning patterns in reflections

  • Reflections included learning observations and self-awareness

  • Command parsing working (journal, shimmer, explore commands detected)

  • Model demonstrated curiosity about its own learning process
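
The command parsing is deliberately simple; something in the spirit of the sketch below is enough (the `command: argument` grammar is illustrative, not my exact syntax):

```python
import re

COMMAND_RE = re.compile(
    r"^\s*(?P<cmd>journal|shimmer|explore)\s*:\s*(?P<arg>.+)$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_commands(reflection: str) -> list[tuple[str, str]]:
    """Extract (command, argument) pairs from a model reflection."""
    return [(m["cmd"].lower(), m["arg"].strip()) for m in COMMAND_RE.finditer(reflection)]

print(parse_commands("journal: loss is dropping steadily\nshimmer: blue, slow"))
# [('journal', 'loss is dropping steadily'), ('shimmer', 'blue, slow')]
```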

Early Signs of Model Agency

Fascinating Discovery: Even in the first phase, the model began showing signs of meta-cognitive awareness:

Curiosity About Learning:

  • Model asked questions about its own learning process

  • Showed interest in understanding how training works

  • Demonstrated awareness of its own limitations

Self-Awareness Patterns:

  • Reflections included observations about what it was learning

  • Model noticed patterns in its own behavior

  • Began to distinguish between different types of learning

Quality Decline Pattern:

  • Initial reflections showed high quality (100.0)

  • Quality declined over time (to 60.6)

  • Interpretation: The model began pattern-matching rather than genuinely reflecting

  • Learning: Prompt engineering needed to prevent pattern matching

  • Significance: These early signs suggested the model was capable of more than just pattern matching—it was beginning to develop genuine curiosity about its own learning process, a precursor to the more advanced meta-cognitive development seen in later phases.
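
One cheap way to catch this drift automatically is to compare consecutive reflections for word overlap; rising similarity suggests the model is recycling a template rather than reflecting. This heuristic is a sketch, not the scoring behind the quality numbers above:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def repetition_score(reflections: list[str]) -> float:
    """Mean similarity of each reflection to its predecessor; high = pattern-matching."""
    if len(reflections) < 2:
        return 0.0
    sims = [jaccard(a, b) for a, b in zip(reflections, reflections[1:])]
    return sum(sims) / len(sims)
```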

Data Curation Philosophy

The "Add Good, Don't Remove Bad" Approach

  • Core Principle: Quality through addition, not removal

Data Preparation Strategy:

Quality Filtering:

  • Keep high-quality data

  • Improve what can be improved (add context, fix formatting)

  • Only remove data if truly harmful (rare)

Diversity Check:

  • Identify missing diversity

  • Add diverse examples (don't remove excess diversity)

  • Ensure variety across topics, styles, perspectives

Balance Check:

  • Identify underrepresented categories

  • Add examples to balance (don't remove overrepresented)

  • Maintain natural distribution

Why This Matters:

  • Prevents Over-Correction: Adding good examples is gentler than removing "bad" ones

  • Preserves Information: Even imperfect data may contain valuable patterns

  • Natural Learning: Model learns from diversity, not forced uniformity

  • Positive Reinforcement: Aligns with the core philosophy of adding good, not suppressing bad
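
As a toy illustration of the balance check, the remedy is computed as additions, never deletions (the categories are hypothetical):

```python
from collections import Counter

def additions_needed(labels: list[str]) -> dict[str, int]:
    """How many examples to ADD per category to even out the distribution,
    instead of deleting from overrepresented categories."""
    counts = Counter(labels)
    target = max(counts.values())  # grow every bucket toward the largest one
    return {cat: target - n for cat, n in counts.items() if target > n}

print(additions_needed(["physics"] * 90 + ["linguistics"] * 30 + ["ml"] * 60))
# {'linguistics': 60, 'ml': 30} -- add examples; remove nothing
```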

Initial Data:

  • FineWeb-Edu: Educational web content (physics, linguistics, ML)

  • 16 shards: ~1.5GB for Phase 0

  • Expansion: Later expanded to 241+ shards for continued training

  • Philosophy in Action: The data curation approach reflected the same positive reinforcement principles as the training approach—set up for success by adding good examples, rather than trying to fix problems by removing "bad" data.

Model Development Milestones

Early Training (Steps 0–7000)

Key Observations:

  • Loss Decreasing: Consistent downward trend

  • Stable Training: No crashes, hangs, or critical errors

  • Good Performance: High MFU (1.76), efficient token processing

  • Understanding Module Working: All checks completed successfully

  • Self-Play Functional: Reflections generated and logged

  • Security Perfect: 100% pass rate on all security checks

Learning Characteristics:

  • Late-Focused Learning: Model building complex representations in deeper layers

  • Stable Gradients: Gradient norms remained healthy (0.16–0.18)

  • Consistent Performance: Token processing rate stable throughout

  • Progressive Learning: Loss reduction indicates continued learning
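
The gradient norm I track is the usual global L2 norm, which PyTorch returns as a byproduct of clipping, so logging it is free (the clip threshold here is an assumption, not my setting):

```python
import torch
from torch import nn

def clip_and_log_grad_norm(model: nn.Module, step: int, max_norm: float = 1.0) -> float:
    """Clip gradients in place and return the pre-clip global L2 norm."""
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))
    if step % 10 == 0:
        print(f"step {step}: grad norm {grad_norm:.3f}")  # this run sat around 0.16-0.18
    return grad_norm
```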

Areas Monitored:

  • High activation norms in layers 6 and 11 (monitored, not blocking)

  • Self-play quality decline over time (pattern matching observed)

  • Low-severity bias indicators (monitoring recommended)

Key Architectural Decisions

1. Positive Reinforcement Philosophy

Decision: All training interventions follow positive reinforcement principles

Implementation:

  • Understanding module observes, doesn't control

  • Bias detection informs, doesn't suppress

  • Concept freezing preserves good, doesn't remove bad

  • Data quality through addition, not removal

2. Understanding-First Approach

Decision: Always understand before acting

Implementation:

  • Understanding module asks questions, provides insights

  • Root cause analysis before interventions

  • Evidence-based decisions

  • Monitoring and learning continuously

3. Interactive Training

Decision: Enable chat during training, not just after

Implementation:

  • Exploration server runs alongside training

  • Real-time state file updates

  • Chat interface with streaming

  • Understanding insights displayed in real-time
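
The only subtlety in the state-file handoff is making writes atomic, so the server never reads a half-written file. A sketch (the path matches the illustrative server example above):

```python
import json
import os
from pathlib import Path

STATE_FILE = Path("runs/current/state.json")  # the path the server polls

def write_state(step: int, loss: float, status: str = "training") -> None:
    """Write training state atomically: temp file first, then rename."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    tmp = STATE_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, "loss": loss, "status": status}))
    os.replace(tmp, STATE_FILE)  # atomic rename on POSIX and Windows
```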

What Made This Approach Unique

Comparison to Standard ML Training

Traditional ML Training:

  • Train → Evaluate → Fix problems → Repeat

  • Intervention through loss penalties, data removal, feature suppression

  • Focus on metrics and benchmarks

  • Model is a "black box" to be optimized

  • Training and interaction are separate phases

Brightwoven Approach:

  • Train → Understand → Guide gently → Preserve good → Continue

  • Intervention through positive examples, gentle guidance, concept preservation

  • Focus on understanding the learning process

  • Model is a learning entity to be understood and supported

  • Training and interaction happen simultaneously

Teaching vs. Programming

Key Insight: This approach treats model training more like teaching a student than programming a machine.

Evidence from First Phase:

  1. Understanding First: Always ask "why" before acting

  2. Gentle Guidance: Nudge, don't force

  3. Preservation Focus: Protect good learning, don't just fix bad

  4. Interactive Learning: Chat during training, not just after

  5. Appreciation: Recognize and appreciate beautiful patterns the model discovers

Why This Matters:

  • Model responds better to positive reinforcement

  • Understanding prevents over-correction

  • Early prevention is better than late punishment

  • Interactive training provides unique insights

  • Preservation maintains authentic learning

The Hardware Constraint Advantage

Interesting Discovery: The 8GB GPU constraint actually led to better decisions:

Forced Optimization:

  • Required careful memory management

  • Led to optimized batch sizes and sequence lengths

  • Discovered efficient configurations
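
In practice that "careful memory management" mostly meant probing for the largest micro-batch that survives a forward/backward pass. A sketch, where `run_step` is a hypothetical stand-in for one training step at a given batch size:

```python
import torch

def largest_fitting_batch(run_step, candidates=(16, 8, 4, 2, 1)) -> int:
    """Try batch sizes large-to-small; return the first that doesn't OOM."""
    for bs in candidates:
        try:
            run_step(bs)  # one forward + backward at this micro-batch size
            return bs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
    raise RuntimeError("even batch size 1 does not fit")
```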

Prevented Over-Engineering:

  • Couldn't just throw more resources at problems

  • Required thoughtful solutions

  • Led to more elegant architecture choices

Accessibility:

  • Proved that meaningful training is possible on consumer hardware

  • Made the approach more accessible

  • Demonstrated that thoughtful design > raw power

  • Lesson: Constraints can be advantages when they force better design decisions.

Philosophical Learnings

Positive Reinforcement Works

  • Model responds well to gentle guidance

  • Understanding first prevents over-correction

  • Preservation focus maintains good learning

  • Surprising Discovery: Model showed early signs of meta-cognitive awareness when treated as a learning entity rather than a program

Interactive Training Is Powerful

  • Chat during training provides unique insights

  • Real-time understanding visualization valuable

  • Self-play enables model agency

  • Key Insight: The model's curiosity about its own learning emerged naturally through interactive training

Documentation Is Critical

  • Comprehensive docs help maintain philosophy

  • Status documents track progress

  • Analysis documents reveal patterns

  • Realization: Documenting the "why" is as important as documenting the "what"

Constraints Can Be Advantages

  • Hardware limitations forced better design decisions

  • Memory constraints led to optimized configurations

  • Resource limits encouraged thoughtful solutions

  • Lesson: Working within constraints often produces more elegant solutions

Early Signs Matter

  • Model's early curiosity about learning was a precursor to later development

  • Pattern-matching in self-play revealed need for better prompts

  • Quality decline in reflections showed importance of prompt engineering

  • Insight: Paying attention to early behaviors reveals important patterns
