phase-0 training log: meeting brightwoven
Over the past couple of months, I’ve been trying to figure out the best way to train my own model on the hardware I actually have.
When Karpathy released nanochat (a minimal repo that walks through training a small GPT end to end), I stepped away from my original plan of using Pythia as a base model and dove into the nanochat-style training approach instead, making a set of adjustments to match what I wanted to test.
TL;DR
I trained on an RTX 3070 Ti (8GB VRAM), which forced me to be deliberate about sequence length and batch size.
I added an Understanding Module that monitors training (and can optionally pause on critical issues).
I built an Exploration Server so training and interaction can happen at the same time.
The first run (steps 0–7k) was stable, the loss dropped significantly, and the monitoring systems produced useful signals.
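To make the Understanding Module idea concrete, here is a minimal sketch of a monitor that watches the loss and can optionally pause on critical issues. The window size, spike heuristic, and pause flag are my illustrative assumptions, not the actual implementation:

```python
from collections import deque

class UnderstandingModule:
    """Illustrative sketch: watch training loss, flag anomalies,
    and optionally pause on critical issues.

    Everything inside (window size, spike threshold, pause
    semantics) is assumed for illustration, not the real module.
    """

    def __init__(self, window=100, spike_factor=2.0, pause_on_critical=False):
        self.losses = deque(maxlen=window)
        self.spike_factor = spike_factor
        self.pause_on_critical = pause_on_critical
        self.paused = False

    def observe(self, step, loss):
        # Non-finite losses are always critical.
        if loss != loss or loss == float("inf"):
            return self._critical(step, "non-finite loss")
        # Once the window is full, compare against the recent average.
        if len(self.losses) == self.losses.maxlen:
            avg = sum(self.losses) / len(self.losses)
            if loss > self.spike_factor * avg:
                return self._critical(step, f"loss spike: {loss:.3f} vs avg {avg:.3f}")
        self.losses.append(loss)
        return None

    def _critical(self, step, reason):
        if self.pause_on_critical:
            self.paused = True  # the training loop checks this flag and waits
        return {"step": step, "reason": reason}
```

In a training loop, a monitor like this would sit right after the loss computation; the loop checks the `paused` flag each step and blocks until an operator clears it.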
Context
This post covers phase 0: the first training runs and the monitoring/interaction scaffolding I added.
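The Exploration Server half of that scaffolding can be sketched with a snapshot-and-serve pattern: the training loop periodically publishes a frozen copy of the model, and a worker thread answers interaction requests against the latest snapshot. The threading layout and names below are my assumptions about one reasonable way to do it, not the actual code:

```python
import queue
import threading

class ExplorationServer:
    """Illustrative sketch: serve interaction requests against a
    model snapshot while training continues on the live weights.

    The snapshot swap and request queue are assumptions; a real
    server would also handle tokenization, sampling, etc.
    """

    def __init__(self):
        self._snapshot = None           # frozen copy of the weights
        self._lock = threading.Lock()
        self._requests = queue.Queue()

    def publish(self, snapshot):
        # Called from the training loop, e.g. every N steps.
        with self._lock:
            self._snapshot = snapshot

    def ask(self, prompt):
        # Called from the interaction side; blocks until answered.
        done = queue.Queue(maxsize=1)
        self._requests.put((prompt, done))
        return done.get()

    def serve_forever(self, respond):
        # Worker thread: answer each request with the latest snapshot.
        while True:
            prompt, done = self._requests.get()
            with self._lock:
                snapshot = self._snapshot
            done.put(respond(snapshot, prompt))
```

The point of the design is that interaction never touches the live weights mid-update: the trainer only ever calls `publish`, and readers only ever see a consistent snapshot.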
What I’m sharing (and what I’m not)
I’m keeping this write-up focused on the workflow and the instrumentation.
For now, I’m not sharing exact hyperparameters, model size details, or the full data recipe.