What If Grokking Isn't Mysterious? It's Just Learning With No Handholds
Scope: Observations from switching between focused and generalized training data, and what they suggest about grokking
The Setup
I've been running a model through a curriculum-style training process — general data first, then focused physics data, then a switch back to generalized data. What I expected was a messy transition. What I got was a pattern that reframes one of the more puzzling phenomena out there right now.
Here's what my dashboards show at the moment:
Reasoning metrics are going up. Chain-of-thought scores, reasoning coherence across multiple benchmarks (ARC, HellaSwag, BoolQ, CommonsenseQA, and others) — the generalized training run shows a clear upward trend, typically climbing from ~0.6–0.7 up past 0.9.
Regular benchmarks are flat. WinoGrande, LAMBADA, Jeopardy, SQuAD, BigBench, and most of the standard eval suite — flat or slightly down.
Two signals going in opposite directions. That should feel familiar if you've read some of my other posts.
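To make "two signals in opposite directions" concrete, here's a minimal sketch of how you might quantify it: fit a least-squares trend line to each metric group across checkpoints and compare the slopes. The scores below are hypothetical placeholders, not my actual run data.

```python
import numpy as np

def trend_slope(scores):
    """Least-squares slope of a metric series over training checkpoints."""
    steps = np.arange(len(scores))
    slope, _intercept = np.polyfit(steps, scores, 1)
    return slope

# Hypothetical checkpoint-by-checkpoint scores (illustrative only).
reasoning = [0.62, 0.68, 0.75, 0.84, 0.91]  # e.g. chain-of-thought coherence
standard = [0.71, 0.70, 0.71, 0.69, 0.70]   # e.g. a flat standard benchmark

print(f"reasoning slope: {trend_slope(reasoning):+.3f}")  # clearly positive
print(f"standard slope:  {trend_slope(standard):+.3f}")   # near zero
```

A positive slope on one group and a near-zero (or negative) slope on the other is exactly the divergence described above.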