What If Grokking Isn't Mysterious? It's Just Learning With No Handholds
Scope: Observations from switching between focused and generalized training data, and what they suggest about grokking
The Setup
I've been running a model through a curriculum-style training process — general data first, then focused physics data, then a switch back to generalized data. What I expected was a messy transition. What I got was a pattern that reframes one of the more puzzling phenomena out there right now.
Here's what my dashboards show at the moment:
Reasoning metrics are going up. Chain-of-thought scores, reasoning coherence across multiple benchmarks (ARC, HellaSwag, BoolQ, CommonsenseQA, and others) — the generalized training run shows a clear upward trend, typically climbing from ~0.6–0.7 up past 0.9.
Regular benchmarks are flat. WinoGrande, LAMBADA, Jeopardy, SQuAD, BigBench, and most of the standard eval suite — flat or slightly down.
Two signals going in opposite directions. That should feel familiar if you've read some of my other posts.
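To make "two signals in opposite directions" concrete, here's a minimal sketch of how you might quantify it: fit a least-squares trend line to each metric group across checkpoints and compare the slopes. The scores below are hypothetical placeholders, not my actual run data.

```python
import numpy as np

def trend_slope(scores):
    """Least-squares slope of a metric series over training checkpoints."""
    steps = np.arange(len(scores))
    slope, _intercept = np.polyfit(steps, scores, 1)
    return slope

# Hypothetical checkpoint-by-checkpoint scores (illustrative only).
reasoning = [0.62, 0.68, 0.75, 0.84, 0.91]  # e.g. chain-of-thought coherence
standard = [0.71, 0.70, 0.71, 0.69, 0.70]   # e.g. a flat standard benchmark

print(f"reasoning slope: {trend_slope(reasoning):+.3f}")  # clearly positive
print(f"standard slope:  {trend_slope(standard):+.3f}")   # near zero
```

A positive slope on one group and a near-zero (or negative) slope on the other is exactly the divergence described above.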