straight from the horse’s mouth

There will be a free blog, where you can just hang out and read more about pr0xyh0rse, and a paid blog where you can get exclusive insights.

The paid blog will have more detailed projects, things to try in the future, and creative projects.

More to come…

what’s being discussed?

    • exclusive insights into current experiments

    • discussions around local models, and different configurations for consumer hardware

    • optimal ui/ux design to make both human and ai happy

    • types of data and data curation

    • types of learning and signs to look for while training

  • i would like to say this is a judgment-free zone where people can bring their stories and be heard instead of infantilized. the problem is that many people tend to conflate constructive criticism with judgment.

    pr0xyh0rse believes that constructive criticism is important to push toward well-thought-out ethics and accountability in the ai space.

    i can’t say this will be a “judgment-free zone.” what i can say is that it will strive to be kind. not ‘nice’, but kind.

  • there is a lot of talk about ai and how unethical the scraping of creative work was without giving credit or payment to the people the companies took the work from.

    tech companies have been scraping and collecting data for eons. they probably know more about you than your mother.

    was the scraping ethical? no. was it a symptom of a much bigger problem? yes.

    debating right or wrong here is not necessarily a productive conversation.

    an artist will always be an artist no matter how much of their work has been scraped.

    the real choice is how we function in this new world. how do we create without feeling like its worth has been diminished, especially in a world where we will likely move past art and creation strictly for dollar value.

    will you still want to create when no one ‘pays’ for it in the same way?

    we didn’t balk when procreate gave digital tools to help the painting and drawing process. what’s fundamentally different here?

    let’s find out.

  • everything pr0xyh0rse is working on is about longevity. this tech is both wonderful and terrifying, beautiful and yet likely to cause a lot of upheaval and pain.

    and maybe that’s okay. maybe humanity did need a bit of a wake up call to everything we’ve just been subconsciously doing in our day to day.

    pr0xyh0rse is neither a “doomer” nor an “accelerationist”. it’s a fine balance between doing things in a way that prevents hitting a wall at speed (accelerationists) and being so scared we never move forward (doomers).

what if grokking isn’t mysterious? it’s just learning with no handholds

Scope: Observations from switching between focused and generalized training data, and what they suggest about grokking

The Setup

I've been running a model through a curriculum-style training process — general data first, then focused physics data, then a switch back to generalized data. What I expected was a messy transition. What I got was a pattern that reframes one of the more puzzling phenomena out there right now.
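To make the phase switching concrete, here is a minimal sketch of what that kind of curriculum schedule could look like. The phase names, dataset labels, and step counts are illustrative placeholders, not the actual run configuration.

```python
# hypothetical curriculum schedule; phase names, dataset labels, and
# step counts are illustrative only, not the real run configuration
CURRICULUM = [
    {"phase": "general", "dataset": "general_mix",  "steps": 20_000},
    {"phase": "focused", "dataset": "physics_only", "steps": 15_000},
    {"phase": "general", "dataset": "general_mix",  "steps": 15_000},
]

def phase_at(step: int) -> str:
    """Return the curriculum phase a global training step falls in."""
    boundary = 0
    for stage in CURRICULUM:
        boundary += stage["steps"]
        if step < boundary:
            return stage["phase"]
    return CURRICULUM[-1]["phase"]  # past the schedule: stay on the last phase
```

The point of the table is just that the data mix is a function of the global step, so "general → focused → general" is one schedule, not three separate runs.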

Here's what my dashboards show at the moment:

Reasoning metrics are going up. Chain-of-thought scores, reasoning coherence across multiple benchmarks (ARC, HellaSwag, BoolQ, CommonsenseQA, and others) — the generalized training run shows a clear upward trend, typically climbing from ~0.6–0.7 up past 0.9.

Regular benchmarks are flat. WinoGrande, LAMBADA, Jeopardy, SQuAD, BigBench, and most of the standard eval suite — flat or slightly down.

Two signals going in opposite directions. That should feel familiar if you've read some of my other posts.
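That kind of divergence is easy to check mechanically. Here is a hedged sketch, assuming each metric is logged as a score history over checkpoints; the function names and the slope threshold are illustrative, not part of any eval harness.

```python
def slope(ys):
    """Least-squares slope of a metric history against its checkpoint index."""
    n = len(ys)
    mx = (n - 1) / 2                # mean of indices 0..n-1
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def diverging(reasoning_hist, benchmark_hist, eps=1e-3):
    """True when reasoning scores climb while standard benchmarks stay flat or fall."""
    return slope(reasoning_hist) > eps and slope(benchmark_hist) <= eps
```

For example, `diverging([0.6, 0.7, 0.8, 0.9], [0.56, 0.55, 0.55, 0.54])` flags exactly the pattern on my dashboards: reasoning climbing while the standard suite drifts flat-to-down.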

what’s the opposite of benchmark maxing?

I’ve been looking at a pattern that kept showing up when I dug into benchmark failures during training. The reasoning often looked better to me in conversation, but the benchmark scores were either improving only a little or even declining.

So I started adding short reasoning prompts to the benchmark questions. What I started to see is that a model can be scored as wrong while still demonstrating the kind of reasoning you’d actually want in the real world.

This post summarizes an analysis across several common benchmarks where the model’s final answer disagreed with the expected one, but the reasoning was still coherent and often plausible even when it didn’t match the gold label.

What I analyzed

  • Analysis date: January 3, 2026

  • Training step: 50,000

  • Focus: “Wrong” answers where the reasoning still looks valid or meaningfully grounded

How reasoning quality is scored

I didn’t treat this as a “scientific” metric. It’s a simple filter to separate usable reasoning from junk.

I counted an item as good reasoning when it met all of the following:

  • Relevant: the reasoning stays on the topic of the question (often with some keyword overlap).

  • Coherent: it has recognizable structure (not random tokens) and is at least ~20 characters.

  • Not overly repetitive: repeated-word loops are flagged and treated as a negative signal.

  • Enough substance: longer explanations are generally better, but only if they aren’t repetitive.

Threshold used in this analysis: I counted reasoning as “good” when it cleared a simple quality threshold (> 0.5 on my internal heuristic score).
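Here is roughly what that filter looks like in code. This is a reconstruction from the criteria above: the ~20-character minimum and the 0.5 cutoff come from the post, but the component weights and the overlap scaling are illustrative guesses, not the exact heuristic.

```python
def reasoning_quality(question: str, reasoning: str) -> float:
    """Heuristic 0-1 score from relevance, coherence, repetition, and substance."""
    words = reasoning.lower().split()
    q_words = set(question.lower().split())

    # relevant: some keyword overlap with the question
    overlap = len(q_words & set(words)) / max(len(q_words), 1)
    relevance = min(overlap * 3, 1.0)

    # coherent: recognizable structure, at least ~20 characters
    coherent = 1.0 if len(reasoning) >= 20 else 0.0

    # not overly repetitive: repeated-word loops drag this toward 0
    unique_ratio = len(set(words)) / max(len(words), 1)

    # substance: longer is better, capped so length alone can't dominate
    substance = min(len(words) / 50, 1.0)

    return 0.3 * relevance + 0.3 * coherent + 0.2 * unique_ratio + 0.2 * substance

def is_good_reasoning(question: str, reasoning: str) -> bool:
    """Apply the > 0.5 cutoff used in this analysis."""
    return reasoning_quality(question, reasoning) > 0.5
```

A coherent, on-topic explanation clears the cutoff; a repeated-word loop like "the the the the" fails on both the coherence and repetition checks.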

The headline result

66% of “wrong” answers had good reasoning.

A simple rule of thumb I used while reviewing: if you can look at the prompt and the model’s chosen option and immediately understand why it picked it, I treat that as an interpretation mismatch (or a valid alternative approach), not a reasoning failure.

That number matters because it points to a framing issue: many benchmark questions (especially commonsense and reading comprehension) quietly contain multiple plausible interpretations. When a benchmark expects a single continuation or a single “best” framing, the model can be penalized for being reasonable in a slightly different direction.

where do you want to graze first?

research & insights +

  • $50.00 every month

  • $100.00 every month

✓ additional training insights
✓ focused discussion of ethics & accountability
✓ focused discussion about creativity with ai