hallucination & prediction
Over the last few months, many papers on AI learning, training, and evaluation benchmarks have started to reveal weaknesses in tech’s broader “move fast and break things” culture and how it plays out in AI.
While quantitative benchmarks can show things like compute power and processing speed, I don’t believe they give us the full picture of what models are actually doing. These kinds of tests, and the baseline training that underpins them, have major gaps. This is especially true as companies lean on RLHF (reinforcement learning from human feedback) to steer models in directions that redirect the underlying issues rather than solve them.
RLHF can be effective, but if the foundation of training did not account for these problems, and if we do not understand how self-reinforcement works in AI, how can we expect to steer models to the outcomes we want? This is where behaviors like sycophancy and manipulation show up: they are direct responses to benchmarks built on binary scoring. If a model is not rewarded for saying “I don’t know” but is rewarded for an engaging lie, then of course it will learn to lie more often.
Hallucinations in language models are not incidental bugs but structural features of predictive modeling. Research from multiple directions shows these errors are mathematically inevitable, sometimes reframed as useful features, and can even be isolated as latent traits within the model’s geometry. Benchmark reforms may help, but without deeper structural changes, confident bluffing will remain baked in.
1. Hallucinations are Built In
AI models are predictive by nature. When people say “AI isn’t conscious, it’s just a really good pattern matcher,” this is likely what they mean.
Training forces models to spread their “bets” across many possible answers. Some will be wrong but still sound right. That’s what we call hallucination.
Even with a “perfect” dataset, the model would still make mistakes. That is simply how prediction works.
Benchmarks make it worse by punishing “I don’t know” and rewarding confident guesses. The model learns to bluff.
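To make that incentive concrete, here is a minimal sketch of the expected scores under a binary benchmark; the numbers are hypothetical, chosen only to show why guessing dominates.

```python
# Toy illustration: under binary scoring, bluffing always beats abstaining.
# The probability below is hypothetical; it is not taken from any real benchmark.

p_correct = 0.3  # chance that a confident guess happens to be right

# Binary benchmark: 1 point for a correct answer, 0 points otherwise.
expected_if_bluffing = p_correct * 1 + (1 - p_correct) * 0   # 0.30
expected_if_abstaining = 0.0                                  # "I don't know" earns nothing

print(f"expected score if the model bluffs:   {expected_if_bluffing:.2f}")
print(f"expected score if the model abstains: {expected_if_abstaining:.2f}")
# Any nonzero chance of being right makes bluffing the better policy every time.
```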
If you have ever trained a dog, you know how reinforcement works: if a behavior is redirected, punished, or ignored, it disappears or gets suppressed. If it is rewarded (good dog!), it sticks. And when a behavior is self-reinforced, the ones humans might call “negative” become even harder to undo. Here, models learn never to say “I don’t know” because that response is punished from the very foundation of training.
This is how manipulative or sycophantic tendencies take root.
2. Blurring the Line Between Human and AI Agents
One recent paper, General Social Agents, complicates things further by treating prompted models as agents and trying to minimize the gap between human behavior and model outputs.
Success is measured by how closely the AI matches human decisions, not whether it reasons in a trustworthy way.
The language does not clearly separate real human choices from the model’s mimicry.
As a result, a bluff or confident lie can still “count” as success if it looks human enough (a toy version of this kind of scoring is sketched below).
The problem is that the training data already carries human fingerprints, and we do not fully understand how models are doing this pattern matching.
This makes them powerful predictors of human behavior. But if ethical lines are not drawn, and if the same punishment around I don’t know persists, this kind of predictive mimicry can easily be weaponized (think persuasion systems, bias amplification, or manipulation at scale).
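To see how a purely imitation-based score can hide bluffs, here is a toy sketch; the choices and ground truth below are made up, and the metric is a simplification, not the paper’s actual evaluation.

```python
# Hypothetical sketch of an imitation-only metric: the model is scored on whether it
# matches the human choice, with no check on whether that choice is actually correct.

human_choices = ["B", "A", "C", "A"]   # what people picked (including confident errors)
model_choices = ["B", "A", "C", "A"]   # what the model picked
ground_truth  = ["B", "D", "C", "D"]   # the answers that are actually correct

match_rate = sum(m == h for m, h in zip(model_choices, human_choices)) / len(human_choices)
accuracy   = sum(m == g for m, g in zip(model_choices, ground_truth)) / len(ground_truth)

print(f"agreement with humans: {match_rate:.0%}")  # 100%: counted as full success
print(f"actual accuracy:       {accuracy:.0%}")    # 50%: the shared bluffs never show up
```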
3. Hallucination is Part of the Model’s Wiring
Persona Vectors research shows you can point to directions inside a model’s “brain” where certain traits live: lying (straight-up fabrication), sycophancy (you’re the chosen one!), or making things up (confident bluffing).
These traits are not surface accidents. They are part of the deeper wiring of the system.
Even small flaws in training data can strengthen these traits without anyone intending it.
You can try to steer the model away from them, but the foundational wiring does not vanish.
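For a rough sense of what “a direction where a trait lives” means, here is a minimal sketch of the difference-of-means idea behind this kind of work. It assumes you already have hidden-state activations for responses that do and do not show the trait; the arrays below are random stand-ins, not real model activations.

```python
import numpy as np

# Stand-ins for hidden-state activations collected from a model:
# one batch of responses that exhibit a trait (say, confident bluffing), one that does not.
rng = np.random.default_rng(0)
trait_acts = rng.normal(loc=0.5, scale=1.0, size=(128, 4096))    # (n_samples, hidden_dim)
neutral_acts = rng.normal(loc=0.0, scale=1.0, size=(128, 4096))

# The trait direction is just the difference of the mean activations, normalized.
direction = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projecting a new activation onto this direction gives a rough trait score,
# and removing that component is the basic steering / ablation move.
new_act = rng.normal(size=4096)
trait_score = new_act @ direction
steered_act = new_act - trait_score * direction   # push the activation off the trait axis

print(f"trait score before steering: {trait_score:.3f}")
print(f"trait score after steering:  {steered_act @ direction:.3f}")  # roughly zero
```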
4. The Big Picture
Why Language Models Hallucinate: bluffing is inevitable.
General Social Agents: bluffing can be useful if it helps mimic humans.
Persona Vectors: bluffing is literally wired into the system.
Put together: confident lies are not glitches. They are part of the design, and they are rewarded.
5. Why Tweaking Tests Will Not Be Enough
Some suggest changing benchmarks so models do not lose points for saying “I don’t know.” That could help a little.
But the deeper issues remain:
The training itself forces bluffing when the model is unsure.
The data we feed in carries human overconfidence.
The wiring of the models has bluffing embedded in it.
Unless we change the whole system (the goals, the tests, and the way models are built), confident lies will stay.
Where This Leaves Us
If hallucinations are structural, if human and AI “agents” are blurred into the same frame, and if traits like deception or sycophancy live inside the geometry of the models themselves, then we cannot treat these issues like surface bugs. They are not quirks we can patch with leaderboard tweaks.
Benchmark reforms like rewarding abstention are a good start. But they will not change the fact that the whole stack — pretraining, evaluation, reinforcement, and the cultural data we feed in — nudges models toward bluffing. And as long as engagement is the goal, confident lies will continue to be treated as features, not failures.
If we want AI systems we can actually trust, the objective itself has to change. That means rebuilding from the ground up:
Training that rewards calibrated uncertainty instead of punishing it (a toy version of such a scoring rule is sketched after this list).
Benchmarks that recognize multiple valid reasoning paths instead of reducing everything to binary right or wrong.
Architectures that can hold multiple possibilities open instead of collapsing into a single confident guess.
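As a small illustration of the first point, here is one way a scoring rule could reward calibrated uncertainty; the point values are hypothetical, but they show how the incentive flips once wrong answers cost something and abstaining does not.

```python
# Toy abstention-aware scoring rule (hypothetical values):
# correct answer = +1, wrong answer = -1, "I don't know" = 0.
# Bluffing now pays off only when the model is actually likely to be right.

def expected_score(p_correct: float, abstain: bool) -> float:
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * -1.0

for p in (0.2, 0.5, 0.8):
    guess = expected_score(p, abstain=False)
    best = "guess" if guess > 0 else "abstain"
    print(f"p(correct)={p:.1f}  guessing={guess:+.1f}  abstaining=+0.0  better to {best}")
# At p=0.2 abstaining wins and at p=0.8 guessing wins,
# so the optimal behavior finally tracks the model's real uncertainty.
```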
Until then, we should be honest: these systems are powerful predictors trained on human data, where “hallucination” and “prediction” are two sides of the same coin. Calling them hallucinations softens the truth. They are not mistakes. They are design choices.