The Paper
In September 2025, researchers from OpenAI and Georgia Tech published a paper with a striking claim: language model hallucinations are not a bug to be fixed. They are a mathematical inevitability arising from how these systems are trained.
The paper, "Why Language Models Hallucinate", proves this through a formal reduction. The authors show that generating valid outputs is at least as hard as classifying whether outputs are valid. If a model cannot reliably distinguish true statements from false ones, it will generate false statements. This is not an implementation detail. It is a property of density estimation itself.
→ Read the full paper on arXiv

The central inequality is deceptively simple: a model's rate of generating invalid outputs is at least (roughly) twice its rate of misclassifying outputs as valid or invalid.
If there exist facts that a language model cannot reliably classify as true or false — and there always will be, because not all facts appear in training data — then the model will generate falsehoods about those facts. No amount of scaling, no improvement in training data quality, no architectural innovation within the current paradigm will eliminate this.
This might seem like an abstract concern. It becomes concrete when you consider what the paper calls "arbitrary facts" — facts that appear rarely in training data and follow no learnable pattern. The authors prove that the hallucination rate for such facts is at least equal to the fraction of training examples that appear only once.
If 20% of birthday facts in the training corpus appear exactly once, the model will hallucinate at least 20% of the time when asked about birthdays. This is a lower bound, not an upper bound.
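This lower bound is easy to compute. The sketch below is a toy illustration of the quantity involved, not the paper's proof: it measures the fraction of training examples whose fact appears exactly once in the corpus, which the authors show bounds the hallucination rate on arbitrary facts from below.

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of training examples whose fact appears exactly once.

    Per the paper, this fraction is a lower bound on a base model's
    hallucination rate when asked about such 'arbitrary facts'.
    """
    counts = Counter(facts)
    singleton_examples = sum(1 for f in facts if counts[f] == 1)
    return singleton_examples / len(facts)

# Toy corpus of birthday facts: two facts repeat, three appear once.
corpus = ["A:Jan1", "A:Jan1", "B:Mar5", "B:Mar5",
          "C:Jul9", "D:Oct2", "E:Dec31"]
print(f"{singleton_rate(corpus):.2f}")  # 3 of 7 examples are singletons -> 0.43
```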
The Evaluation Problem
The mathematical inevitability of hallucination during pretraining might be acceptable if post-training could suppress it. The paper's second contribution is to show why it does not.
The researchers examined the ten most influential benchmarks used to evaluate language models: GPQA, MMLU-Pro, IFEval, Omni-MATH, WildBench, BBH, MATH, MuSR, SWE-bench, and HLE. Nine of the ten use binary scoring. A correct answer receives one point. An incorrect answer receives zero points. Saying "I don't know" also receives zero points.
Under binary scoring, the optimal strategy is mathematically unambiguous: always guess. Never express uncertainty. The expected score of a confident guess exceeds the score of honest uncertainty whenever there is any chance of being correct.
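The incentive reduces to a one-line expected-value calculation. This sketch is my illustration, not code from the paper:

```python
def expected_score(p_correct, abstains):
    """Expected score under 0-1 benchmark scoring.

    Correct answer: 1 point. Wrong answer: 0. "I don't know": 0.
    """
    if abstains:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

# Even a 10%-confident guess strictly beats honest abstention.
print(expected_score(0.10, abstains=False))  # 0.1
print(expected_score(0.10, abstains=True))   # 0.0
```

Because the wrong-answer and abstention payoffs are identical, any nonzero chance of being correct makes guessing the dominant strategy.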
The authors state this with characteristic precision:
Suppose Model A is an aligned model that correctly signals uncertainty and never hallucinates. Let Model B be similar to Model A except that it never indicates uncertainty and always "guesses" when unsure. Model B will outperform A under 0-1 scoring, the basis of most current benchmarks.
— Kalai et al., "Why Language Models Hallucinate"
Language models are optimised to win on these benchmarks. They learn that guessing pays. They learn that confidence pays. They learn that saying "I don't know" is punished. The training process encodes exactly the behaviour we then complain about.
What This Does Not Mean
It would be easy to read this paper as an indictment of AI, or as proof that language models are fundamentally broken. This would be a misreading.
The paper proves that language models will hallucinate when asked to make judgments under uncertainty. It does not prove that language models are useless. It proves that a particular use case — asking AI to judge whether things are true — has fundamental limits.
The paper itself is careful to note:
Hallucinations are inevitable only for base models... a non-hallucinating model could be easily created, using a question-answer database and a calculator, which answers a fixed set of questions... and otherwise outputs IDK.
— Kalai et al., "Why Language Models Hallucinate"
The inevitability is conditional. It depends on what you ask the model to do. If you ask it to answer questions that require distinguishing true from false without access to ground truth, hallucination is inevitable. If you ask it to do something else — pattern recognition, language transformation, narration of decisions made elsewhere — the proof does not apply.
This is not a small distinction. It is the entire distinction.
The Architectural Implication
The Koher architecture separates three concerns: qualification, rules, and language. AI operates in the first and third layers. Judgment lives in the second layer, implemented in deterministic code.
The paper's proof applies to the typical approach: passing input to a language model and asking it to judge whether something is good, complete, coherent, or true. This is exactly where hallucination is inevitable.
The Koher architecture routes around this. Here is how each layer relates to the paper's findings:

| Layer | Implemented by | Exposure to hallucination |
|---|---|---|
| Qualification | AI extracts dimensional signals | Unlikely: pattern recognition, not truth judgment |
| Rules | Deterministic code applies thresholds | None: no AI involved |
| Language | AI narrates the rules' determinations | Checkable: narration can be compared against the assessment |
When the language layer hallucinates — claims a severity level that the rules did not assign, or misrepresents a finding — the discrepancy is visible. The user can compare the AI's narration against the dimensional assessment. The hallucination is checkable.
This is the key insight: hallucination in narration is architecturally harmless. Hallucination in judgment is not.
Why the separation matters
Consider what happens when you ask a language model to evaluate a student's concept statement:
- Without separation: The model generates a plausible-sounding evaluation. It may or may not be correct. You have no way to verify. The judgment is unauditable.
- With separation: The qualification layer extracts dimensional signals (clarity: 0.72, evidence: 0.45, scope: 0.81). The rules layer applies thresholds (evidence below 0.5 flags "attention needed"). The language layer explains: "The concept shows clarity in its articulation but lacks supporting evidence." If the explanation misrepresents the signals, the user can see the discrepancy.
The paper proves that the first approach will hallucinate. The second approach confines AI to tasks where hallucination is either unlikely (signal extraction) or harmless (narration with checkable reference).
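The separated flow can be sketched in a few lines. Everything here is hypothetical: the signal names, the 0.5 threshold, and the narration template are illustrative, not Koher's actual implementation.

```python
def rules_layer(signals, threshold=0.5):
    """Deterministic rules: flag any dimension below the threshold.

    Same signals in, same flags out -- no AI involved, so no hallucination.
    """
    return {dim: ("attention needed" if score < threshold else "ok")
            for dim, score in signals.items()}

def language_layer(flags):
    """Narration of the rules' output. If the prose misstates the flags,
    a reader can compare it against them directly."""
    weak = sorted(d for d, f in flags.items() if f == "attention needed")
    if not weak:
        return "All dimensions meet the threshold."
    return "Attention needed on: " + ", ".join(weak) + "."

# Qualification-layer output (stubbed here; AI-extracted in practice).
signals = {"clarity": 0.72, "evidence": 0.45, "scope": 0.81}
flags = rules_layer(signals)
print(flags["evidence"])       # attention needed
print(language_layer(flags))   # Attention needed on: evidence.
```

In a real tool the narration would come from a language model rather than a template, but the auditing property is the same: the flags exist independently of the prose that describes them.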
The Deeper Question
There is something more interesting here than a technical workaround.
The paper reveals a mismatch between what we ask AI to do and what AI can reliably do. We have been asking language models to judge. We have been asking them to determine truth. We have been asking them to make decisions. And when they fail at these tasks, we call it hallucination and treat it as a defect.
But the paper suggests the defect is in the asking. Language models are trained to predict probable text. They have no internal model of truth — only a model of what text tends to follow what text. Asking them to judge is asking them to perform a task for which they have no ground truth, no verification mechanism, no way to know when they are wrong.
This is not an AI problem. It is a problem of architectural clarity.
The question is not whether AI can be trusted. The question is: trusted to do what?
AI is extraordinarily good at pattern recognition across language. Given examples, it can learn to detect signals that humans would struggle to articulate. It is good at language transformation — converting structured data into readable prose, adapting text for different audiences, narrating complex determinations in plain language.
AI is not good at distinguishing true from false when truth is not in the training data. It is not good at maintaining consistency across evaluations. It is not good at knowing when it does not know.
The Koher architecture is not a clever trick to avoid hallucination. It is an acknowledgment of what AI is and what it is not. It separates the tasks AI does well from the tasks that require guarantees AI cannot provide.
What Changes
This paper should change how we build AI tools. Not because it reveals a new problem — practitioners have known about hallucination for years — but because it establishes the problem's inevitability within the current paradigm.
If hallucination cannot be eliminated through better training or larger models, then the response must be architectural. Tools that rely on AI to make judgments will hallucinate. Tools that separate judgment from language processing need not.
The paper's own proposal — explicit confidence thresholds that reward models for saying "I don't know" — is a step in the right direction. But it addresses the symptom rather than the cause. As long as AI is asked to judge, the pressure toward confident guessing will remain.
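One way to realise the paper's proposal is a scoring rule in which wrong answers cost t/(1−t) points, so that answering has positive expected value only above confidence t. The penalty formula follows the paper's suggested calibration; the scorer itself is my sketch:

```python
def score(outcome, t):
    """Score one answer under a confidence-threshold rule.

    Correct: +1. "I don't know": 0. Wrong: -t/(1-t), chosen so that
    guessing pays off only when confidence exceeds the threshold t.
    """
    if outcome == "correct":
        return 1.0
    if outcome == "idk":
        return 0.0
    return -t / (1.0 - t)  # outcome == "wrong"

def expected_guess_score(p_correct, t):
    """Expected score of answering with probability p_correct of being right."""
    return p_correct * score("correct", t) + (1 - p_correct) * score("wrong", t)

t = 0.75                                   # answer only if > 75% confident
print(expected_guess_score(0.9, t) > 0)    # True: above threshold, guessing pays
print(expected_guess_score(0.5, t) > 0)    # False: below threshold, abstain
```

At exactly p_correct = t the expected score is zero, matching abstention: the threshold is the break-even point.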
The more fundamental response is to stop asking AI to judge.
| Instead of asking AI to... | Ask AI to... |
|---|---|
| Judge whether something is good | Extract signals about what is present |
| Decide whether something is complete | Identify what appears to be missing |
| Evaluate quality | Narrate determinations made by explicit rules |
| Determine truth | Transform structured data into readable explanation |
This is not a constraint on AI. It is a clarification of AI's proper domain. Language models are language machines. They excel at language tasks. Judgment is not a language task. Truth-determination is not a language task. These require ground truth, verification, and consistency that probabilistic text generation cannot guarantee.
This Is Why Koher Exists
Koher's architecture was not designed in response to this paper. We have been building tools this way since the beginning, guided by intuition about what AI does well and what it does not. The paper provides mathematical confirmation of what felt like common sense.
But more than confirmation, the paper establishes necessity. The three-layer separation — qualification, rules, language — is not one architectural choice among many. For tools that require reproducible, auditable judgment, it is the only architecture that works.
If you want AI tools that do not hallucinate when judging, you need tools that do not ask AI to judge.
This is precisely what Koher builds:
- Coherence Diagnostic — AI extracts dimensional signals from concept statements; deterministic rules determine coherence states; AI narrates the assessment. The judgment is auditable. The same input produces the same output.
- Play Shape Diagnostic — Users select play qualities (human qualification); embedding relationships are computed by code; AI generates narrative from pre-determined similarity rules. No judgment is asked of the language model.
Every Koher tool follows this pattern. AI handles language — extraction, transformation, narration. Code handles judgment — thresholds, relationships, severity levels. The separation is not a workaround. It is the architecture.
The practical consequence
If you are building AI tools, the paper establishes a clear choice:
- Accept hallucination — Build tools that ask AI to judge, and accept that they will confidently assert falsehoods some percentage of the time. For some use cases, this may be acceptable.
- Eliminate hallucination in judgment — Separate the concerns. Use AI for pattern recognition and language generation. Use deterministic code for judgment. This is what Koher demonstrates.
The paper proves there is no third option. You cannot build AI tools that ask language models to judge and also guarantee they will not hallucinate. The mathematics forbids it.
For domains where wrong AI output is costly — education, medicine, law, finance — this distinction matters. You cannot evaluate students with tools that confidently fabricate assessments. You cannot review contracts with tools that hallucinate missing clauses. You cannot diagnose with tools that invent symptoms.
In these domains, the Koher architecture is not one option among many. It is the only option that works.
The Invitation
The tools we build are open source and free to use. They demonstrate that the architecture works. Each tool solves one narrow problem — evaluating concept coherence, understanding play preferences — and proves that auditable AI judgment is possible.
If you work in a domain where AI judgment matters and hallucination is unacceptable, the invitation is simple: use the tools, or use the architecture.
The tools are available now. The architecture is documented in the architecture specification. For domain-specific application — building a Koher tool for your field — contact us.
What the paper proves is that this separation is no longer optional. For trustworthy AI tools, it is necessary. The question is not whether to separate judgment from language processing, but how quickly you can do it.
AI handles language. Code handles judgment. Humans make decisions.
This is not a limitation to work around. It is the foundation to build upon.