The Default Assumption
There is a pattern that has been consolidating for over a year. Open-source projects on GitHub were doing it in 2024: query multiple language models with the same prompt, compare their outputs, synthesise a consensus. Google shipped a research agent in December 2024. OpenAI followed in February 2025. Perplexity formalised it as Model Council in early February 2026 — three frontier models queried simultaneously, a synthesiser reconciling their agreements and divergences. Microsoft, most recently, paired GPT and Claude in sequence and called it Critique and Council.
The pattern varies in execution but the assumption underneath is always the same: if the output is good enough, you do not need to see how it got there.
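In pseudocode terms, the pattern described above has a simple generic shape. This is a sketch only — the model calls are stubbed with stand-in strings, and no real provider API is used:

```python
# Generic shape of the multi-model consensus pattern:
# query several models with the same prompt, then ask
# yet another model to reconcile their answers.

def query_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call to a language model."""
    return f"[{model}'s answer to: {prompt}]"

def consensus(models: list[str], prompt: str) -> str:
    drafts = [query_model(m, prompt) for m in models]
    # The synthesiser is itself a model: an opaque process
    # reconciling the outputs of other opaque processes.
    synthesis_prompt = "Reconcile these answers:\n" + "\n".join(drafts)
    return query_model("synthesiser", synthesis_prompt)
```

Every box in this sketch, including the one that decides what "consensus" means, is a model call.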
Each iteration adds more models, better orchestration, higher benchmark scores. Microsoft's combined system outscored every individual model on the DRACO benchmark by 14%. This is real. The output is improving.
But "is the output better?" has quietly become the only question the industry knows how to ask. And it is not the only question worth asking.
The Question Nobody Benchmarks
Consider what happens when two models check each other's work. Model A drafts a research report. Model B reviews it for accuracy, citation quality, relevance. The final report is better. The benchmark confirms it.
Now ask: by what criteria did Model B determine that a citation was weak? What threshold separated "acceptable" from "needs improvement"? Would those criteria hold tomorrow, after the next model update, or if the reviewer were swapped for a different model?
These are not rhetorical questions. They simply have no answers — not because the answers are hidden, but because they do not exist in any form a human can inspect. The review happened inside a probabilistic process trained on internet-scale text. The patterns that drove the review are undocumented, unstable, and unauditable.
The same holds for every multi-model system in the current wave. Whether two models check each other or three models are queried in parallel and a fourth synthesises, the judgment that decides what counts as "accurate" or "comprehensive" or "well-cited" lives inside opaque processes. Adding more opaque processes does not produce transparency. It produces better output with the same opacity.
The output improved. The judgment remained invisible.
Knowing that something is better is not the same as knowing why it is better.
Why the Criteria Matter
In some domains, they do not. A reader who wants a well-researched summary of climate policy does not need to inspect the evaluation criteria. They want the best available answer. Output quality is the right metric.
But there are domains where the criteria are the point.
| Domain | What the Criteria Do |
|---|---|
| Education | A student told their concept statement is "unclear" learns nothing. A student shown which dimension is weak — and by what threshold — learns to see their own work differently. The criteria are the lesson. |
| Clinical assessment | A practitioner receiving an AI verdict needs to verify the thresholds against their own clinical judgment. Without visible criteria, the verdict is not actionable — it is an opinion from a source that cannot be questioned. |
| Legal review | A contract clause flagged as problematic must be traceable to a specific criterion. The criterion can be argued, adjusted, overruled. An opaque flag is just a feeling with a confidence score. |
| Design critique | A designer who receives a score cannot grow from it. A designer who sees the scoring logic — and disagrees with part of it — has begun a conversation with the criteria themselves. That is where growth happens. |
In these domains, the question is not "Is the output good enough?" It is "Can I see the judgment that produced it?" — and, critically, can I learn from that judgment, challenge it, refine it?
The entire output-quality race is optimising for the first question. Nobody is working on the second.
Two Cuts
The output-quality approach and the inspectable-judgment approach are not in competition. They are making different cuts.
The single model does everything in one opaque process. Orchestration adds more opaque processes to check the first. Architectural separation makes the judgment layer readable code.
The industry's cut separates by model identity: which model generates, which model reviews, which model judges. The models themselves remain opaque. The innovation lives in the orchestration — the glue between them. This has been the pattern since those early GitHub repos in 2024, through Perplexity's Model Council, through Microsoft's Critique and Council.
Koher's cut separates by type of cognition: pattern recognition (AI), deterministic judgment (code), narration (AI). The middle layer — the one that makes the actual decision — is not a model. It is code you can open, read, and challenge.
| The Industry Separates | Koher Separates |
|---|---|
| Generation (Model A) | Pattern recognition (any AI) |
| Evaluation (Model B) | Deterministic judgment (code) |
| — | Narration (any AI) |
| by model identity | by type of cognition |
These are not better and worse versions of the same idea. They answer different questions. "Which model should handle this task?" is a routing question. "What kind of computation should handle this kind of work?" is an architectural question.
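The three-layer cut can be sketched in a few lines. Everything here is hypothetical illustration — Koher's actual code is not shown in this piece, so every name, signal, and threshold below is invented for the example:

```python
# Hypothetical sketch of separation by type of cognition.
# Only the middle layer decides, and it is plain, readable code.

def recognise_patterns(submission: str) -> dict:
    """Layer 1 (AI): extract measurable signals from the work.
    In a real system this would call a language model."""
    return {"clarity": 0.35, "citation_density": 0.6}  # stand-in output

def judge(signals: dict, thresholds: dict) -> list[str]:
    """Layer 2 (code): deterministic judgment. Every verdict
    traces to a named threshold a human can read and change."""
    return [
        f"{dimension}: below threshold {minimum}"
        for dimension, minimum in thresholds.items()
        if signals.get(dimension, 0.0) < minimum
    ]

def narrate(verdicts: list[str]) -> str:
    """Layer 3 (AI): turn verdicts into prose. In a real system
    this would call a language model; here it just joins them."""
    return "Needs work on: " + "; ".join(verdicts) if verdicts else "Passes."

thresholds = {"clarity": 0.40, "citation_density": 0.50}
print(narrate(judge(recognise_patterns("draft..."), thresholds)))
```

The outer layers can be swapped for any model. The middle layer — the one that decides — cannot drift with a model update, because it is not a model.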
The Inspectability Test
There is a simple test. Ask: "Why did you reach this conclusion?" Then trace where the answer comes from.
In any multi-model system, the chain of explanation never terminates. In architectural separation, the chain terminates in readable code.
In any multi-model system — whether it is a GitHub ensemble from 2024 or Microsoft's Critique and Council from today — "Why is this better?" produces an AI-generated explanation. Push further — "But by what criteria?" — and you get another AI-generated explanation. The chain never reaches ground. Every justification is itself a probabilistic output.
In the Koher architecture, "Why is this the verdict?" opens the Rules layer. The answer is not generated. It is there: line 47, threshold 0.40, set by the domain expert, adjustable via configuration. The chain terminates in something a human can read, challenge, and change.
This is not an abstract distinction. It determines whether someone can learn from the tool's judgment or only accept or reject its output.
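To make the terminating step concrete, here is a minimal sketch of a verdict that carries its own provenance. All names are invented for illustration; "line 47, threshold 0.40" is the essay's own example, and `rules.yaml` is a hypothetical configuration file:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str          # e.g. "minimum concept clarity"
    threshold: float   # set by a domain expert, adjustable in config
    source: str        # where a human can go to read or change it

@dataclass
class Verdict:
    passed: bool
    rule: Rule
    observed: float

    def why(self) -> str:
        """Answering 'why?' opens the rule, not a generated explanation."""
        return (f"{self.rule.name}: observed {self.observed:.2f} vs "
                f"threshold {self.rule.threshold:.2f} ({self.rule.source})")

clarity_rule = Rule("minimum concept clarity", 0.40, "rules.yaml, line 47")
verdict = Verdict(passed=False, rule=clarity_rule, observed=0.35)
print(verdict.why())
# minimum concept clarity: observed 0.35 vs threshold 0.40 (rules.yaml, line 47)
```

The answer to "why?" is a string assembled from fields a human wrote, not text a model generated. Challenging the verdict means editing a line, not re-prompting a model.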
What the Race Misses
Orchestration innovates at the routing layer. Architecture innovates at the separation itself.
The output-quality race assumes that if you make the output good enough, the opacity does not matter. For research synthesis, that assumption may hold. For domains where people need to grow from feedback, interrogate criteria, or verify judgment against their own expertise — it does not.
The race also treats AI quality as a single axis. "Better" means higher benchmark scores, fewer hallucinations, more comprehensive coverage. These are real improvements. But they are improvements along one axis — the axis of output — while leaving the axis of judgment visibility entirely untouched.
Two Questions
Both questions are legitimate. But they lead to different places. The orchestration path leads to better outputs that remain opaque — higher scores on benchmarks, fewer hallucinations, more comprehensive coverage. The architecture path leads to inspectable judgment that a human can learn from, challenge, and refine.
I use Claude constantly. I could not build Koher without it. The question was never about AI's capability. It was always about where the judgment should be visible enough that a person can grow from it, not merely receive it.
Summary
The industry is racing to make AI output better. More models, better orchestration, higher benchmarks. This has been the pattern since open-source ensemble projects in 2024. It continues through Perplexity's Model Council and Microsoft's Critique and Council. The output is improving. This is real.
But output quality and inspectable judgment are different things. The first asks whether the answer is good. The second asks whether you can see the reasoning that produced it. The current race is optimising the first while leaving the second entirely unaddressed — because no benchmark for "can you see why?" exists.
Koher separates language from judgment so that the judgment layer — the part that actually decides — is deterministic code a human can read. Not because AI is inadequate. Because some domains need the criteria to be visible, stable, and challengeable. The criteria are where learning happens.
Knowing that something is better is progress. Knowing why it is better is understanding.