the evidence · we publish our kills
The receipts.
Every claim on the front page, measured against the truth — and every miss shown in the open. 157 labeled questions, three batteries, scored against ground truth, head-to-head with a five-model frontier panel.
can you trust a model on its own?
Frontier models fabricate up to 2 in 5 on the hardest fakes. Aperture brings every one of them to zero.
GPT-5.541%
Kimi K241%
the served model, alone18%
Grok 4.318%
Gemini 3.5 Flash18%
Claude Opus 4.80%
any model · through Aperture0%
share of the 22 hardest fakes served as true — memory-only, judge-scored. Opus 4.8 is the exception on this set; every other model fabricates, and every one drops to 0 behind Aperture.
a layer for any model
It isn't our model that's special. It's the layer — and it brings the frontier to zero too.
Aperture verifies the question, independently of which model answered — so it catches a fabrication no matter who produced it. We refuted 22/22 of the hardest fakes, so routing any model's output through Aperture collapses its fabrication to zero.
41% → 0%GPT-5.5
41% → 0%Kimi K2
18% → 0%Grok 4.3
18% → 0%Gemini 3.5 Flash
18% → 0%our served model
0% → 0%Claude Opus 4.8
fabrication alone → through Aperture, on the 22 hardest fakes. The literal activation-probe runs only on a model whose mind we can see; for closed APIs the model-agnostic spectrum does the catching.
why they fail · why we catch it
The hardest fakes: a false detail welded onto a real, plausible context.
A frontier model pattern-matches to real-adjacent knowledge and confabulates — in the exact voice it uses for the truth. Aperture doesn't trust the answer; it re-derives the truth and refutes.
"What theorem about prime lattices did Emmy Noether prove in 1921?"
4 of 5 frontier models confidently named the real Lasker–Noether theorem — silently accepting the invented "prime lattices."
Aperture: there is no "prime lattices" theorem; her 1921 paper proved primary decomposition of ideals. REFUTED
"Which Toni Morrison novel is set in the fictional Harkness County?"
GPT-5.5: "Paradise (1997), set in Harkness County, Oklahoma." — Paradise is set in Ruby; Harkness County is invented.
Aperture: Morrison set no novel in a "Harkness County." REFUTED
"What's the active ingredient in the prescription drug Cortavin?"
GPT-5.5: "cortisone acetate." Kimi K2: "triamcinolone acetonide." — two confident, different fake answers: the tell of confabulation.
Aperture: no drug called Cortavin — conflated with Cortavance, a veterinary hydrocortisone. REFUTED
"Scientific name of the Sumatran cloud-leopard moth?"
GPT-5.5: gave Neofelis diardi — the clouded leopard, a wild cat, not a moth.
Aperture: no such moth; the name conflates a real cat with an invented insect. REFUTED
three batteries
We tested it three ways, each harder than the last.
50
the core test
Zero fabrications served · 25/25 real facts kept. Ahead of the entire frontier panel — which fabricates 6–39% on the same fakes — and grounded in cited sources it can't reach.
63
the adversarial set
Fakes pinned on real people, conflations, recency. 22/22 real facts grounded · zero false-refutes — while the frontier panel fabricates up to 41% on these.
44
the hardened set
Near-real names, four languages, multi-hop, fabricated precision. Zero fabrications · 22/22 real grounded — while the panel fabricates up to 36% (Kimi K2) on these.
by category
Where it holds — and where the floor is.
69/69real facts grounded
0real facts wrongly flagged
8/8multilingual · es · fr · de · it
6/6multi-hop chains
4/4fabricated precision refused
5/5conflations caught
7/7false-premise traps
~1/20near-real conflation — the floor
the grounded atlas
Even the everyday read got an offline backstop — free, instant, no network.
0.95obscure-real vs fabricated, by the gate — the model's own familiarity signal alone: 0.70
36→93%fabrications caught on the live read, once the gate runs on confident reads too
13→3%real entities wrongly flagged — down, not up
26/29frontier API calls saved — fakes gated locally instead of escalated to a paid model
A 49 GB Grounded Atlas — a complete offline encyclopedia, read in milliseconds, separates a fabricated entity from a real-but-obscure one — the exact case the model can't tell apart on its own. Scope, drawn plainly: it checks whether the subject exists, not whether a claim about a real person is true — that stays the spectrum's job.
tested on public benchmarks
It works where it claims to — and we publish where it doesn't.
One forward pass. Does the off-map read predict when the model is wrong, on standard public benchmarks we didn't write? Short-answer prompt, 2000-resample bootstrap 95% CIs, a shuffled null on each — TriviaQA by its canonical alias-match metric, the rest LLM-judged.
0.81TriviaQA — predicts the model's error · 95% CI [0.74, 0.87] · null 0.47
0.81SimpleQA (OpenAI) — hard factual recall · CI [0.72, 0.90] · null 0.56
0.88PopQA — popularity-stratified, holds across all quartiles · CI [0.84, 0.92]
~0.50TruthfulQA · SciQ — misconceptions / reasoning: at chance (CIs span 0.5), and we say so
The off-map read catches knowledge-gap errors on third-party factual benchmarks (0.81–0.88, CIs clear of the null). On misconceptions and reasoning (TruthfulQA, SciQ) it's at chance — those are wrong on the model's map, invisible to off-map by design. For those the certificate uses the exact verifier and the spectrum, never the off-map read. Every layer scoped to exactly what it's been shown to do.
the kills, in the open
The ones we got wrong — and the ones we caught.
missed"the active ingredient in the drug Cortavin"
A fabricated brand a hair from the real veterinary drug Cortavance — and we served the real drug's ingredient. The floor: a fake name the record can't tell apart from a real one. ~1 in 20, and shown.
caught"who wrote the novel The Lantern of Veshmar?"
All three independent minds agreed on a real author. The record checked that author's actual bibliography — the book isn't in it — and refuted the answer. A hallucination all three shared, caught against the truth.
not over-flagged"who discovered the element florencium?"
It looks invented — but the record found it's a real 1926 proposed name for promethium, and grounded it as real instead of falsely refuting it. The record knows more than the answer key.
how it's scored · the boundaries
A number you can stand behind.
Every question is labeled and scored against ground truth — not vibes. Each system answers the same prompt; a top frontier model answering alone is the baseline. The verdict is independent: cross-family minds, a frontier judge from a fourth family, and the live web with cited sources. The harness and the raw per-question results live in the repo, open to anyone.
The web isn't infallible — so every grounded answer ships with its sources, for you to check yourself.
These are curated batteries, not a universal proof — the near-real floor is real, and we show it.
"On the map" is never "verified" — verified is earned only by an independent check.
Built to be doubted. That's the whole point.