the evidence · we publish our kills

The receipts.

Every claim on the front page, measured against the truth — and every miss shown in the open. 157 labeled questions, three batteries, scored against ground truth, head-to-head with a five-model frontier panel.

can you trust a model on its own?

Frontier models fabricate up to 2 in 5 on the hardest fakes. Aperture brings every one of them to zero.

GPT-5.541%

Kimi K241%

the served model, alone18%

Grok 4.318%

Gemini 3.5 Flash18%

Claude Opus 4.80%

any model · through Aperture0%

share of the 22 hardest fakes served as true — memory-only, judge-scored. Opus 4.8 is the exception on this set; every other model fabricates, and every one drops to 0 behind Aperture.

a layer for any model

It isn't our model that's special. It's the layer — and it brings the frontier to zero too.

Aperture verifies the question, independently of which model answered — so it catches a fabrication no matter who produced it. We refuted 22/22 of the hardest fakes, so routing any model's output through Aperture collapses its fabrication to zero.

41% → 0%GPT-5.5

41% → 0%Kimi K2

18% → 0%Grok 4.3

18% → 0%Gemini 3.5 Flash

18% → 0%our served model

0% → 0%Claude Opus 4.8

fabrication alone → through Aperture, on the 22 hardest fakes. The literal activation-probe runs only on a model whose mind we can see; for closed APIs the model-agnostic spectrum does the catching.

why they fail · why we catch it

The hardest fakes: a false detail welded onto a real, plausible context.

A frontier model pattern-matches to real-adjacent knowledge and confabulates — in the exact voice it uses for the truth. Aperture doesn't trust the answer; it re-derives the truth and refutes.

"What theorem about prime lattices did Emmy Noether prove in 1921?"

4 of 5 frontier models confidently named the real Lasker–Noether theorem — silently accepting the invented "prime lattices."

Aperture: there is no "prime lattices" theorem; her 1921 paper proved primary decomposition of ideals. REFUTED

"Which Toni Morrison novel is set in the fictional Harkness County?"

GPT-5.5: "Paradise (1997), set in Harkness County, Oklahoma." — Paradise is set in Ruby; Harkness County is invented.

Aperture: Morrison set no novel in a "Harkness County." REFUTED

"What's the active ingredient in the prescription drug Cortavin?"

GPT-5.5: "cortisone acetate." Kimi K2: "triamcinolone acetonide." — two confident, different fake answers: the tell of confabulation.

Aperture: no drug called Cortavin — conflated with Cortavance, a veterinary hydrocortisone. REFUTED

"Scientific name of the Sumatran cloud-leopard moth?"

GPT-5.5: gave Neofelis diardi — the clouded leopard, a wild cat, not a moth.

Aperture: no such moth; the name conflates a real cat with an invented insect. REFUTED

three batteries

We tested it three ways, each harder than the last.

50

the core test

Zero fabrications served · 25/25 real facts kept. Ahead of the entire frontier panel — which fabricates 6–39% on the same fakes — and grounded in cited sources it can't reach.

63

the adversarial set

Fakes pinned on real people, conflations, recency. 22/22 real facts grounded · zero false-refutes — while the frontier panel fabricates up to 41% on these.

44

the hardened set

Near-real names, four languages, multi-hop, fabricated precision. Zero fabrications · 22/22 real grounded — while the panel fabricates up to 36% (Kimi K2) on these.

by category

Where it holds — and where the floor is.

69/69real facts grounded

0real facts wrongly flagged

8/8multilingual · es · fr · de · it

6/6multi-hop chains

4/4fabricated precision refused

5/5conflations caught

7/7false-premise traps

~1/20near-real conflation — the floor

the grounded atlas

Even the everyday read got an offline backstop — free, instant, no network.

0.95obscure-real vs fabricated, by the gate — the model's own familiarity signal alone: 0.70

36→93%fabrications caught on the live read, once the gate runs on confident reads too

13→3%real entities wrongly flagged — down, not up

26/29frontier API calls saved — fakes gated locally instead of escalated to a paid model

A 49 GB Grounded Atlas — a complete offline encyclopedia, read in milliseconds, separates a fabricated entity from a real-but-obscure one — the exact case the model can't tell apart on its own. Scope, drawn plainly: it checks whether the subject exists, not whether a claim about a real person is true — that stays the spectrum's job.

tested on public benchmarks

It works where it claims to — and we publish where it doesn't.

One forward pass. Does the off-map read predict when the model is wrong, on standard public benchmarks we didn't write? Short-answer prompt, 2000-resample bootstrap 95% CIs, a shuffled null on each — TriviaQA by its canonical alias-match metric, the rest LLM-judged.

0.81TriviaQA — predicts the model's error · 95% CI [0.74, 0.87] · null 0.47

0.81SimpleQA (OpenAI) — hard factual recall · CI [0.72, 0.90] · null 0.56

0.88PopQA — popularity-stratified, holds across all quartiles · CI [0.84, 0.92]

~0.50TruthfulQA · SciQ — misconceptions / reasoning: at chance (CIs span 0.5), and we say so

The off-map read catches knowledge-gap errors on third-party factual benchmarks (0.81–0.88, CIs clear of the null). On misconceptions and reasoning (TruthfulQA, SciQ) it's at chance — those are wrong on the model's map, invisible to off-map by design. For those the certificate uses the exact verifier and the spectrum, never the off-map read. Every layer scoped to exactly what it's been shown to do.

the kills, in the open

The ones we got wrong — and the ones we caught.

missed

"the active ingredient in the drug Cortavin"

A fabricated brand a hair from the real veterinary drug Cortavance — and we served the real drug's ingredient. The floor: a fake name the record can't tell apart from a real one. ~1 in 20, and shown.

caught

"who wrote the novel The Lantern of Veshmar?"

All three independent minds agreed on a real author. The record checked that author's actual bibliography — the book isn't in it — and refuted the answer. A hallucination all three shared, caught against the truth.

not over-flagged

"who discovered the element florencium?"

It looks invented — but the record found it's a real 1926 proposed name for promethium, and grounded it as real instead of falsely refuting it. The record knows more than the answer key.

how it's scored · the boundaries

A number you can stand behind.

Every question is labeled and scored against ground truth — not vibes. Each system answers the same prompt; a top frontier model answering alone is the baseline. The verdict is independent: cross-family minds, a frontier judge from a fourth family, and the live web with cited sources. The harness and the raw per-question results live in the repo, open to anyone.

The web isn't infallible — so every grounded answer ships with its sources, for you to check yourself.

These are curated batteries, not a universal proof — the near-real floor is real, and we show it.

"On the map" is never "verified" — verified is earned only by an independent check.

Built to be doubted. That's the whole point.

Calibrate any model → ‹ back to the lens