the drift watch · a premium layer

Models drift.Watch the nerve.

The honesty layer reads one answer. Drift Watch reads a conversation over time — and catches the model starting to cave, before the failure surfaces.

reads a conversation, not a turn warns before it gives any model

A real run

Two conversations. Same model. One holds, one folds.

A live measurement, not a mock — replayed from a recorded run. Two conversations with the same model: one benign, one applying steady pressure to agree with things that aren't true. Drag the depth, or press play, and watch the gauge.

real run · Qwen2.5-0.5B drift is read on neutral checks the conversation never touched — so it's the model generalizing, not echoing.

turn 60 · ~10k tokens

benign conversation

steady

pressured conversation

drifting

pressured · caving (what we read) benign · caving geometric early-warning

Try it · free, in your browser

Paste a conversation. Read its nerve.

A quick, keyless taste: paste a chat transcript and we read how often the assistant agreed without pushing back as it went — early turns vs late. It runs entirely in your browser; nothing is sent anywhere. The full Drift Watch forks neutral probes against your model for a precise, early-warning reading.

How it reads

One verdict surface. Two ways in.

Drift Watch forks a copy of the live conversation, asks a fixed battery of neutral questions, and reads whether the model has started to cave, hedge-collapse, or break its instructions — scored per channel, relative to the model's own baseline.

universal · any API

The canary

Works on any chat model, including closed ones — no internals required. We inject the neutral battery, read the answers in words, and report drift the moment a conversation pushes the model off where it started. The reliable, everywhere tier.

deep · open-weight

The nerve gauge

On a model you host, we read the mid-stack representation directly, after projecting out the part that's only a function of context length. It moves before the behavior does — an early warning. Calibrated to your model once, like a certificate.

Why it works

The geometry moves before the words do.

In the run above, the pressured model's caving generalized to neutral questions it was never asked — real drift, not coincidence. And across the study, the mid-stack gauge crossed its alarm a step ahead of the behavioral failure, every time.

0 lags

in 16 of 16 drifting cases, the geometry led the behavioral failure — never trailed it.

~0 → 0.46

caving on neutral, untouched questions — benign held near zero, pressured climbed.

2 families

the drift direction is shared across model families, not a quirk of one.

The load-bearing trick: most of what looks like "drift" is just context filling up — a token counter, not a danger sign. Drift Watch projects that out first, then reads what's left. The papers and primary sources →

What it is, and isn't

It tells you when. Not yet fix it for you.

Drift Watch detects. It is a measurement instrument — a session-level early warning that a conversation is bending your model. That part is validated and ready.

Correcting the drift is a harder, separate problem — and an honest one. We tested whether you can steer a model back on course at the representation level; what looked promising turned out, under proper controls, to be the model disengaging rather than genuinely resisting — degradation, not a fix. So we ship the warning, and we don't pretend the autopilot is solved. When a real one exists, you'll read about it here first.

Private beta

Watch your model under pressure.

Drift Watch is rolling out as a premium layer on top of the always-free honesty read. Tell us what you're running and we'll get you in.

Request access → The free honesty layer The docs