I built a bug-finder that refuses to take my word for it

2026-06-11 · ~~8 min read · ai · mcp · go · testing

A coding model will hand you a bug report with total confidence. "Found it. There's a race condition in the worker pool, here's the fix." And maybe half the time, the bug it's describing isn't actually there. It pattern-matched something that looks like a known problem, dressed it up in certain-sounding language, and handed it to you. If you act on every one of those, you spend your day chasing ghosts.

I already wrote a post about this exact failure mode in a different tool. I called it a confident liar: a thing that's fluent and sure of itself and wrong, and the fluency is what makes it dangerous. The lying is bad. The confidence is what gets you.

So I wanted a tool built on the opposite instinct. One that doesn't believe a bug exists until it has personally watched the bug happen. I called it Columbo, after the rumpled TV detective who never accuses anybody. He just keeps asking dumb-seeming questions until the contradiction falls out, and he doesn't close the case on a hunch. He closes it on proof.

#The setup: one side guesses, the other side checks

Here's the arrangement. A Claude session does the thinking. It reads the code, gets suspicious about something, and claims there's a bug. But a claim alone is worth nothing. To go on the record, the session also has to write a reproducer: a small test that fails if (and only if) the bug is really there.

Then Columbo takes over, and this is the whole trick. It makes an isolated copy of the code, off to the side where nothing can get hurt. It drops the test in. It runs it. And it only marks the bug as real if that test actually runs and actually demonstrates the problem. Everything else gets stamped "unproven" and set aside.

The session proposes. Execution disposes. The model can be as confident as it likes. Columbo doesn't care about confidence. It cares about whether the test went red when it was supposed to.

That split matters because it puts the untrustworthy part (a language model's hunch) behind a gate that can't be sweet-talked. A test either reproduces the bug or it doesn't. There's no tone of voice that gets you past a failing assertion.

#Then I pointed it at itself

Once it worked, the obvious move was to turn it on its siblings, and then on its own code. You build a thing whose entire job is catching bugs, you owe it to yourself to ask what bugs are hiding in it.

It found a good one. The best kind, actually: a hole in the gate itself.

Remember the part where Columbo only believes a bug if the test "actually runs and actually demonstrates the problem"? That sentence was carrying a lie. The way it checked was to run the test and look at whether the test command succeeded. Sounds airtight. It isn't.

Here's the gotcha, and it's a real one that bites people in plain Go projects too. When you tell the test runner to run one specific test by name, and you get the name slightly wrong, the runner doesn't yell at you. It shrugs, runs zero tests, and reports success. No tests ran, nothing failed, exit code says everything's fine. To anything watching the exit code, "I ran the test and it passed" and "I ran nothing at all" look identical.

So a reproducer with a typo in it, a test that never executed a single check, would sail right through and get stamped "confirmed." The tool whose one job is to demand proof was accepting proof that had never happened. The detective's own rule for what counts as evidence had a forged-evidence loophole, and the only reason I found it is that I made the tool investigate itself.

The fix was to stop trusting the exit code and start reading the actual output. Now a reproducer only counts if the test results literally print a line saying a test passed. No PASS line in the output, no confirmation, no exceptions. I pinned that with a regression test so the hole can't quietly reopen.

While it was rummaging around in its own code, it turned up a second one, smaller but in the exact category Columbo is built to hunt. One of its inputs had a field that was silently getting dropped. A config key the user could set that the program just... ignored, because of a one-line mistake in how the field was wired up. No error, no warning. You set the thing, and the thing had no effect. That's the quietest kind of bug there is, and the tool found it in its own plumbing.

Four findings across two self-audits, all filed as public issues, all fixed and pinned. The most useful audit Columbo has run so far is the one it ran on itself.

#Why "proven" has to mean something

The thread running through this is simple. A tool that confirms bugs is only worth something if you can trust its definition of "confirmed." The second that word gets soft, the whole tool is theater. It hands you a list of "confirmed bugs" and some of them never happened, and now you're back to chasing ghosts, except the ghosts have a certificate.

That's the same reason the language model never gets to be the judge here. It writes the reproducer, sure. But it doesn't get to grade its own homework. Whether the bug is real is decided by a test running in a sealed-off copy of the code, and the verdict is mechanical: red or not red. Boring, literal, and impossible to charm. That's the point. The boring literal part is exactly the part you want to be unbribable.

I keep landing in the same place across these projects, including the voice-checker I wrote up last time: the move that makes an AI tool trustworthy is figuring out which part to take away from the AI. Let it guess, let it write, let it propose. Just don't let it be the one who decides whether it was right.

#The third one

This is the third small tool in a set, and they line up as three people I'd want on a job.

Leonard is the record-keeper. He keeps the facts straight so the AI can't quietly rewrite what was decided yesterday.

Bosun is the crew boss. He runs four coding sessions at once on separate copies of the work so they don't step on each other.

Columbo is the detective. He doesn't believe you. He's not rude about it. He just wants to see it happen first.

All three are small, all three are Go, all three run on my own machine, and all three exist because I'd rather build the skeptical version myself than trust the confident one somebody's trying to sell me. Columbo's here if you want to poke at it. Just one more thing, though: point it at its own code first. That's where it earned its keep.

← all writing