I pointed Leonard at its own code, and caught it lying

2026-06-10 · ~~7 min read · ai · security · audit · go · leonard

Part 3 of the Building Leonard series. Part 1 was why I built it, Part 2 was how it works. This one is the part where it bit me.

So I had this tool I was pretty proud of. Leonard checks the AI's claims against the real code, stops an edit that references a function that doesn't exist, keeps the confident fictions out of my codebase. I'd been using it every day. I'd put it through round after round of bug hunting, four separate security reviews, every High and Critical I found closed out. I figured the dangerous stuff was behind me and I was down to polishing.

Then I did the thing I make myself do with anything I've shipped: I went back and pointed its own discipline at its own code. Fresh adversarial lens, no mercy, "what would someone trying to break this find that I didn't." I expected a few mediocre findings. What I got was a High that re-introduced the exact failure the entire tool exists to prevent.

A deleted function still verified as existing.

#The lie-catcher had a lie in it

Let that sit for a second, because it sat with me for a while. The one promise Leonard makes is: the model can't reference a thing that isn't really there, because I check the name against an index of what actually exists. And here was Leonard itself, asked "does this function exist," cheerfully answering yes, with a precise file path and line number, for a function whose file I had deleted three days earlier.

I deleted the file. I ran ls and the shell told me it was gone. Then I asked Leonard, and Leonard told me it was there. My tool for catching confident lies had become a confident liar about itself.

I'll spare you the full autopsy here (it's its own post if you want the gory version), but the short shape of it is almost funny. A while back, one of those security reviews added a cap to a function so a giant repo couldn't blow up memory. Smart fix. The problem is that same function had another caller, the part that sweeps out deleted files, and that caller needed all the rows, not a capped page of them. The cap quietly applied to both. So the cleanup sweep silently stopped at the first thousand files, alphabetically, and everything past that never got checked against disk. Delete a file late in the alphabet and its entry just lives in the index forever, still answering "yes, I exist."

The fix was about a dozen lines in one spot. The bug had been shipping for six versions.

#The part that actually rattled me

It wasn't the bug. Bugs happen. It was why the rounds before it missed it.

There was a test for exactly this. A regression test that deletes a file and asserts the index forgets it. It had been passing on every single run. But the test used a fixture of about thirty files, and the cap doesn't kick in until a thousand. So the test was perfectly correct for the bug it was written to catch, and perfectly blind to the bug a later change introduced right next to it. Two people signed off on that cap, and one of them was me. We both checked the thing the cap was supposed to protect. Neither of us checked the other caller standing right there.

Which is, if you back up, the exact same disease as the AI in Part 1. Leonard was right almost all the time. That's precisely what made the rare wrong answer dangerous, because I'd stopped checking. I trusted it. A tool that's right ninety-nine times teaches you to skip looking on the hundredth, and the hundredth is a deleted function reporting for duty. I built a thing to stop me from trusting a confident source on faith, and then I trusted it on faith.

#What dogfooding actually taught me

A few things I keep coming back to.

A check you've never deliberately tried to break is a check you're trusting on hope. The test passed for months. Passing meant nothing, because nobody had ever made the fixture big enough to cross the line where the bug lived. If you've got a guard with a limit in it, your test has to push past that limit on purpose, or the guard is just decoration that happens to be green.

And the real work on a tool like this was never the building. The building was a couple of good weekends. The work is keeping it honest after it ships, when it's earned your trust and stopped earning your scrutiny. That's the unglamorous part, and it's the whole part. It's also, not by accident, the same thing I keep saying about every system I build: the goal isn't to ship it, it's to keep it true over time, durable and checkable, so the version of me six months from now can't quietly lie to himself through it.

There's a deeper idea sitting under all three of these posts, about why some of Leonard's guards held up under attack when this one didn't, and what makes a safety mechanism something an attacker can't just argue with. That's the last part of this series, and it's the one I'm most interested in.

When was the last time you made a test fixture big enough to actually trip the limit it was guarding? Or have you been shipping green checkmarks that never once crossed the line?

← all writing