← all writing

You can't prompt-inject your way past a capability that doesn't exist

This one closes a short series on building Leonard, but it stands on its own. The series: Part 1 was why I built it, Part 2 was how it works, Part 3 was the time it bit me. This is the idea underneath all three.

Here's how almost everyone makes an AI agent safe right now. You write it a system prompt full of rules. "Don't modify files outside the project. Always ask before you delete something. Never run destructive commands." Then you ship it and you feel better, because you've addressed the risk. It's right there in writing.

I get the appeal. It's fast, it's readable, and it looks like safety. You can point at the prompt in a code review and say, see, we told it not to. The rule exists. It's spelled out.

The problem is what kind of thing that rule is. A prompt is a suggestion. You are handing a set of instructions to a system that treats instructions more like a mood than a law, and asking it to please not cross a line. It might honor that line on a good day. But the model can get confused. It can drift over a long session. And it can be prompt-injected, which just means some text it reads, a file, a web page, a tool result, contains instructions of its own, and the model can't reliably tell your rules from the attacker's. Anything that depends on the model choosing to follow the rule is one clever input away from not following it.

So you haven't enforced a boundary. You've requested one. And the rule that lives only in the prompt is the rule the attacker gets to argue with.

#The reframe

The move that actually works is boring, and it's this: stop asking the agent to behave, and remove the dangerous action from the set of things it's able to do at all.

Don't tell it not to push to the remote. Don't wire up a push. Don't tell it not to write outside the project. Refuse, in code, before anything runs, any path that lands outside the project. The capability the model can't invoke is the only one you never have to worry about it invoking. You can't prompt-inject your way past a capability that doesn't exist.

This is also the thread the whole Leonard series has been pulling on without naming it. Part 3 was about a bug I found in Leonard's own code: a deleted function still reporting as present, because a regression slipped past a test nobody had ever pushed hard enough to trip. That guard failed quietly, for six versions, because it leaned on a check that turned out to be advisory in practice. Other guards in the same tool got attacked on purpose and held. The difference between the two is the whole point of this post. A guard an attacker can argue with is one that lives in a prompt, or in a test fixture that was never made big enough to bite. A structural wall isn't arguing with anyone.

#What that looks like, with real walls

I'll use four real mechanisms from two tools I built and use, Leonard and bosun. Both are public and Apache-2.0, so I can show the actual guts. Neither is famous, and I'm not selling you on adoption. I'm showing you the shape of the idea, and these are my cleanest examples because I had to make every one of these decisions by hand.

The path-trust guard, in Leonard. When the model goes to write a file, Leonard takes the path it asked for, resolves it all the way out (following any ../ or symlink games), and if the result lands outside the project root, it refuses. Not "warns." Refuses, before the write happens. There's a classic attack this kills, sometimes called the confused deputy, where you trick a trusted tool into using its own authority on your behalf. "Hey helpful tool, write this file to ../../../etc/something." The fix isn't to trust the caller's intentions. It's to refuse the path. The model can be as convinced as it likes that the write is fine. The guard doesn't read its mind, it reads the resolved path, and the path is outside the line.

The verifier-trust gate, also in Leonard. After an edit lands, Leonard can run a command to check the work, by default something like go vet. Running a command the project config hands you should make you nervous, because a tampered config could swap in anything. So the command is gated by a fingerprint. Leonard stores a SHA-256 hash of the exact command you authorized, in a per-user trust file outside the project, and recomputes the fingerprint on every edit. If someone edits the project config to slip in a different command, the hash won't match and the command simply doesn't run. There was a whole class of attempts to sneak a command past this by dressing it up in shell obfuscation, and a command-string scanner defeated five distinct forms of it. Not because a prompt reminded the model to be careful. Because the bytes didn't match the fingerprint.

Hard resource caps, again in Leonard. Not "please send reasonable payloads." A 16 mebibyte ceiling on writes, a 1 mebibyte ceiling on snippets, a cap on how many edits one batch can carry. An oversized input can't sail through and blow up memory, because the ceiling is checked in code on the way in. The model never gets to decide whether a payload is too big. The number decides.

The safety contract, in bosun. bosun coordinates a bunch of parallel coding sessions on separate git worktrees, which is exactly the kind of tool that could do real damage if it got clever with your repo. So it's built around a short list of things it structurally cannot do. Only one command, the merge, is allowed to touch your main branch. It never pushes. It never fetches. It never talks to GitHub or any other forge. It never rewrites your global git config or your name and email. Those aren't instructions in a prompt telling the agent to be good. They're operations that aren't wired in. The agent can't push because there is no push for it to reach.

The through-line across all four: the safety does not depend on the model's good judgment on a given afternoon. It depends on what the architecture will and won't let happen. The model's job is to be useful inside a box whose walls it cannot push on, no matter what it reads, no matter how sure it sounds.

#Where this gets honest

Now the part that separates this from a tweet that says "just make it safe."

Structural safety is not "take away all the power." An agent that can't do anything is perfectly safe and perfectly useless. So the real work, the actual skill, is choosing where the wall goes. You have to define the envelope the agent is allowed to operate inside, and make its boundary a wall instead of a request. Picking that boundary well means you have to genuinely understand your threat surface, which is more upfront thinking than dashing off a system prompt. That's the cost. Pay it. The prompt was cheaper because it was doing less.

It also sidesteps a trap that's worth naming, because it's its own security hole. If you handle every risk by popping up "are you sure?", you train your users to rubber-stamp. Click yes, click yes, click yes, that's just the noise the tool makes. Now the day a genuinely dangerous confirmation appears, they approve it on reflex, because you taught them to. Confirmation fatigue isn't a UX annoyance, it actively erodes the safety it pretends to provide. (I went a few rounds on exactly this with another engineer in a thread, and we landed in the same place.) The structural approach mostly dodges it: safe operations are inside the envelope and need no confirmation, dangerous ones aren't reachable at all, so there's no parade of prompts teaching anyone to approve blindly.

And here's the limit, the place I won't pretend it reaches. Structural walls work beautifully on the clearly-dangerous capabilities. Pushing to a remote. Writing outside the project. Running an unauthorized command. Those are crisp. You can draw the line in code because the line is obvious. But a lot of the genuinely scary actions live in an ambiguous middle, where whether the thing is safe depends entirely on intent and context. Deleting a file is fine if it's a build artifact and a disaster if it's the only copy of something. Refactoring a function is fine unless that function is load-bearing in a way nobody documented. You can't structurally wall those off without also walling off the legitimate version, because the bytes look identical and only the context differs. That's exactly where you still want a human in the loop, and where intent-alignment work still earns its keep. Structural-first shrinks the surface that needs human judgment. It doesn't erase it. Anybody selling you a fully autonomous agent that's also totally safe is quietly skipping this paragraph.

So the shift is small to say and large to live by. Stop trying to make the agent trustworthy. Make the system safe whether or not the agent is trustworthy on the day it matters. Trust the architecture, not the model's good behavior on a Tuesday.

Which leaves me with the question I haven't answered for myself yet. The walls are the easy part once you know where to put them. The ambiguous middle is the hard part, and right now my only honest tool there is to stop and ask a person. So: how much of that middle can you actually drag back over to the structural side, by getting more careful about context, before you hit the irreducible core that really does need a human looking at it? I don't know where that floor is. I'd like to.


← all writing