
An MCP server for the multi-host Ollama setup

I run Ollama on two machines at home. A Ryzen 9 box with an RX 6700 XT does the heavy lifting — gemma4:26b lives there, along with a couple of vision models. A Mac Mini handles smaller stuff: embeddings, fast 8B prompts, anything I need on the LAN when the GPU box is busy.

Until this week, "managing" that setup meant SSH into one or both, run ollama ps, decide what to load, run ollama pull if I needed something new. Inside a conversation with Claude, I'd shell out to curl with hand-rolled JSON if I wanted to ask my local vision model about an image. It worked. It was annoying.

This week I made it not annoying. The result is ollama-mcp: a Go MCP server that wraps the Ollama HTTP API and lets Claude do the management directly. Sixteen tools, multi-host aware, configured by a YAML file that knows about my hosts.

# What it does

The tool surface, grouped by what the tool is for:

Model management (9 tools):

| Tool | What it does |
| --- | --- |
| list_models | What's installed on a host |
| running_models | What's loaded in VRAM right now (with size + expiration) |
| model_info | Quant level, parameters, template, license for one model (license text omitted by default: Ollama duplicates it into the Modelfile string and adds 20 KB of noise) |
| model_disk_usage | Bytes per family + grand total. "Do I have room to pull a 70B?" |
| pull_model | Install a model (long-running: pulls of big models take 5-15 minutes) |
| unload_model | Evict from VRAM without deleting (the keep_alive: 0 trick) |
| swap | Single-call evict-from + warm-load-to. Useful on single-GPU hosts |
| copy_model | Duplicate a model under a new name (for prompt-tuned variants) |
| delete_model | Remove from disk (gated by confirm: true) |

Inference (6 tools):

| Tool | What it does |
| --- | --- |
| generate | One-shot text completion |
| chat | Multi-turn conversation with role/content messages |
| vision | Multimodal: image + prompt → text. Image-path inputs gated by a magic-byte check |
| embed | Vector embedding (returns the actual vector + dimension) |
| benchmark | Time a generate call and report tokens/sec for prompt-eval and response-eval. Useful for "is this model faster on the GPU host or the Mac Mini?" |
| compare_models | Same prompt against several models, side-by-side response |
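The tokens/sec figures that benchmark reports can be derived from the timing fields Ollama returns with a completed /api/generate call (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, with durations in nanoseconds). A minimal sketch of that arithmetic follows; the type and function names are hypothetical, and the sample numbers are made up for illustration rather than measured.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateStats holds the timing fields from a final /api/generate response.
// Durations are reported in nanoseconds.
type generateStats struct {
	PromptEvalCount    int   `json:"prompt_eval_count"`
	PromptEvalDuration int64 `json:"prompt_eval_duration"`
	EvalCount          int   `json:"eval_count"`
	EvalDuration       int64 `json:"eval_duration"`
}

// tokensPerSec converts a token count and a nanosecond duration into a rate.
// It returns 0 when the duration is missing or zero.
func tokensPerSec(count int, durationNs int64) float64 {
	if durationNs <= 0 {
		return 0
	}
	return float64(count) / (float64(durationNs) / 1e9)
}

func main() {
	// Made-up numbers, for illustration only.
	raw := []byte(`{"prompt_eval_count":50,"prompt_eval_duration":250000000,` +
		`"eval_count":200,"eval_duration":4000000000}`)

	var s generateStats
	if err := json.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	fmt.Printf("prompt eval: %.1f tok/s\n", tokensPerSec(s.PromptEvalCount, s.PromptEvalDuration))
	fmt.Printf("response:    %.1f tok/s\n", tokensPerSec(s.EvalCount, s.EvalDuration))
}
```

Running the same prompt against two hosts and comparing the two rates is enough to answer the "GPU host or Mac Mini" question.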

Meta (1 tool):

| Tool | What it does |
| --- | --- |
| hosts | List configured hosts and per-host default models |

Every tool except hosts takes an optional host argument. Omit it and the host marked default: true in config is used. So Claude can either say "tell me what's loaded" (default host) or "tell me what's loaded on mini" (specific host) and both work.
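The host-resolution rule is small enough to sketch. This is an illustration under assumptions, not the repo's code: the Host type and resolveHost function are hypothetical names, and the sketch assumes the config has already been validated to contain exactly one default.

```go
package main

import "fmt"

// Host is a hypothetical in-memory form of one entry under "hosts:" in the config.
type Host struct {
	URL     string
	Default bool
}

// resolveHost returns the named host, or the one marked default when name is empty.
func resolveHost(hosts map[string]Host, name string) (Host, error) {
	if name != "" {
		h, ok := hosts[name]
		if !ok {
			return Host{}, fmt.Errorf("unknown host %q", name)
		}
		return h, nil
	}
	// With exactly one default, map iteration order does not affect the result.
	for _, h := range hosts {
		if h.Default {
			return h, nil
		}
	}
	return Host{}, fmt.Errorf("no host is marked default")
}

func main() {
	hosts := map[string]Host{
		"thor": {URL: "http://localhost:11434", Default: true},
		"mini": {URL: "http://192.168.1.50:11434"},
	}
	for _, name := range []string{"", "mini", "nope"} {
		h, err := resolveHost(hosts, name)
		fmt.Printf("host=%q url=%q err=%v\n", name, h.URL, err)
	}
}
```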

# Why an MCP and not a CLI

I already had a CLI for most of this — ollama itself plus a few wrappers. The friction wasn't doing the work, it was getting Claude to do the work for me without me having to remember the exact CLI invocation.

Before this MCP, asking Claude to look at a screenshot meant something like:

Run base64 -i screenshot.png > /tmp/img.b64, then curl http://localhost:11434/api/generate with a JSON body containing model gemma4:26b, the image as the images[0] field, and a prompt asking what's wrong with the typography.

After this MCP:

Look at /Users/me/screenshot.png and tell me what's wrong with the typography.

Claude calls vision image_path=..., gets back a description, weaves it into the conversation. The friction is gone.
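What the tool takes off my hands is the base64-and-JSON step from the manual version. A rough sketch of building that request body, assuming the non-streaming form of /api/generate with the image in the images array; visionRequest and buildVisionBody are hypothetical names, not the repo's.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// visionRequest mirrors the /api/generate fields used for an image prompt.
type visionRequest struct {
	Model  string   `json:"model"`
	Prompt string   `json:"prompt"`
	Images []string `json:"images"` // base64-encoded image bytes
	Stream bool     `json:"stream"`
}

// buildVisionBody base64-encodes the image and wraps it in a JSON request body.
func buildVisionBody(model, prompt string, image []byte) ([]byte, error) {
	return json.Marshal(visionRequest{
		Model:  model,
		Prompt: prompt,
		Images: []string{base64.StdEncoding.EncodeToString(image)},
		Stream: false,
	})
}

func main() {
	// Placeholder bytes stand in for a real screenshot.
	body, err := buildVisionBody("gemma4:26b", "What's wrong with the typography?", []byte("not a real image"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```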

The same shift applies to "is gemma4 hot or cold right now," "swap to qwen2.5-coder:14b," "pull a fresh vision model onto the workstation, I want to try it." The CLI gave me primitives. The MCP gave me a vocabulary that Claude already speaks.

# How it works

The implementation is unglamorous on purpose. Three packages:

The MCP layer uses the official MCP Go SDK. Each tool is a typed Go struct with jsonschema tags, and the SDK automatically generates the JSON Schema that Claude sees, so the tool surface stays in sync with the code without any manual schema-writing.
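To show where the schema information lives, here is a stdlib-only illustration. It does not use the SDK: VisionArgs is a hypothetical input type, the exact jsonschema tag format is an assumption, and describeFields just reads the tags with reflect, which is a much smaller job than real schema generation.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// VisionArgs is a hypothetical input struct for a vision tool.
// The json tag names the argument; the jsonschema tag carries its description.
type VisionArgs struct {
	ImagePath string `json:"image_path" jsonschema:"absolute path to a PNG, JPEG, WebP, or GIF file"`
	Prompt    string `json:"prompt" jsonschema:"question to ask about the image"`
	Host      string `json:"host,omitempty" jsonschema:"configured host name; empty means the default host"`
}

// describeFields returns one "name: description" line per struct field,
// read from the struct tags. v must be a struct value.
func describeFields(v any) []string {
	t := reflect.TypeOf(v)
	var out []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, _, _ := strings.Cut(f.Tag.Get("json"), ",")
		out = append(out, name+": "+f.Tag.Get("jsonschema"))
	}
	return out
}

func main() {
	for _, line := range describeFields(VisionArgs{}) {
		fmt.Println(line)
	}
}
```

Because the descriptions sit next to the fields they describe, renaming or adding an argument changes the advertised schema in the same commit.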

The server speaks JSON-RPC over stdin/stdout. There's no HTTP listener and no auth, because there doesn't need to be — the only client is the Claude Code process that spawned it. Anything beyond that lives in the network layer (Ollama's own bind address, my LAN firewall).
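The SDK handles the transport, but the shape of it is simple: one JSON-RPC message per line in, one per line out. A toy sketch of that loop, not the SDK's implementation; serve and run are hypothetical names, and only MCP's ping method is answered.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type request struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Method  string          `json:"method"`
}

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type response struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   *rpcError       `json:"error,omitempty"`
}

// serve reads newline-delimited JSON-RPC requests from in and writes
// one response line per request to out.
func serve(in io.Reader, out io.Writer) error {
	sc := bufio.NewScanner(in)
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20) // allow lines up to 1 MiB
	enc := json.NewEncoder(out)
	for sc.Scan() {
		var req request
		if err := json.Unmarshal(sc.Bytes(), &req); err != nil {
			continue // this sketch simply skips malformed lines
		}
		if len(req.ID) == 0 {
			continue // notifications carry no id and get no response
		}
		resp := response{JSONRPC: "2.0", ID: req.ID}
		switch req.Method {
		case "ping":
			resp.Result = map[string]any{}
		default:
			resp.Error = &rpcError{Code: -32601, Message: "method not found"}
		}
		if err := enc.Encode(resp); err != nil {
			return err
		}
	}
	return sc.Err()
}

// run feeds a string through serve; a real server would pass os.Stdin and os.Stdout.
func run(input string) string {
	var out strings.Builder
	if err := serve(strings.NewReader(input), &out); err != nil {
		return "error: " + err.Error()
	}
	return out.String()
}

func main() {
	fmt.Print(run(`{"jsonrpc":"2.0","id":1,"method":"ping"}` + "\n"))
}
```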

# Two design decisions worth calling out

unload_model uses a hack. Ollama doesn't expose a direct "unload" endpoint, but POSTing to /api/generate with keep_alive: 0 makes the server evict the model immediately. I wrapped that as a tool because it's a real workflow — when you're swapping between a 14B coder and a 26B generalist on a single 12GB GPU, you want explicit eviction, not an implicit "wait for the model's keep-alive to expire."
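The eviction trick amounts to a generate request with no prompt and keep_alive set to 0. A minimal sketch, with hypothetical names (unloadRequest, unloadBody, unloadModel) and error handling kept short:

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// unloadRequest is a generate request with no prompt; keep_alive: 0
// asks Ollama to evict the model right away.
type unloadRequest struct {
	Model     string `json:"model"`
	KeepAlive int    `json:"keep_alive"`
}

// unloadBody builds the JSON body for the eviction request.
func unloadBody(model string) ([]byte, error) {
	return json.Marshal(unloadRequest{Model: model, KeepAlive: 0})
}

// unloadModel POSTs the eviction request to baseURL + "/api/generate".
func unloadModel(ctx context.Context, client *http.Client, baseURL, model string) error {
	body, err := unloadBody(model)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/api/generate", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unload %s: %s", model, resp.Status)
	}
	return nil
}

func main() {
	// Only the body is printed here; no request is sent.
	body, err := unloadBody("gemma4:26b")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```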

delete_model requires confirm: true. This is the only destructive tool in the set. The schema marks confirm as required, so Claude has to pass it explicitly — there's no path where a vague prompt accidentally deletes a model. It's a small thing but it's the only piece of "you're about to do something irreversible" friction worth adding.
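One detail worth spelling out: a required field in the schema only guarantees that confirm is present, not that it is true, so the handler still has to check the value. A sketch of that gate, with a hypothetical DeleteArgs type and checkDelete function:

```go
package main

import (
	"errors"
	"fmt"
)

// DeleteArgs is a hypothetical input struct for delete_model.
type DeleteArgs struct {
	Model   string `json:"model"`
	Confirm bool   `json:"confirm"`
}

var errNotConfirmed = errors.New("delete_model requires confirm: true")

// checkDelete refuses the call unless a model is named and confirm is true.
// It runs before any request is sent to Ollama.
func checkDelete(a DeleteArgs) error {
	if a.Model == "" {
		return errors.New("model is required")
	}
	if !a.Confirm {
		return errNotConfirmed
	}
	return nil
}

func main() {
	fmt.Println(checkDelete(DeleteArgs{Model: "old-model:7b"}))
	fmt.Println(checkDelete(DeleteArgs{Model: "old-model:7b", Confirm: true}))
}
```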

# Wiring it up

Config goes at ~/.config/ollama-mcp/config.yaml:

hosts:
  thor:
    url: http://localhost:11434
    default: true
  mini:
    url: http://192.168.1.50:11434

default_text_model: gemma4:26b
default_vision_model: gemma4:26b
default_embed_model: mxbai-embed-large

And the Claude Code registration:

claude mcp add -s user ollama /path/to/ollama-mcp/bin/ollama-mcp

Or in Claude Desktop's claude_desktop_config.json, under the mcpServers key:

"ollama": {
  "command": "/path/to/ollama-mcp/bin/ollama-mcp"
}

The -s user flag registers the server at user scope, so it's available in every Claude Code project, not just one. That matters because Ollama isn't a per-project thing.

# What shipped after v0.1, and a security pass at v0.6

A lot of the original "what's next" list landed:

What didn't ship: streaming generate. The MCP-SSE path is interesting but the value-add is marginal for inference where the whole response is the artifact. Skipping for now.

v0.6 was a security audit of the whole repo, with a few findings worth calling out because the patterns generalize:

The vision tool was an arbitrary local-file-read. vision image_path=... accepted any path and base64-encoded the bytes for upload to Ollama. I'd been thinking of it as "give Claude a screenshot to look at." A malicious caller could think of it as "exfiltrate ~/.ssh/id_rsa to the configured Ollama host." The fix is a magic-byte gate: read the first 12 bytes, check for PNG/JPEG/WebP/GIF signatures, refuse anything else. Ollama vision models only accept those formats anyway, so the gate doesn't lose legitimate use.
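A sketch of such a gate, using the standard file signatures for the four formats. The function names are hypothetical and this is not the repo's code. WebP is the reason 12 bytes are needed: its signature is "RIFF" at offset 0 and "WEBP" at offset 8.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
)

// isSupportedImage reports whether header starts with a PNG, JPEG, GIF,
// or WebP signature.
func isSupportedImage(header []byte) bool {
	switch {
	case bytes.HasPrefix(header, []byte("\x89PNG\r\n\x1a\n")):
		return true
	case bytes.HasPrefix(header, []byte{0xFF, 0xD8, 0xFF}): // JPEG
		return true
	case bytes.HasPrefix(header, []byte("GIF87a")), bytes.HasPrefix(header, []byte("GIF89a")):
		return true
	case len(header) >= 12 && string(header[0:4]) == "RIFF" && string(header[8:12]) == "WEBP":
		return true
	}
	return false
}

// checkImageFile reads at most the first 12 bytes of path and rejects the
// file unless they match a supported image signature.
func checkImageFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	header := make([]byte, 12)
	n, err := io.ReadFull(f, header)
	// Short files are fine to inspect; they just won't match a signature.
	if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
		return err
	}
	if !isSupportedImage(header[:n]) {
		return fmt.Errorf("%s: not a PNG, JPEG, WebP, or GIF file", path)
	}
	return nil
}

func main() {
	fmt.Println(isSupportedImage([]byte("\x89PNG\r\n\x1a\n\x00\x00\x00\x0d")))
	fmt.Println(isSupportedImage([]byte("-----BEGIN OPENSSH PRIVATE KEY-----")))
}
```

The check runs before any base64 encoding, so a rejected file never leaves the machine.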

The HTTP client had no body cap. io.ReadAll(resp.Body) against a 10 GB response would have happily eaten the MCP's memory. The fix is a 50 MB cap with an explicit error on overflow.
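One common way to get a cap with an explicit overflow error is to read one byte past the limit through io.LimitReader and treat any extra byte as a failure. A sketch under that assumption; readCapped, readN, and errBodyTooLarge are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

const maxBody = 50 << 20 // 50 MB

var errBodyTooLarge = errors.New("response body exceeds cap")

// readCapped reads at most limit bytes from r. Reading limit+1 bytes lets it
// tell "exactly at the cap" apart from "over the cap" without buffering more.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errBodyTooLarge
	}
	return data, nil
}

// readN is a demo helper: it runs readCapped over an n-byte body.
func readN(n int, limit int64) (int, error) {
	data, err := readCapped(strings.NewReader(strings.Repeat("x", n)), limit)
	return len(data), err
}

func main() {
	fmt.Println(readN(10, 16))
	fmt.Println(readN(17, 16))
	fmt.Println("production cap in bytes:", maxBody)
}
```

In the client, the reader would be resp.Body and the limit would be maxBody.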

The HTTP client followed redirects. The Ollama URL is user-supplied via config; a redirect from a typo'd URL could land us at internal services. Setting CheckRedirect to a function that returns http.ErrUseLastResponse shuts that path.
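In net/http terms, that is a client whose CheckRedirect hook returns http.ErrUseLastResponse, which hands the 3xx response back to the caller instead of following it. The sketch below checks the behavior against a throwaway local test server; newClient and observedStatus are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newClient returns an HTTP client that never follows redirects.
func newClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
}

// observedStatus starts a local server that redirects /api/tags elsewhere
// and returns the status code the client actually sees.
func observedStatus() (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/api/tags" {
			http.Redirect(w, r, "/internal", http.StatusFound)
			return
		}
		fmt.Fprint(w, "followed") // only reachable if the redirect is followed
	}))
	defer srv.Close()

	resp, err := newClient().Get(srv.URL + "/api/tags")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := observedStatus()
	fmt.Println(code, err)
}
```

The caller then sees a non-200 status and can report it as an error rather than silently talking to a different server.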

Per-request timeouts replaced a single 30-min http.Client.Timeout. The old version forced every operation into the same budget. The new version uses ctx deadlines: 60s default for most calls, 30 min for pull_model only. The caller's existing ctx deadline is always respected.
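A sketch of that pattern, with hypothetical names (timeoutFor, withToolDeadline). It leans on a property of the standard library: context.WithTimeout never extends a deadline the parent context already has, so a shorter caller deadline still wins.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	defaultTimeout = 60 * time.Second
	pullTimeout    = 30 * time.Minute
)

// timeoutFor returns the time budget for a tool call.
func timeoutFor(tool string) time.Duration {
	if tool == "pull_model" {
		return pullTimeout
	}
	return defaultTimeout
}

// withToolDeadline derives a per-request context from the caller's context.
// If the caller's deadline is earlier, it stays in effect.
func withToolDeadline(ctx context.Context, tool string) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, timeoutFor(tool))
}

func main() {
	// The caller already has a 5s budget; the 60s default must not extend it.
	parent, cancelParent := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelParent()

	ctx, cancel := withToolDeadline(parent, "chat")
	defer cancel()

	deadline, _ := ctx.Deadline()
	fmt.Println("chat budget under a 5s caller deadline:", time.Until(deadline).Round(time.Second))
	fmt.Println("pull_model budget:", timeoutFor("pull_model"))
}
```

The derived context then goes into http.NewRequestWithContext, and http.Client.Timeout stays unset.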

Config now rejects non-deterministic multi-host setups. With multiple hosts and no explicit default: true, the resolver picked whichever host Go's map iteration happened to return first, so tool calls landed on different hosts from run to run. The validator now rejects this at config-load.
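The validation rule can be sketched as a check that runs once at load time. Two parts of this sketch are assumptions beyond what's described above: it lets a lone host through without a default, and it rejects more than one default. Host and validateHosts are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Host is a hypothetical in-memory form of one config entry.
type Host struct {
	URL     string
	Default bool
}

// validateHosts rejects configs where the default host would be ambiguous.
func validateHosts(hosts map[string]Host) error {
	if len(hosts) == 0 {
		return fmt.Errorf("config has no hosts")
	}
	var defaults []string
	for name, h := range hosts {
		if h.Default {
			defaults = append(defaults, name)
		}
	}
	sort.Strings(defaults) // stable error text regardless of map order

	switch {
	case len(defaults) == 1:
		return nil
	case len(defaults) > 1:
		// Assumption: more than one default is also treated as an error.
		return fmt.Errorf("multiple hosts marked default: %s", strings.Join(defaults, ", "))
	case len(hosts) == 1:
		// Assumption: a single host is unambiguous even without default: true.
		return nil
	default:
		return fmt.Errorf("%d hosts configured but none marked default: true", len(hosts))
	}
}

func main() {
	ambiguous := map[string]Host{
		"thor": {URL: "http://localhost:11434"},
		"mini": {URL: "http://192.168.1.50:11434"},
	}
	fmt.Println(validateHosts(ambiguous))
}
```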

These all became part of a checklist I now run against every Go HTTP client. The full list and the test-the-contract discipline are in a separate post on the audit methodology.

The repo is at github.com/jasondillingham/ollama-mcp. MIT licensed. Public as of this week.

The bigger lesson: most of my homelab tooling was built as CLIs because that's what I knew. Wrapping it as MCP changes who the tool is for. The CLI was for me; the MCP is for the LLM that's helping me. Different audiences, different ergonomics, same underlying logic.

