Building an AP workflow tool when the state already lives in Outlook

2026-04-30 · ~11 min read · go · ai · ops

The first real decision in Dispatch wasn't a feature. It was a refusal.

The accounts-payable team I was building for processes a few thousand invoice emails a month across roughly twenty Outlook shared mailboxes. The work is repetitive: figure out which vendor sent it, look up the PO in the ERP, decide if it's reconcilable or disputed, hand it to whoever owns that vendor relationship, and remember to chase it if the vendor takes too long to answer. By the time a message reaches "done," it's been touched by 3-4 people across an unknown number of clicks.

The problem looks like a workflow tool problem. What sets it apart from the problem a vendor's workflow tool solves is that the AP team already has a workflow surface, Outlook, that they will continue to use whether you build something or not. Email is where invoices arrive. Email is where vendors reply. Email is where the AP department's institutional memory lives in years of threads and forwards and "see below."

The vendors who sell AP automation know about Outlook. They mostly want to replace it. Their pitch is: stop using Outlook for AP, use our portal instead, here's an inbox in our product. That works at scale and it's defensibly the right answer when you're large enough to retrain a 50-person AP team. At the size I was building for, it was the wrong answer — the team would still use Outlook for everything else and now have a second tool for the half that the vendor cared about. Splitting their attention is worse than the original problem.

So the design constraint became: the AP team continues to live in Outlook. Whatever Dispatch is, it is not an Outlook replacement.

That constraint cascaded into everything else.

#Outlook Categories as the workflow database

Outlook Categories are the colored tag pills you can drag onto a message. They sync via Exchange. Every reply, every forward, every share carries them. They render natively in every Outlook client — desktop, web, mobile. Most users have used them, casually, for years.

What Outlook Categories also are, structurally, is a per-message list of strings that Exchange stores and Microsoft Graph exposes as the message's categories property. You can read them. You can write them. They survive every Outlook UI surface unchanged.

The question I ran into early was: should Dispatch have its own database for workflow state — owner, status, blocker, follow-up timer — or should it use Outlook Categories as the storage layer?

A real database would have been the conventional answer. I chose Outlook Categories. The reasoning:

No drift. If the workflow state is in two places, those places will diverge. Some clerk will close a message in Outlook and Dispatch will keep showing it as open. Some clerk will assign in Dispatch and Outlook will keep showing it unassigned. Two-way sync is famously the hardest reliability problem in this kind of tool. The cleanest answer is: don't have two places.
Clerks already know how to use them. "Drag this category onto the message" is a five-second training. "Open Dispatch, find the message, click Assign" is the cognitive overhead I was trying to remove.
Survives the user using Outlook normally. A clerk forwarding a message to a vendor doesn't have to open Dispatch first. Dispatch just notices the new state on next poll.
State is portable. If Dispatch goes away tomorrow, the workflow data doesn't go with it. Categories stay on the messages. The team loses the dashboard but not the history.

So the data model is: Dispatch polls every shared mailbox through Microsoft Graph on a 60-second cadence, pulls each message's categories (Owner: alice, Status: New, Vendor: ACME, PO: 12345, Kind: Invoice), parses them as structured fields into a SQLite cache, and renders them in the Dispatch UI. When a clerk acts in Dispatch (assigns, holds, marks done), Dispatch writes the new categories back through Graph. Outlook is the source of truth; SQLite is a read-optimized projection of it.
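The parsing step can be sketched roughly as follows. This is a minimal illustration, not Dispatch's actual code: the Workflow type, the ParseCategories name, and the rule that unrecognized categories are preserved rather than dropped are all assumptions made for the example.

```go
package main

import "strings"

// Workflow holds the structured fields parsed from one message's
// categories. The field set mirrors the example categories in the text;
// the type itself is hypothetical.
type Workflow struct {
	Owner, Status, Vendor, PO, Kind string
	Other                           []string // categories that are not workflow tags
}

// ParseCategories turns category strings such as "Owner: alice" into a
// Workflow. Anything that is not a recognized "Key: value" pair is kept in
// Other so that a clerk's personal categories are never lost.
func ParseCategories(cats []string) Workflow {
	var w Workflow
	for _, c := range cats {
		key, val, ok := strings.Cut(c, ":")
		val = strings.TrimSpace(val)
		if !ok || val == "" {
			w.Other = append(w.Other, c)
			continue
		}
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "owner":
			w.Owner = val
		case "status":
			w.Status = val
		case "vendor":
			w.Vendor = val
		case "po":
			w.PO = val
		case "kind":
			w.Kind = val
		default:
			w.Other = append(w.Other, c)
		}
	}
	return w
}
```

Because the parse is a pure function of the category list, rebuilding the cache from Outlook is just a matter of re-running it over every message.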

The cache exists for two reasons: full-text search across thousands of messages would be unaffordably slow against Graph live, and the dashboard needs joinable structured data (kind distribution, vendor counts, dispute rates) that doesn't make sense to compute on every page render. But every cached row corresponds to a real category set on a real Outlook message. If you delete the cache and rebuild it, the workflow state survives.

This decision is the thing the rest of the system is shaped by. Most of the design choices below only make sense in the context of "Outlook is the database."

#The classifier: rules first, AI second

Once messages are flowing in, every one needs a Kind tag — Invoice, Statement, Order Confirmation, Payment, Credit, Dispute, Marketing, Newsletter, Internal, Automation. The Kind tag drives filtering, sorting, and the analytics page.

The naive answer is to send every message to an LLM. "Classify this email's Kind." The model is a local Ollama instance running a small Gemma variant on a consumer GPU, and it does this well — about 1.5 seconds per message warm.

The problem is that most messages don't need an LLM. The subject line says "Invoice #12345." The sender is donotreply@vendor.com. The body has "remittance advice" twice. A regex catches it in microseconds and is more confident than the LLM is.

So the classifier is a two-stage pipeline:

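// Classify tags a message with a Kind. The deterministic rules run first;
// the LLM is consulted only when the rules decline to answer.
func Classify(ctx context.Context, msg Message) (Kind, error) {
    // Fast path: the rules return "" whenever there is any doubt.
    if k := aiclass.DeterministicKind(msg.Subject, msg.From, msg.Body); k != "" {
        return k, nil
    }
    // Slow path: fall back to the local model for the long tail.
    return aiclass.LLMKind(ctx, msg)
}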

DeterministicKind is a hand-tuned set of high-precision Go rules. Subject keyword matches, sender domain matches, body phrase matches, attachment filename patterns. Conservative — it returns empty if there's any doubt. It catches roughly 60% of mail at zero marginal cost.
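To make "conservative" concrete, here is a hedged sketch of what a rule set in this style might look like. The two rules, the patterns, and the deterministicKind name are invented for illustration and are not the real rule set; the point is only the shape: each rule votes, and anything other than exactly one vote returns empty.

```go
package main

import (
	"regexp"
	"strings"
)

type Kind string

const (
	KindInvoice   Kind = "Invoice"
	KindStatement Kind = "Statement"
)

// Hypothetical high-precision pattern: "Invoice #12345", "invoice 987", ...
var invoiceSubject = regexp.MustCompile(`(?i)\binvoice\s*#?\s*\d+`)

// deterministicKind returns a Kind only when exactly one rule fires.
// Zero votes or conflicting votes both return "", which hands the message
// to the slower fallback classifier.
func deterministicKind(subject, body string) Kind {
	votes := map[Kind]bool{}
	if invoiceSubject.MatchString(subject) {
		votes[KindInvoice] = true
	}
	if strings.Contains(strings.ToLower(body), "statement of account") {
		votes[KindStatement] = true
	}
	if len(votes) != 1 {
		return ""
	}
	for k := range votes {
		return k
	}
	return ""
}
```

A message with subject "Invoice #12345" is tagged immediately; "Hi, attached is your statement for last month" matches nothing and falls through to the model.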

The remaining 40% goes to the LLM. Things that don't have obvious lexical signals — "Hi, attached is your statement for last month" with no subject keywords, vendor newsletters that disguise themselves as invoices, messages that genuinely span multiple kinds.

The mental model that makes this design click is: the LLM is the slow path you fall back to when the cheap path can't decide. Not the primary classifier with rules as backup. The rules ARE the primary; the LLM is the long-tail fallback. Same shape as a CDN cache — fast path for hot keys, full path for cold keys.

The performance is meaningful but not the headline. The headline is that the rules are the documentable classifier — every clerk can read them, debate them, propose new ones. The LLM is the unauditable fallback. Putting the rules first means most decisions are explainable; only the hard ones rely on a model whose reasoning you can't inspect.

#The vendor resolver

Every incoming message also gets a Vendor: category — the ERP record the message belongs to. This is harder than the Kind classifier because vendor identity in an ERP is structurally messy. Many vendors have multiple sub-accounts. Many sub-accounts share a parent domain. Many email addresses come from third-party billing platforms (accounts@billing-portal.com) that route for thousands of vendors.

The resolver is a layered match:

1. Exact PO override. If the message has an obvious PO number in the subject or body, look up the PO in the ERP and use the vendor on that PO. PO match is the single highest-precision signal and overrides everything else.
2. Email-to-vendor exact match. If the sender's email is in the ERP's contact records, match directly.
3. Domain-to-vendor unique match. If the sender's domain belongs to exactly one vendor in the ERP, match.
4. Brand match for ambiguous corporate domains. If the sender's domain is a known corporate parent with multiple sub-accounts, emit the parent brand as the vendor (no specific sub-account ID). The clerk disambiguates manually.
5. Unknown. Generic domains (gmail, outlook, the billing portals) and senders not in the system fall back to Unknown. The clerk sets the vendor manually; their action retrains the resolver for next time.
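The layering above can be sketched as a chain of early returns. In this illustration the Directory type is a hypothetical in-memory stand-in for the ERP lookups, the PO number is assumed to have been extracted already, and the generic-domain check is placed ahead of the domain layers as one possible way to keep billing portals from ever matching by domain.

```go
package main

import "strings"

// Match records which vendor was chosen and which layer decided.
type Match struct {
	Vendor string
	Layer  string
}

// Directory is a hypothetical in-memory stand-in for the ERP lookups.
type Directory struct {
	POVendor      map[string]string   // PO number -> vendor
	EmailVendor   map[string]string   // contact email -> vendor
	DomainVendors map[string][]string // sender domain -> vendors using it
	ParentBrand   map[string]string   // corporate parent domain -> brand
	Generic       map[string]bool     // gmail.com, billing portals, ...
}

// Resolve applies the layers in priority order and stops at the first hit.
func (d Directory) Resolve(senderEmail, po string) Match {
	unknown := Match{Vendor: "Unknown", Layer: "unknown"}
	email := strings.ToLower(strings.TrimSpace(senderEmail))

	if v, ok := d.POVendor[po]; ok { // 1. PO overrides everything else
		return Match{Vendor: v, Layer: "po"}
	}
	if v, ok := d.EmailVendor[email]; ok { // 2. exact contact email
		return Match{Vendor: v, Layer: "email"}
	}
	_, domain, found := strings.Cut(email, "@")
	if !found || d.Generic[domain] { // generic domains never match by domain
		return unknown
	}
	if vs := d.DomainVendors[domain]; len(vs) == 1 { // 3. unique domain
		return Match{Vendor: vs[0], Layer: "domain"}
	}
	if brand, ok := d.ParentBrand[domain]; ok { // 4. parent brand only
		return Match{Vendor: brand, Layer: "brand"}
	}
	return unknown // 5. clerk sets the vendor manually
}
```

Recording the deciding layer alongside the vendor is a small addition that makes wrong matches easier to trace back to the rule that produced them.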

The brand-match step (#4) was the trickiest. The first version did substring matching: "if the sender's domain label appears in any vendor name, emit it." That worked for the obvious cases and immediately introduced a class of bugs you only discover when you start using the system on real data: a domain like acme.com would match a vendor named Bacme Inc because "acme" is a substring of "bacme." No real vendor master is small enough to avoid these collisions; the question is just when you notice.

The fix is word-token matching with regression tests pinning the failure case. "acme" matches "ACME Industrial" but does not match "Bacme Inc" is now a test. "arrow" matches "Arrow Electronics" but does not match "Harrow Tools" is also a test. The pattern "label as substring" is almost always wrong when matching company names; default to "label as word" and write the regression test the moment you see the first false positive.
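A minimal sketch of word-token matching, assuming tokens are split on anything that is not a letter or digit and compared case-insensitively. The function names are illustrative, not the ones in Dispatch.

```go
package main

import (
	"strings"
	"unicode"
)

// tokens lowercases a company name and splits it into words on every
// character that is not a letter or a digit.
func tokens(name string) []string {
	return strings.FieldsFunc(strings.ToLower(name), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// labelMatchesVendor reports whether the domain label equals a whole word
// of the vendor name. A substring hit inside a longer word does not count.
func labelMatchesVendor(label, vendorName string) bool {
	label = strings.ToLower(strings.TrimSpace(label))
	if label == "" {
		return false
	}
	for _, t := range tokens(vendorName) {
		if t == label {
			return true
		}
	}
	return false
}
```

With this, "acme" matches "ACME Industrial" but not "Bacme Inc", and "arrow" matches "Arrow Electronics" but not "Harrow Tools", which are exactly the pairs worth pinning in a table-driven test.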

#The clerk's view

The actual UI a clerk uses is built around four buckets, not three:

Unassigned — pickup pile, owner is empty
Todo — owner is the current clerk, the work they're actually doing
Waiting — they've raised a blocker (vendor needs to respond, PO needs to be created, pricing needs to be approved)
Done — final state

The original three-bucket version (Todo / Waiting / Done) lacked the Unassigned column and wedged unassigned messages into Todo with an empty-owner indicator. That worked technically and was confusing in practice. Pickup pile and active work look different to clerks. Splitting them clarified the mental model.
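Expressed as code, the four-bucket split is a small pure function over the owner and status fields. The status strings "Done" and "Waiting" are assumed values chosen for this sketch; the text only confirms "New".

```go
package main

type Bucket string

const (
	Unassigned Bucket = "Unassigned"
	Todo       Bucket = "Todo"
	Waiting    Bucket = "Waiting"
	Done       Bucket = "Done"
)

// bucketFor derives the display bucket from the workflow fields.
// Terminal and blocked states win over ownership; an empty owner on an
// otherwise active message puts it on the pickup pile.
func bucketFor(owner, status string) Bucket {
	switch {
	case status == "Done":
		return Done
	case status == "Waiting":
		return Waiting
	case owner == "":
		return Unassigned
	default:
		return Todo
	}
}
```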

A few smaller design choices that turn out to matter more than they look:

View-as-clerk. A read-only mode where one clerk can see another clerk's queue, with the decision bar hidden and keyboard shortcuts disabled. Clerks need to glance at what their teammates are working on without impersonating them. Adding a ?view=<clerk> query param made this a five-line UI change.
Side-by-side PDF + email. When the message has an attachment, the PDF goes sticky on the left at full viewport height. The right column scrolls with headline / totals / reconciliation mismatch / collapsible body / notes / action buttons. Clerks don't have to scroll the PDF away to read the recon details.
Follow-up timer. When a clerk holds a message with a vendor blocker, the system stamps a followup_at timestamp 72 hours out. PO blockers get 48 hours. Pricing blockers get 24. Won't-Pay never resurfaces. A 60-second background sweeper finds expired timestamps and resurfaces the messages back to Todo with breadcrumb context. This replaces the old reality where a held message died in Waiting until someone manually checked.
Keyboard-first navigation. Up/down/enter/esc in every picker. Clerks process hundreds of messages a day; mouse latency is a multiplier on their fatigue.
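The follow-up timer from the list above reduces to a lookup table plus a sweep. The sketch below uses the delays stated in the text; the type and function names, and the convention that a zero timestamp means "never resurface", are assumptions for illustration.

```go
package main

import "time"

type Blocker string

const (
	BlockerVendor  Blocker = "vendor"
	BlockerPO      Blocker = "po"
	BlockerPricing Blocker = "pricing"
	BlockerWontPay Blocker = "wont-pay"
)

// followupDelay returns how long a held message waits before resurfacing.
// ok is false for blockers that never resurface.
func followupDelay(b Blocker) (d time.Duration, ok bool) {
	switch b {
	case BlockerVendor:
		return 72 * time.Hour, true
	case BlockerPO:
		return 48 * time.Hour, true
	case BlockerPricing:
		return 24 * time.Hour, true
	}
	return 0, false
}

// Held is a message parked in Waiting. A zero FollowupAt means the
// message never resurfaces on its own.
type Held struct {
	ID         string
	FollowupAt time.Time
}

// hold stamps the follow-up time at the moment the clerk raises a blocker.
func hold(id string, b Blocker, now time.Time) Held {
	h := Held{ID: id}
	if d, ok := followupDelay(b); ok {
		h.FollowupAt = now.Add(d)
	}
	return h
}

// sweep returns the IDs whose follow-up time has passed. A background
// ticker would call this periodically and move the results back to Todo.
func sweep(held []Held, now time.Time) []string {
	var due []string
	for _, h := range held {
		if !h.FollowupAt.IsZero() && !now.Before(h.FollowupAt) {
			due = append(due, h.ID)
		}
	}
	return due
}
```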

None of these are LLM-driven features. They're the low-glamour usability work that makes the difference between "the tool exists" and "the tool gets used."

#Vision extraction with verify-fallback

Some workflows need information out of the PDF, not just metadata about it. The recon mismatch column needs vendor invoice totals; the dispute view needs the vendor's stated reason. PDFs and embedded images go through a vision-capable LLM (a different small Ollama model on the same GPU host) that returns structured JSON: {vendor_total, line_items, dispute_reason, ...}.

Vision extraction is expensive per message — multiple seconds, sometimes ten. It can't run on every message synchronously. The worker pool pattern is: 8 sort goroutines triaging incoming mail into kinds, 4 extract goroutines doing vision extraction on PDFs that need it, and 3 fallback goroutines that re-run extraction with a larger verifier model when the first pass returns low-confidence output.
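One way to wire three pools of that shape together is a chain of channel stages, each with its own goroutine count. This is a sketch under assumptions: the Msg type, the stage helper, and the convention that a stage function returns true to pass the message on to the next stage are all invented here. Only the 8 / 4 / 3 split comes from the text.

```go
package main

import "sync"

// Msg is a hypothetical stand-in for a cached message.
type Msg struct {
	ID     string
	HasPDF bool
}

// stage starts n goroutines that drain in, call work on each message, and
// forward the message to the returned channel when work reports true.
// The output channel is closed once every worker has finished.
func stage(n int, in <-chan Msg, work func(Msg) bool) <-chan Msg {
	out := make(chan Msg)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range in {
				if work(m) {
					out <- m
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// runPipeline pushes msgs through sort -> extract -> verify and returns how
// many messages the verifier accepted. sortFn reports whether a message
// needs extraction; extractFn reports whether its result was low-confidence.
func runPipeline(msgs []Msg, sortFn, extractFn, verifyFn func(Msg) bool) int {
	in := make(chan Msg)
	go func() {
		for _, m := range msgs {
			in <- m
		}
		close(in)
	}()
	needsExtract := stage(8, in, sortFn)
	lowConfidence := stage(4, needsExtract, extractFn)
	verified := stage(3, lowConfidence, verifyFn)

	n := 0
	for range verified {
		n++
	}
	return n
}
```

Because each stage only forwards what the next one needs, the slow pools stay small without starving the fast one.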

The two-tier model approach matters. The fast model gets the easy 90% right. When it returns a result that looks suspect (vendor_total: null, line items that don't sum, a value field with unexpected JSON shape), the fallback re-runs the same extraction against a larger model with more context. The verifier is slow but rare. The economics work because the fast model handles most messages and the verifier only runs on the unsure ones.
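The "looks suspect" decision can be a plain function over the parsed result. The sketch below covers two of the checks named above (a missing total and line items that don't sum) and assumes amounts are held as integer cents; the types, the Extractor signature, and the choice to also fall back when the fast model errors are hypothetical.

```go
package main

// LineItem and Extraction are hypothetical shapes for the parsed output.
type LineItem struct {
	Description string
	AmountCents int64
}

type Extraction struct {
	VendorTotalCents *int64 // nil when the model returned null
	LineItems        []LineItem
}

// Extractor runs one model over a document.
type Extractor func(pdf []byte) (Extraction, error)

// looksSuspect flags results that should be re-run on the larger model:
// a missing total, or line items that do not add up to the stated total.
func looksSuspect(e Extraction) bool {
	if e.VendorTotalCents == nil {
		return true
	}
	if len(e.LineItems) == 0 {
		return false // nothing to cross-check against
	}
	var sum int64
	for _, li := range e.LineItems {
		sum += li.AmountCents
	}
	return sum != *e.VendorTotalCents
}

// extractWithFallback tries the fast model first and only pays for the
// verifier when the fast result errors or looks suspect. The bool reports
// whether the verifier ran.
func extractWithFallback(pdf []byte, fast, verifier Extractor) (Extraction, bool, error) {
	e, err := fast(pdf)
	if err == nil && !looksSuspect(e) {
		return e, false, nil
	}
	e, err = verifier(pdf)
	return e, true, err
}
```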

A separate observation that's tangential to AP but worth flagging for anyone building similar pipelines: AI models output type-loose JSON. Sometimes "status": "match", sometimes "status": 1, sometimes both keys at the top level with conflicting values, sometimes truncated JSON because the response hit max-tokens mid-object. The verify pipeline tolerates all of these — type coercion on known fields, last-write-wins on duplicate top-level keys, a JSON repair step that rebuilds truncated structures by counting brace nesting. Your AI extraction pipeline will encounter every variant of broken JSON; you cannot trust the model to produce clean output across thousands of calls.
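As a rough illustration of those three tolerances in Go: a field type that accepts either a string or a bare scalar, encoding/json's existing behavior of letting the last duplicate key win, and a brace-counting repair pass. The names are invented for this sketch, and the repair is deliberately limited: it closes open strings, arrays, and objects and drops one trailing comma, but it does not rescue output cut off in the middle of a key or right after a colon.

```go
package main

import (
	"encoding/json"
	"strings"
)

// LooseString accepts a JSON string or a bare scalar (number, bool) and
// stores its text form, so "status": "match" and "status": 1 both decode.
type LooseString string

func (s *LooseString) UnmarshalJSON(b []byte) error {
	if string(b) == "null" {
		return nil // conventional no-op for null
	}
	if len(b) > 0 && b[0] == '"' {
		var str string
		if err := json.Unmarshal(b, &str); err != nil {
			return err
		}
		*s = LooseString(str)
		return nil
	}
	*s = LooseString(b)
	return nil
}

// repairTruncated appends whatever closers are still open when the text
// ends, tracking string and escape state so braces inside strings are
// ignored.
func repairTruncated(s string) string {
	var stack []byte
	inString, escaped := false, false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if inString {
			switch {
			case escaped:
				escaped = false
			case c == '\\':
				escaped = true
			case c == '"':
				inString = false
			}
			continue
		}
		switch c {
		case '"':
			inString = true
		case '{':
			stack = append(stack, '}')
		case '[':
			stack = append(stack, ']')
		case '}', ']':
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		}
	}
	var b strings.Builder
	if inString {
		if escaped {
			s = s[:len(s)-1] // drop a dangling backslash
		}
		b.WriteString(s)
		b.WriteByte('"')
	} else {
		b.WriteString(strings.TrimSuffix(strings.TrimRight(s, " \t\r\n"), ","))
	}
	for i := len(stack) - 1; i >= 0; i-- {
		b.WriteByte(stack[i])
	}
	return b.String()
}

// decodeStatus repairs, then decodes. With encoding/json, a duplicated
// "status" key is decoded twice into the same field, so the last one wins.
func decodeStatus(raw string) (string, error) {
	var out struct {
		Status LooseString `json:"status"`
	}
	err := json.Unmarshal([]byte(repairTruncated(raw)), &out)
	return string(out.Status), err
}
```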

#The boring infrastructure

The parts of the system that don't show up in feature lists:

Microsoft Graph polling. Client-credentials OAuth2 with the Mail.ReadWrite application permission, polling each watched mailbox every 60 seconds. Token refresh runs on a separate goroutine; the lock order is acquire-mutex-then-check-expiry to avoid TOCTOU races.
SQLite cache. Modernc's pure-Go driver, WAL mode, busy_timeout 5 seconds. A schema_migrations table tracks every applied schema version with timestamp and description. Adding a new column is a three-line append to the migrations slice; the runner skips already-applied versions.
Filesystem blobstore. Full message bodies and PDFs stored on a NAS, sha256-deduped to handle the case where the same vendor invoice gets sent to multiple mailboxes simultaneously.
MSSQL pool against the ERP. Read-only, sized at 12 open / 4 idle to match the worker concurrency. Pool sizing is the kind of performance detail that bites you if it doesn't account for your worker pools.
CSRF middleware. Browsers attach cached HTTP Basic credentials to every request sent to the site, including cross-origin POSTs initiated by an attacker page. The defense is an Origin / Referer check on every state-changing endpoint. Twenty lines of middleware.
systemd for service lifecycle. The same pattern every other internal Go service uses — a deploy CLI managing systemd units. Rollback is a one-line command.
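For the CSRF item in the list above, an Origin / Referer check in net/http can look roughly like this. It is a sketch, not the Dispatch middleware: the function names are invented, the allowed host is assumed to be a single configured value, and requests carrying neither header are rejected, which is a strictness choice rather than something the text specifies.

```go
package main

import (
	"net/http"
	"net/url"
)

// sameOrigin decides whether a request may proceed. Safe methods always
// pass; state-changing methods must carry an Origin (or, failing that, a
// Referer) whose host equals allowedHost.
func sameOrigin(method, origin, referer, allowedHost string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions:
		return true
	}
	src := origin
	if src == "" {
		src = referer
	}
	if src == "" {
		return false
	}
	u, err := url.Parse(src)
	return err == nil && u.Host == allowedHost
}

// requireSameOrigin wraps a handler and rejects cross-origin writes
// with 403 before they reach it.
func requireSameOrigin(allowedHost string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !sameOrigin(r.Method, r.Header.Get("Origin"), r.Header.Get("Referer"), allowedHost) {
			http.Error(w, "cross-origin request rejected", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

Keeping the decision in a pure function makes it easy to pin with table-driven tests, separate from any HTTP plumbing.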

None of this is interesting on its own. All of it has to work for the interesting parts to function.

#What I'd build differently

Two things in retrospect.

Schema migrations from day one. The cache spent its first months as CREATE TABLE IF NOT EXISTS plus ad-hoc ALTER TABLE ADD COLUMN. Idempotent, but no source of truth for "what version is this DB?" I added a real migrations system later — 30 minutes of work, free thereafter. Should have been there before the second column was added. Every time you defer it, the cost goes up by one schema change.
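For a sense of scale, a migrations runner of this kind fits in a few dozen lines of database/sql. The sketch below is an assumption-laden illustration rather than the code in Dispatch: the two example migrations and the column names of schema_migrations are made up, and the `?` placeholders assume a SQLite driver.

```go
package main

import (
	"database/sql"
	"fmt"
	"sort"
	"time"
)

// Migration is one append-only schema step.
type Migration struct {
	Version     int
	Description string
	SQL         string
}

// migrations is the append-only list; the entries here are hypothetical.
var migrations = []Migration{
	{1, "create messages cache", `CREATE TABLE messages (id TEXT PRIMARY KEY, kind TEXT)`},
	{2, "add followup_at", `ALTER TABLE messages ADD COLUMN followup_at TEXT`},
}

// pending returns the migrations not yet applied, in version order.
func pending(applied map[int]bool, all []Migration) []Migration {
	var out []Migration
	for _, m := range all {
		if !applied[m.Version] {
			out = append(out, m)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Version < out[j].Version })
	return out
}

// migrate applies each pending migration in its own transaction and
// records it, so a failure never leaves a half-applied version behind.
func migrate(db *sql.DB) error {
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS schema_migrations (
		version INTEGER PRIMARY KEY, applied_at TEXT NOT NULL, description TEXT NOT NULL)`); err != nil {
		return err
	}
	rows, err := db.Query(`SELECT version FROM schema_migrations`)
	if err != nil {
		return err
	}
	defer rows.Close()
	applied := map[int]bool{}
	for rows.Next() {
		var v int
		if err := rows.Scan(&v); err != nil {
			return err
		}
		applied[v] = true
	}
	if err := rows.Err(); err != nil {
		return err
	}
	for _, m := range pending(applied, migrations) {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		if _, err := tx.Exec(m.SQL); err != nil {
			tx.Rollback()
			return fmt.Errorf("migration %d (%s): %w", m.Version, m.Description, err)
		}
		if _, err := tx.Exec(`INSERT INTO schema_migrations (version, applied_at, description) VALUES (?, ?, ?)`,
			m.Version, time.Now().UTC().Format(time.RFC3339), m.Description); err != nil {
			tx.Rollback()
			return err
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```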

Don't trust your own substring matches against names. The "acme matches Bacme" bug ate a couple of hours and shipped to staging before being caught. The lesson generalizes: any time you're matching one human-named string against another (vendor names, customer names, domain labels, person names), default to word-token matching, never substring. Write the regression test the moment you see the first collision. The second collision is going to be a vendor your team actually invoices, and it'll be live before you notice.

#What this is, and isn't

Dispatch isn't a replacement for the AP automation vendors. It's the integration layer that the vendors specifically don't ship — work happens in Outlook, dashboard happens in Dispatch, ERP integration happens via direct MSSQL reads, AI handles the long tail of classification, the rest is glue. The vendors solve the 90% that's the same at every company; this is the 10% that has to be built per-company because the 10% IS the per-company part.

That gap is the integration-glue work that internal-platform teams ship at every company that's bigger than five people and smaller than five thousand. It's not novel. It's not a category-defining product. It's the kind of tooling that makes a specific team's specific workflow ten times more pleasant for an investment that pays back in operator hours saved within the first quarter.

Build vs. buy isn't a binary. The right answer at most companies is: buy the commodity, build the glue. The trick is recognizing which is which.
