33 tasks, 1,231 classified emails, and a code review where two of the bugs weren't real

2026-04-12 · ~10 min read · go · ai · ops

Dispatch is the AP-email triage tool I've been building. Outlook stays as the reader and reply tool; Dispatch reads scattered AP shared mailboxes via Microsoft Graph, auto-tags every message with its vendor and PO, runs invoice reconciliation against the ERP, and gives the AP team a workflow surface that scales beyond what categories alone can express.

Why not buy Front? Front is $50/user/month — for 20+ AP and sales clerks that's $12K+/year. M365 we already pay for. The cost-driven design has one consequence: state lives in Outlook Categories, not in a Dispatch database. Edits in Outlook flow through to Dispatch on next poll. No DB to drift.

Coming into this session it was a working prototype. Around 12k lines of Go, deployed to staging, processing real mail copies. The features were there; the polish wasn't. Three god-files were forming (web/main.go at 3,500 lines, cache.go at 2,400, worker/main.go at 1,800), the AP-mode flow had been hammered together over multiple sessions and showed it, and there was no migration system on the SQLite cache.

I asked Claude to do a code review at the start of the session to set the agenda.

#The code review (and the two findings that weren't real)

Eight bug-list items came out of the review:

1. Nil deref in three near-identical storeErr closures
2. CSRF gap on /admin/restart-workers (HTTP Basic auth makes it CSRF-able)
3. MSSQL pool sized at 4, undersized for 8+4+3 worker concurrency
4. PDF byte-slice copy into fallback jobs (sometimes "50MB+" per escalation)
5. OData injection pattern in the Graph client conversation filter
6. SHA256 path validation in the blobstore
7. Graph token refresh TOCTOU race
8. Channel close race in extract pool drain

Six of those were real. I fixed them.

Two were wrong.

The "PDF byte-slice copy" finding (#4) misread Go semantics. The fallbackJob struct has a png []byte field, not the source PDF, and sending a []byte over a channel copies only the 24-byte slice header (on 64-bit platforms), not the underlying byte array. Worst-case memory is ~12MB across 24 queued jobs. The payload itself is never copied.
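A minimal sketch of the slice-header point. The fallbackJob type here is a stand-in with only the png field mentioned above; the helper name sharesBacking and the 512 KiB payload size are assumptions for illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// fallbackJob is a hypothetical stand-in for the real struct; only the
// png []byte field is taken from the text.
type fallbackJob struct {
	png []byte
}

// sharesBacking sends a job through a buffered channel and reports whether
// the received slice still points at the same backing array as the source.
func sharesBacking(src []byte) bool {
	ch := make(chan fallbackJob, 1)
	ch <- fallbackJob{png: src}
	got := <-ch
	return len(src) > 0 && &got.png[0] == &src[0]
}

func main() {
	png := make([]byte, 512<<10) // placeholder payload, assumed size

	// The slice header is pointer + len + cap: 24 bytes on 64-bit platforms.
	fmt.Println("header bytes:", unsafe.Sizeof(png))
	fmt.Println("same backing array:", sharesBacking(png))
}
```

The channel send copies the struct, and the struct holds only the header, so the pixel data is shared rather than duplicated.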

The "token refresh TOCTOU" finding (#7) read the lock order backwards. The mutex is acquired BEFORE the expiry check, not after. The whole function is serialized correctly.
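The lock-then-check shape can be sketched as below. This is not Dispatch's Graph client: tokenCache, its fetch callback, and countFetches are hypothetical names, and the only property taken from the text is that the mutex is held before the expiry check.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenCache is a hypothetical sketch of a refresh-on-expiry token holder.
type tokenCache struct {
	mu      sync.Mutex
	token   string
	expires time.Time
	fetch   func() (string, time.Time, error) // assumed refresh callback
}

// Token takes the mutex BEFORE reading the expiry, so the check and the
// refresh form one critical section and cannot interleave across goroutines.
func (c *tokenCache) Token(now time.Time) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.token != "" && now.Before(c.expires) {
		return c.token, nil
	}
	tok, exp, err := c.fetch()
	if err != nil {
		return "", err
	}
	c.token, c.expires = tok, exp
	return tok, nil
}

// countFetches starts n concurrent callers on an empty cache and returns
// how many refreshes actually ran.
func countFetches(n int) int {
	fetches := 0
	c := &tokenCache{}
	c.fetch = func() (string, time.Time, error) {
		fetches++ // only ever runs with c.mu held, so no extra locking
		return "tok", time.Now().Add(time.Hour), nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = c.Token(time.Now())
		}()
	}
	wg.Wait()
	return fetches
}

func main() {
	fmt.Println("refreshes:", countFetches(50))
}
```

A real TOCTOU race would need the expiry read to happen outside the lock; with this ordering, the first caller refreshes and every later caller sees the fresh token.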

Both are exactly the kind of mistake that's plausible-sounding to a non-specialist reviewer and would have eaten an hour of refactoring for nothing. Always verify findings against the actual code, not the summary. Both wrong findings now sit closed in the task tracker with explanations of why they're not real, so future agents that flag the same patterns see the dismissal reasoning.

#Hybrid classifier: Go rules first, AI fallback

The biggest architectural change of the session: the worker used to call the AI classifier (a small Gemma variant on a consumer GPU) on every non-internal message that didn't already have a clear PO. Each call costs ~1.5s on a warm GPU and contends with the vision-extraction queue. For ~60% of mail the answer is obvious — the subject contains "Invoice #12345", the sender is donotreply@vendor.com, the body has "remittance advice", or the filename is Statement_April.pdf. We were paying GPU time to ask an LLM a question regex can answer.

So I added aiclass.DeterministicKind(subject, sender, body) — high-precision Go regexes for Invoice / OrderConfirmation / Statement / Payment / Credit / Marketing / Webinar / Newsletter. It returns a confident kind or empty. The worker calls it first; only if it returns empty does AI run.

Forty-one test cases pin the rules. Conservative throughout — when in doubt, return empty. Rough estimate: ~60% of mail bypasses AI entirely. Wall-clock impact is small (the AI calls are parallel anyway), but the GPU contention drops materially and the classifier is now zero-marginal-cost for the easy majority.
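A rough sketch of the rules-first flow. The function name and signature come from the text; the three patterns, the "conflicting matches return empty" rule, and the classify wrapper with its ai callback are assumptions, not the contents of the real rule set.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a kind with one high-precision pattern. These patterns are
// illustrative assumptions only.
type rule struct {
	kind string
	re   *regexp.Regexp
}

var rules = []rule{
	{"Invoice", regexp.MustCompile(`(?i)\binvoice\s*#?\s*\d{3,}\b`)},
	{"Statement", regexp.MustCompile(`(?i)\bstatement of account\b`)},
	{"Payment", regexp.MustCompile(`(?i)\bremittance advice\b`)},
}

// DeterministicKind returns a kind only when the rules agree; it returns ""
// when nothing matches or when two different kinds match.
func DeterministicKind(subject, sender, body string) string {
	_ = sender // sender-based rules are omitted from this sketch
	text := subject + "\n" + body
	found := ""
	for _, r := range rules {
		if !r.re.MatchString(text) {
			continue
		}
		if found != "" && found != r.kind {
			return "" // conflicting signals: when in doubt, abstain
		}
		found = r.kind
	}
	return found
}

// classify tries the rules first and calls the (hypothetical) ai callback
// only when the rules abstain.
func classify(subject, sender, body string, ai func(subject, body string) string) string {
	if k := DeterministicKind(subject, sender, body); k != "" {
		return k
	}
	return ai(subject, body)
}

func main() {
	ai := func(subject, body string) string { return "Dispute" } // stub
	fmt.Println(classify("Invoice #12345", "donotreply@vendor.com", "", ai))
	fmt.Println(classify("Quick question", "jo@vendor.com", "can you call me?", ai))
}
```

Returning an empty string rather than a low-confidence guess is what keeps the fast path safe: a miss only costs one AI call.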

Pairs with another change shipped earlier: classify EVERY incoming message, regardless of whether extraction will run. That filled a gap where messages with a clear PO + attachment were skipping classification entirely. Now every row in the queue has a Kind tag for the dropdown filter, the Doc column, and the analytics page to use.

#Backfilling 1,231 messages in 15 minutes

When classify-on-arrival shipped, the historical 1,475 untagged messages were still untagged. I built cmd/classify-untagged as a one-shot — reads cache rows where categories_json NOT LIKE '%Kind: %', calls the AI classifier, PATCHes Outlook + cache.

First run timed out 10/10 calls because the GPU inference host was saturated draining a 149-message rescan queue I'd kicked off earlier. Dead in the water.

Tried CPU-only on a 32-thread Xeon E5-2690 with 346 GB RAM. The small Gemma on CPU: 2 minutes 19 seconds per inference. A 1B-parameter Llama variant on the same box: 60 seconds cold-start, then still 60+ seconds warm because the CPUs were throttled to 60% under thermal management. Dead end. The measurements confirmed what instinct had already suggested: old Xeons without AVX-512 are a poor fit for inference even with a tiny model.

Waited for the GPU host to drain. Re-ran the one-shot. Real numbers from a warm GPU classify:

classified=1475  written=1231  skipped(empty/Other)=243  errs=1
done in 15m3s  ·  1.4 messages/sec

The single error was a Graph 404 — the message was deleted from Outlook between the cache snapshot and the PATCH. Cache rows can outlive their messages.

Final Kind distribution across all messages: invoices dominated, then automation noise, then internal threads, then disputes, then credits/statements/order confirmations in a long tail. The dispute count was the surprise: much higher than I'd guessed, and worth a future analytics drill-down.

#4-bucket AP queue + view-as-clerk

The AP-mode flow had three tabs: Todo / Waiting / Done. The pilot AP clerk and I had sketched what the actual mental model is, and it's four:

Unassigned — pickup pile, Owner is empty
Todo — Owner == me, the work I'm actually doing
Waiting — Status: Blocked, an issue I've raised
Done — Status: Done (Posted)

Reframing the buckets surfaced a separate problem: clerks need to see what other clerks are working on. Not impersonate them — just glance at their queue. I added ?view=<userid> that filters Todo by another clerk's owned messages with a read-only banner, hidden decision bar, and disabled keyboard shortcuts. The selector is a dropdown next to the queue counter.

While I was in there, I added side-by-side mode: when the message has a PDF, the PDF goes sticky on the left at full viewport height, and the right column scrolls with headline / totals / recon mismatch / collapsible email body / notes / Email Buyer + Email Vendor buttons. With the two panels side by side, a clerk doesn't have to scroll the PDF away to read the recon details.

Plus smaller polish: keyboard nav in the Assign / Hold pickers (↑/↓/Enter/Esc), PO# + Invoice# as colored pills in the headline, "Email Rep" renamed to "Email Buyer" (the rep IS the buyer; rename matches semantics), and pre-classification messages hidden from Unassigned (waited until classify-on-arrival shipped so the Unassigned pile doesn't briefly stutter).

#The follow-up timer

A held message (Status: Blocked + Blocker: Vendor) is dead-letter today. Nothing surfaces it back; the clerk has to remember. I built an automatic resurface system: when a clerk holds with reason X, the system stamps a followup_at timestamp based on the reason. Vendor holds get 72h (vendor needs time to respond). PO holds get 48h. Pricing/Purchasing get 24h (internal, faster). Won't Pay never resurfaces — explicit close.

A goroutine ticks every 60 seconds, finds rows where followup_at <= now, and resurfaces them: strip Blocker:* + Status:Blocked, set Status:New, append a system note ("Follow-up timer fired at X — resurfaced from Waiting"), clear the timestamp. The clerk sees the message reappear in their Todo with breadcrumb context.
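The timer logic can be sketched roughly as follows. The durations and the followup_at <= now comparison come from the text; the reason strings as map keys, the function names, exampleHeldAt, and the stop channel are assumptions (the real sweeper is described as a plain for range t.C loop).

```go
package main

import (
	"fmt"
	"time"
)

// followupDelay maps a hold reason to its resurface delay. Reasons that are
// absent (such as "Won't Pay") never resurface.
var followupDelay = map[string]time.Duration{
	"Vendor":     72 * time.Hour,
	"PO":         48 * time.Hour,
	"Pricing":    24 * time.Hour,
	"Purchasing": 24 * time.Hour,
}

// exampleHeldAt is an arbitrary timestamp used only by the demo below.
var exampleHeldAt = time.Date(2026, 4, 12, 9, 0, 0, 0, time.UTC)

// followupAt returns the resurface time for a hold, or ok=false when the
// reason has no timer.
func followupAt(reason string, heldAt time.Time) (at time.Time, ok bool) {
	d, ok := followupDelay[reason]
	if !ok {
		return time.Time{}, false
	}
	return heldAt.Add(d), true
}

// due mirrors the sweep predicate followup_at <= now; a zero time means
// "no timer set".
func due(at, now time.Time) bool {
	return !at.IsZero() && !at.After(now)
}

// runSweeper calls sweep once per tick until stop is closed. Ticks are
// handled one at a time on this goroutine, so sweeps never overlap.
func runSweeper(interval time.Duration, stop <-chan struct{}, sweep func(now time.Time)) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case now := <-t.C:
			sweep(now)
		case <-stop:
			return
		}
	}
}

func main() {
	at, ok := followupAt("Vendor", exampleHeldAt)
	fmt.Println(at.Sub(exampleHeldAt), ok)
	_, ok = followupAt("Won't Pay", exampleHeldAt)
	fmt.Println(ok)
	fmt.Println(due(at, exampleHeldAt), due(at, at))
}
```

In a service, runSweeper would be started with a 60-second interval and a sweep callback that queries the due rows and rewrites their categories.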

Storage is one new column on the messages table (followup_at TIMESTAMP), which prompted me to actually build the schema migrations system I should have built months ago.

#Schema versions: a migrations table that costs $0 up front

The cache had been using CREATE TABLE IF NOT EXISTS + ad-hoc ALTER TABLE ADD COLUMN migrations. Idempotent, but it left a regex on sqlite_master.sql to detect a legacy CHECK(id=1) constraint as evidence of pre-pool-split schema. That regex is a smell. I added a schema_migrations table that records each applied version, an ordered migrations slice with versioned entries, and a runner that skips already-applied versions. v1 is the baseline (every existing CREATE/ALTER); v2 added the followup_at column.
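A minimal sketch of that runner shape, with the database abstracted away so it runs standalone. The migration struct, the apply callbacks, and the applied map (standing in for rows in schema_migrations) are assumptions; real code would execute SQL inside a transaction.

```go
package main

import (
	"fmt"
	"sort"
)

// migration is one versioned schema step.
type migration struct {
	version int
	desc    string
	apply   func() error // stand-in for executing the migration's SQL
}

// runMigrations applies, in ascending version order, every migration not yet
// recorded in applied, records each success, and returns the versions it ran.
func runMigrations(ms []migration, applied map[int]bool) ([]int, error) {
	sorted := append([]migration(nil), ms...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].version < sorted[j].version })

	var ran []int
	for _, m := range sorted {
		if applied[m.version] {
			continue // already applied on this DB
		}
		if err := m.apply(); err != nil {
			return ran, fmt.Errorf("migration v%d (%s): %w", m.version, m.desc, err)
		}
		applied[m.version] = true // stand-in for INSERT INTO schema_migrations
		ran = append(ran, m.version)
	}
	return ran, nil
}

func main() {
	noop := func() error { return nil }
	ms := []migration{
		{1, "baseline", noop},
		{2, "add followup_at column", noop},
	}
	applied := map[int]bool{1: true} // a DB that already has the baseline

	ran, err := runMigrations(ms, applied)
	fmt.Println(ran, err)
	ran, err = runMigrations(ms, applied)
	fmt.Println(ran, err)
}
```

Running it twice shows the property that matters: the second pass finds nothing to do.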

The cost of building this was ~30 minutes. The cost of NOT building it would be every future schema change reasoning about "have I run this on this DB yet?" without a way to check. Future-me will know exactly what schema state any DB is in.

#ACME vs. Bacme

The vendor resolver couldn't handle ambiguous corporate domains. A sender at accounts@parent-corp.com resolved to no specific ERP vendor because the parent corporation has multiple sub-account brands in the master. The result was Vendor: Unknown, and the clerk had to figure out the brand from the email body.

I added a MatchBrand match type. When a domain is ambiguous AND the domain label appears in at least one candidate's vendor name, emit the brand as the vendor — Vendor: ACME, no specific VendorID. PO override still kicks in if the message has a PO number; manual disambiguation otherwise. Generic domains like gmail.com still return Unknown because no real vendor name contains "gmail."

Then in the end-of-day bug hunt, a fresh review caught a regression in this code: I'd written strings.Contains(normalizeName(c.VendorName), label). So acme.com would match a vendor "Bacme Inc" because "acme" is a substring of "bacme." The ERP doesn't currently have a vendor with that exact collision, but the latent bug is real. Fix: word-token match instead of substring, with regression tests pinning both the false-positive case (Bacme stays Unknown) and the real-word case (ACME Industrial matches).
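The difference between the two matching strategies, as a small sketch. The helper names tokens and hasToken are hypothetical, and splitting on any non-letter, non-digit rune is one assumed way to define a "word"; the real normalizeName may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokens lowercases a vendor name and splits it on anything that is not a
// letter or digit.
func tokens(name string) []string {
	return strings.FieldsFunc(strings.ToLower(name), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// hasToken reports whether label appears as a whole word in vendorName.
func hasToken(vendorName, label string) bool {
	label = strings.ToLower(label)
	for _, t := range tokens(vendorName) {
		if t == label {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(strings.Contains("bacme inc", "acme")) // true: the substring false positive
	fmt.Println(hasToken("Bacme Inc", "acme"))         // false
	fmt.Println(hasToken("ACME Industrial", "acme"))   // true
}
```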

#The bug hunt that found its own bugs

End of session, I asked for a fresh review of everything we'd shipped. Three parallel agents (worker layer, web layer, cache + classifier) returned a combined ~10 findings.

After verification: 2 real bugs, 5 false alarms, 3 nice-to-have nits.

The 2 real bugs:

Fallback channel close race — earlier in the day I'd fixed this on the extract pool (read len BEFORE close). The fallback drain has the same pattern and I'd missed it. Cosmetic — info-log line only — but should match.
The ACME/Bacme substring bug described above.

The false alarms reproduced patterns from the original code review: Go slice semantics misreads, lock-order misreads, "race condition" claims that weren't (the followup sweeper runs in a single-threaded for range t.C loop, no concurrent ticks). One agent flagged a UTF-8 issue in repairTruncatedJSON — I checked: the function tracks inStr based on " (ASCII 0x22), and UTF-8 multi-byte sequences never contain 0x22 in any byte, so byte-iteration is safe. Wrong claim.
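The UTF-8 argument can be checked mechanically. This sketch is not repairTruncatedJSON itself; the helper names are hypothetical. It verifies that no multi-byte encoding contains the byte 0x22 and that byte-wise and rune-wise scans find quotes at the same offsets.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// multiByteHasQuote encodes every valid rune >= 0x80 and reports whether any
// byte of any encoding equals 0x22 ('"').
func multiByteHasQuote() bool {
	buf := make([]byte, utf8.UTFMax)
	for r := rune(0x80); r <= utf8.MaxRune; r++ {
		if !utf8.ValidRune(r) {
			continue // skip surrogates
		}
		n := utf8.EncodeRune(buf, r)
		for _, b := range buf[:n] {
			if b == 0x22 {
				return true
			}
		}
	}
	return false
}

// sameOffsets reports whether scanning s byte by byte finds '"' at exactly
// the offsets that a rune-decoding scan finds.
func sameOffsets(s string) bool {
	var byBytes, byRunes []int
	for i := 0; i < len(s); i++ {
		if s[i] == '"' {
			byBytes = append(byBytes, i)
		}
	}
	for i, r := range s {
		if r == '"' {
			byRunes = append(byRunes, i)
		}
	}
	if len(byBytes) != len(byRunes) {
		return false
	}
	for i := range byBytes {
		if byBytes[i] != byRunes[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("quote byte inside a multi-byte sequence:", multiByteHasQuote())
	fmt.Println("scans agree:", sameOffsets(`{"name":"Zoë café €5"}`))
}
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter can never appear inside one.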

The pattern is: AI code reviews are good at surface-level pattern recognition and bad at deep correctness arguments about Go concurrency primitives and Unicode encoding. Trust the shallow pattern-match findings (OData filter injection, missing nil checks, undersized pools); verify the deep ones against the actual code.

#What's different now

Three god-files split into 13 (cache: 5 files, worker: 2, web: 4). cache.go down 38%, worker main.go down 40%, web main.go down 20%
schema_migrations table records every migration with timestamp + description. v2 added the followup_at column
Hybrid classifier: deterministic Go rules pre-empt the AI for ~60% of mail. Conservative regexes; AI handles the rest
1,231 historical messages classified in 15 minutes via a one-shot tool
AP queue: 4 buckets, side-by-side PDF, view-as-clerk, keyboard nav, follow-up timer, body card
Vendor resolver: brand match for ambiguous corporate domains, word-token matching to avoid Bacme-style false positives
Verify pipeline: tolerates type-loose JSON output ("status": 1 instead of "match"), drops duplicate top-level keys (model-repeating-itself pattern), repairs truncated JSON
Admin overview page with leaderboards: vendors / buyers with most disputes, vendors with most blocked items, kind distribution
CSRF middleware on every state-changing endpoint (HTTP Basic auth makes them CSRF-able)
6 review-bug fixes shipped, 2 review findings dismissed with documentation explaining why they weren't real
33 tasks completed, 9,047 lines added, 1,724 deleted, all on staging and verified

#What I learned

AI code reviews are right about ~75% of findings, but the wrong 25% has a consistent shape: Go slice semantics, lock-order analysis, byte-vs-rune iteration, channel/goroutine reasoning. The narrow pattern matches (filter injection, nil checks, missing validation) are reliable; the deep correctness claims need verification against the actual code. Document the dismissals so future agents see the reasoning when they re-flag the same pattern.

Hybrid classifiers (rules first, AI fallback) win on the cheap-and-easy 60% and reserve AI for the hard 40%. This pattern works because the easy majority has obvious signals (subject keywords, sender domains, filename patterns) that regex handles reliably. Trying to write rules for the long tail is where the cost explodes; AI does that for free. Same shape as a CDN cache: fast path for hot keys, full path for cold keys. The mental model "AI is the slow path you fall back to" is correct.

Same-package multi-file splits are the cheap version of "extract a sub-package." No exports to manage, no import cycles to worry about, no API surface to design. You get the readability win (each file is a coherent subsystem) without the structural commitment. cache.go split into 5 files in one afternoon; sub-package extraction would have been a full day with churn across every callsite. Sub-packages are still the right end state for stable subsystems, but the file split is what unblocks navigation right now.

Schema versioning is cheap to add and free thereafter. The cache had been using IF NOT EXISTS + ALTER patchwork — idempotent, but with no source of truth for "what version is this DB?" A 30-minute migration runner replaced the patchwork; every future schema change is now a 3-line append to the migrations slice. The cost was small; the value compounds with every change.

Don't trust your own substring matches against names. strings.Contains("bacme", "acme") is true. strings.Contains("harrow", "arrow") is true. The pattern "label as substring" is almost always wrong; you want "label as word" or "label as exact token." When matching company names, vendor identifiers, or anything human-named, default to word-token matching and write regression tests pinning both directions of the boundary case.

HTTP Basic auth needs CSRF protection. Once a user has authenticated, the browser automatically attaches the Basic credentials to every request to that origin, including cross-site POSTs initiated by an attacker's page. The defense doesn't have to be a token; an Origin / Referer header check is enough here. Modern browsers send Origin on cross-origin POSTs; the middleware is ~20 lines. Apply it to every state-changing endpoint, not just the obvious ones.
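A sketch of such a same-origin check using only the standard library. The names sameOrigin, requireSameOrigin, and statusFor, and the dispatch.example host, are hypothetical. The sketch assumes the app sees the real Host header (no rewriting proxy in front) and compares hosts only, not schemes.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// sameOrigin checks Origin first, then falls back to Referer, and requires
// the host to equal the request's Host. Requests with neither header fail.
func sameOrigin(r *http.Request) bool {
	src := r.Header.Get("Origin")
	if src == "" {
		src = r.Header.Get("Referer")
	}
	if src == "" {
		return false
	}
	u, err := url.Parse(src)
	if err != nil {
		return false
	}
	return u.Host != "" && u.Host == r.Host
}

// requireSameOrigin lets safe methods through and rejects state-changing
// requests that fail the same-origin check.
func requireSameOrigin(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet, http.MethodHead, http.MethodOptions:
			next.ServeHTTP(w, r)
			return
		}
		if !sameOrigin(r) {
			http.Error(w, "cross-origin request rejected", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor POSTs to a wrapped no-op handler with the given Origin header
// (omitted when empty) and returns the response status code.
func statusFor(origin string) int {
	h := requireSameOrigin(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	req := httptest.NewRequest(http.MethodPost, "http://dispatch.example/admin/restart-workers", nil)
	if origin != "" {
		req.Header.Set("Origin", origin)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("http://dispatch.example")) // same origin
	fmt.Println(statusFor("https://evil.example"))    // attacker page
	fmt.Println(statusFor(""))                        // no provenance header
}
```

Rejecting requests that carry neither header is the stricter choice; it can break non-browser clients, which would need an allowlist or a separate auth path.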

#The tech

Go stdlib net/http + chi router + html/template for the web; goroutines + channels for the worker pools (sort/extract/fallback)
HTMX + Alpine.js + Bootstrap 5 for UI, CDN-loaded, no build step
Microsoft Graph API via direct HTTP — client credentials OAuth2, Mail.ReadWrite permission
SQLite for cache (modernc.org/sqlite, WAL mode, busy_timeout 5s) — schema v2 today, migrations append-only
Filesystem blobstore for full message bodies + PDFs, sha256-deduped, sitting on a NAS
MSSQL read-only against the ERP — connection pool sized at 12 open / 4 idle for 8+4+3 worker concurrency
Ollama for inference — small Gemma text classifier on a consumer GPU (~1.5s warm), small vision model on the same box, larger Gemma verify-fallback on a Tesla P40
Hybrid classifier: Go regex first (internal/aiclass/deterministic.go), AI fallback when no rule matches
systemd for service lifecycle on a staging Ubuntu box
Worker pools: 8 sort + 4 extract + 3 fallback goroutines, each with cooldown circuit-breaking on failed PDFs
Follow-up timer: 60s ticker, 1 column on the messages table, ~150 lines of Go
CSRF middleware: same-origin check (Origin then Referer fallback) on all state-changing requests
One-shot classifier backfill: cmd/classify-untagged, dry-run by default, 1.4 msg/sec at concurrency=4
