Some AI work is bounded by a hard rule: certain data cannot leave the building. Not to a frontier-model API, not to a cloud OCR service, not to a third-party vector database. The whole pipeline runs on hardware the client controls, or it doesn't run at all. Treated as a problem, that's annoying. Treated as a design input, it's the most useful one you'll get — it rules out the patterns you might otherwise reach for by reflex, so the patterns that survive have already paid for themselves.
Three moves under the boundary
First instinct on a document pipeline: write a path for every input shape — scans, photos, faxes, the whole zoo. Second instinct, after asking the operator what they actually receive: 100% digital PDFs with a real text layer. The OCR lane was theoretical. Cut it. The inspector becomes a single-path gate — digital advances, anything else quarantines for human triage. Day a scan actually arrives, add the lane back or push OCR upstream into intake. Until then, YAGNI.
Inside the boundary, the client already runs a hardened redactor — fourteen recognizers tuned against real production traffic, a precision-first threshold, a deny-list seeded from their own staff table. We considered designing our own. Discarded the idea inside an hour. Wrap what works. The middle is solved; the seams — orchestration, audit, the bridge to the human reviewer — are where we earn our keep.
Every document passes through a licensed reviewer before release. Not a regression — the load-bearing trust contract of the whole system. The reviewer sees the redacted PDF next to a span ledger (every redaction, every recognizer, every confidence score) and approves, requests more redactions, or releases an over-redaction. The pipeline's job is to shorten their per-document time, not eliminate them. Auto-approve is banned.
What it looks like
Phase 1 · INTAKE 📁 Local Drive · PDF Inbox │ Phase 2 · INSPECT 🔍 Inspect (PyMuPDF) has text layer? page count? │ 🚦 Gate · digital-only digital → next │ scan ╌╌► quarantine ⚠ Phase 3 · EXTRACT 📄 Docling (digital) text + tree + tables │ Phase 4 · STRUCTURE 🤖 Classify + MD skeleton local LLM · on-prem │ Phase 5 · DETECT ◄────────────╮ (request more redactions) 🛡️ Detect PII inherited recognizers │ precision-first threshold Phase 6 · REDACT ✂️ In-place patch + verify text-extract audit │ ╌╌► detect 3 strikes → quarantine Phase 7 · EMIT 💾 Sanitized MD + redacted PDF atomic write + audit row │ 🧑⚖️ Staff review · CPA / EA ≤30 min in season │ decision → signal row Phase 8 · END 🏁 Done — downstream reads the sanitized markdown
The cuts
Two design decisions worth naming, both counterintuitive to anyone who reads too many architecture diagrams.
| Component | Why it looked good | Why it's gone | Revisit when |
|---|---|---|---|
| OCR scan lane | Handle any input shape | Intake is 100% digital. Lane unused. | A scan actually arrives |
| LLM arbiter | Triage low-confidence spans for the reviewer | No data yet that recognizers leave enough noise to be worth a model | Review time > SLA or miss-rate > floor |
| 2nd-pass OCR verify | Catch PII hiding in raster pixels | Digital-only intake — no pixels for PII to hide in | OCR lane comes back |
Why the boundary helps
The frame I'd push back on is "on-prem is the compromise we make for compliance." It is not — it's a forcing function that prevents three failure modes we see often in cloud-first projects.
Blocks scope creep into the model layer. When you can't call a frontier API mid-pipeline, you stop trying to. Deterministic code stays deterministic. The model is reserved for the one or two steps where pattern-matching beats rules — and the model is whatever runs on hardware you control, which is usually small. Smaller models force tighter prompts. Tighter prompts force clearer thinking.
Surfaces what you actually need from the human. You can't punt the hard cases to a cloud arbiter you don't own. The hard cases land on the reviewer's desk — so the rest of the system gets designed to respect their time: rank the queue, surface the rationale, default to under-redaction so their instinct (catch the miss) matches the system's.
Makes the audit story trivial. Every byte of PII lives on a machine the client owns. Every reviewer decision lives in a row in their database. The compliance conversation collapses into "show us the audit table" — and the table is there, because the pipeline can't work without it.
The pattern travels
Specifics are domain-bound — tax forms, a specific redactor, a specific reviewer role. The shape is general. Anywhere data residency is non-negotiable, the design collapses to the same three moves: cut the lane you don't need, inherit the part that works, keep the human gate. Anywhere it doesn't, the same discipline is still good engineering — the constraint just isn't doing the work of forcing it on you.