JOUNES // ONLINE
Essays  ·  GitHub  ·  LinkedIn  ·  Email
// AI PRODUCT ENGINEERING

I orchestrate AI agents
to ship measurable
products.

The demos below are the proof. The discipline below is the differentiator.

// The work

I direct AI coding agents — Claude Code, Copilot, Codex, others — through architectural decisions, drive the build, and then I test what comes back. I don't review the code line by line; I run it, push the edges (drop a screenshot it hasn't seen, ask the question the README skips, lower the threshold until it breaks), and when something doesn't work we hunt down the cause together. The keystrokes are increasingly not mine. The product decisions, the validation, and the refusal to ship what hasn't been proven end-to-end are.

The toolkit varies with the task — spec-kit, GSD, BMAD, or my own UCAI plugin to drive the build; Pragma to block AI-written tests when they're gamed; benchmark tables when a decision is worth pinning. The format follows the project; the discipline doesn't. Hard constraints, SOLID and DRY applied gradually — sometimes the right answer is a sharp principle, sometimes it's a grey-zone judgment that an 80%-correct shape now beats a 100%-correct shape later. Docs updated as the code moves.

// The thesis

Cognition is context-bound. Humans and large language models are two implementations of the same problem class. The bottleneck for both is the spec — the completeness of the context fed in. Most projects, agent systems, and human collaborations fail not because the engine is dumb, but because the input was incomplete.

The implication: as agents get more capable, spec authorship becomes the leverage point. Execution, retrieval, and even reasoning are commoditising. What stays scarce is the discipline to drive a spec to asymptotic completeness, and to propagate it between cognitive nodes — human or silicon — without lossy resampling. The durable role isn't writing code. It's sensor (observation outside the model), spec author (completeness as discipline), adversarial reviewer (attack the spec before it ships), and dispatcher (route the spec to the right engine, re-route when one fails).

I treat my own collaboration with agents as an instance of that frame. The system prompt, the rules I keep, the notes I take and the hooks I write are all the same thing: a substrate the model thinks through. Durable behavioural rules live as markdown in a version-controlled directory, not as advice scattered across prompts — and they're enforced in two layers. Rules with a clear action signature ("block git commit if PII present," "refuse bash when a dedicated tool fits," "stop response if it ends in a check-in question") are promoted to deterministic hooks that run before tool calls or after responses; the model never gets a chance to rationalise its way past them. Rules that depend on conversational judgment — uncertainty signaling, brevity, deferring — stay as markdown and are routed in by bouncer, a small FastAPI daemon that scores each user prompt against the rule library and admits only the contextually relevant ones plus a small always-on core. It's the same dual layering that ESLint (autofix vs warn) and OPA (deny vs alert) have used for years on large rule sets — borrowed, because it works.
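A minimal sketch of the deterministic layer, to make the shape concrete: a PreToolUse-style hook that blocks a git commit when the staged diff matches a PII pattern. The JSON-on-stdin contract, the field names, and the exit-code convention are assumptions about the hook runner; the patterns are illustrative.

```python
#!/usr/bin/env python3
"""PreToolUse-style hook: block `git commit` when the staged diff looks like it
contains PII. Sketch only; the stdin JSON fields and exit-code convention are
assumptions about the hook runner, not a documented API."""
import json
import re
import subprocess
import sys

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

call = json.load(sys.stdin)                      # e.g. {"tool_name": "Bash", "tool_input": {"command": ...}}
command = call.get("tool_input", {}).get("command", "")

if call.get("tool_name") == "Bash" and "git commit" in command:
    staged = subprocess.run(
        ["git", "diff", "--cached", "-U0"], capture_output=True, text=True
    ).stdout
    if any(p.search(staged) for p in PII_PATTERNS):
        print("Blocked: staged diff matches a PII pattern.", file=sys.stderr)
        sys.exit(2)                              # deterministic block; nothing to rationalise past

sys.exit(0)                                      # everything else passes through
```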

// Writing

Long-form essays where the operating discipline gets argued out, scope-conditioned, and stress-tested against its own falsifiers.

~/essays/context-as-cognitive-substrate
essay

Context as Cognitive Substrate

~22 min read · AI engineering · framing concept · published 2026-05 · updated 2026-05-05.

Cognition is bounded by context. Humans and LLMs are two implementations of the same problem class; the bottleneck for both is the spec. The piece distinguishes cognitive, operational, and economic claims, names where the framework breaks (the metis boundary), states falsifiable predictions with binding falsifiers, and ends in a three-rule distillation that survives Tuesday morning.

// Lab notes

Things I run for myself, not shipped products. They earn their place by carrying the same operating discipline as the published work — and they're where most of the load testing of that discipline actually happens.

Continuous-research pipeline · always-on

// Python · forked from an academic-paper agent and rebuilt for tech/industry research · writes interlinked Obsidian wiki pages instead of LaTeX.

A 12-stage pipeline organised in four phases — Scout (parse a topic into 5–8 sub-questions and search queries), Gather (hit six source backends across web and academic databases, dedupe, fetch, LLM-screen for relevance, produce knowledge cards), Analyze (cluster findings, build comparison tables, then re-read everything through three deliberately distinct lenses: pragmatist — what's usable today, contrarian — what are the risks nobody is talking about, builder — what could be assembled on top), and Synthesize (outline, draft, run a quality + slop-detection pass, revise, write to the vault).

It runs on a schedule against a topic queue. Output is wiki pages, not reports — every entity gets its own page with backlinks, so a year of research compounds into a navigable graph instead of an inbox of PDFs. The same vault feeds the rule router behind bouncer, so research findings can become routable rules without leaving their original context.
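For orientation, the phase structure reduces to a small, explicit data shape. A sketch in Python: the stage names below are illustrative (they don't map one-to-one onto the real twelve) and the callables are placeholders.

```python
"""Shape of the four-phase pipeline; stage names are illustrative, bodies are placeholders."""
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Topic:
    title: str
    sub_questions: list[str] = field(default_factory=list)
    cards: list[dict] = field(default_factory=list)    # knowledge cards out of Gather
    findings: dict = field(default_factory=dict)       # clusters, tables, lens readings
    pages: list[str] = field(default_factory=list)     # wiki pages written to the vault

PHASES: list[tuple[str, list[str]]] = [
    ("scout",      ["parse_topic", "derive_queries"]),
    ("gather",     ["search_backends", "dedupe", "fetch", "screen", "make_cards"]),
    ("analyze",    ["cluster", "compare", "read_lenses"]),    # pragmatist / contrarian / builder
    ("synthesize", ["outline", "draft", "quality_pass", "revise", "write_vault"]),
]

def run(topic: Topic, stages: dict[str, Callable[[Topic], Topic]]) -> Topic:
    """Run every stage in order; each one takes and returns the evolving Topic."""
    for _phase, names in PHASES:
        for name in names:
            topic = stages[name](topic)
    return topic
```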

Always-connected operating environment · always-on

// two always-on machines + workstation + phone, mesh-VPN'd · no public ports · watchdogs feed a private chat bot · automatic recovery before anything pages.

The machines I work on are stitched together by an encrypted mesh, not exposed to the public internet. From a phone I can SSH into any of them, hit private web UIs (vault browser, session browser, dashboards, local LLM frontend), and watch services without opening a single port to the world. Same access from a laptop on a coffee-shop network. The discipline this enforces: no service ever needs a public hostname to be useful, which removes a whole class of attack surface and keeps the threat model honest.

Underneath that there's a layered watchdog stack — heartbeat logger (one-minute ring buffer for post-mortems), connectivity watchdog (link / gateway / internet / mesh), system health (disk, critical services, thermal), index-freshness watchdog (catches stale background jobs), filesystem-event watcher that reloads the rule router the moment a rule file changes locally or arrives via sync from another machine. Failures recover automatically where they can; what can't be auto-fixed pages me on a private chat bot with the diagnostic already attached. Backups are timer-driven with jitter and persistence, so a missed window fires on next boot rather than going silent.

  • Boot notify — every reboot pings the bot, so I know unattended hardware came back up.
  • Belt-and-braces reloads — primary path is the inotify watcher; a 15-minute cron does an idempotent re-poke in case the watcher dies.
  • Logs-as-context — every watchdog writes a self-truncating log, which is exactly the surface a future agent will read when something needs explaining.
  • No-secrets-in-repos — credentials live in mode 0600 root-owned files on the box that needs them, never in git.
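A minimal sketch of the reload path named above, using the Python watchdog library; the paths, the reload endpoint, and the .md filter are placeholders for whatever the real daemon exposes.

```python
"""Filesystem-event reload: re-poke the rule router when a rule file changes.
Sketch using the `watchdog` library; RULES_DIR and RELOAD_URL are placeholders,
and the real setup also keeps a 15-minute cron as backstop."""
import time
import urllib.request

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

RULES_DIR = "/srv/rules"                      # markdown rule library, synced between machines
RELOAD_URL = "http://127.0.0.1:8777/reload"   # hypothetical reload endpoint on the router

class RuleChange(FileSystemEventHandler):
    def on_any_event(self, event):
        if event.is_directory or not event.src_path.endswith(".md"):
            return
        # Idempotent: reloading an unchanged library is a no-op server-side.
        urllib.request.urlopen(RELOAD_URL, timeout=5)

observer = Observer()
observer.schedule(RuleChange(), RULES_DIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(60)                        # keep the process alive; watchdog threads do the work
except KeyboardInterrupt:
    observer.stop()
observer.join()
```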

None of this is a product; it's the substrate the products grow on. It's also the reason I trust agents to do long-running autonomous work — the box is observable, the failure modes are caught, and the recovery is automatic.

Local-LLM lane · privacy + cost

// integrated GPU on a small home server · GGUF runtime · web frontend on the private mesh.

A local-inference lane sits next to the cloud-frontier lane. Small dense models capped around 9B for fast general-purpose work; larger models only when they're Mixture-of-Experts with a small active expert per token (so a 30B-class model with a 3–4B active expert runs comfortably on integrated graphics, while a dense 30B would not). I don't try to match frontier-model quality locally — that's the wrong target.

What it's actually for: privacy lane for anything I'd rather not send to a third-party API (vault-capture summarisation, draft passes over personal notes, peer-review of plans I haven't published yet); cost-zero secondary opinion when one model's answer needs sanity-checking but a second cloud round-trip isn't worth it; offline fallback for short internet outages. The discipline: local inference is a fallback and a privacy lane, not a replacement for frontier models on hard reasoning. Routing the right task to the right engine is part of the dispatcher role from the essay; this is that role made physical.
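The routing decision itself is small enough to write down. A sketch, with illustrative flags and thresholds rather than the actual dispatch logic:

```python
"""Dispatcher sketch: route a task to the local lane or a frontier model.
Labels and defaults are illustrative; the point is that routing is an explicit,
inspectable decision rather than a default to the biggest model."""
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    private: bool = False         # touches vault notes, unpublished plans, personal data
    hard_reasoning: bool = False  # multi-step planning, novel code, long-horizon work
    online: bool = True           # is the cloud lane reachable right now?

def route(task: Task) -> str:
    if task.private:
        return "local"            # privacy lane: never leaves the mesh
    if not task.online:
        return "local"            # offline fallback for short outages
    if task.hard_reasoning:
        return "frontier"         # a 9B model is the wrong target for hard reasoning
    return "local"                # cost-zero default for routine work

assert route(Task("summarise today's vault capture", private=True)) == "local"
assert route(Task("design the migration plan", hard_reasoning=True)) == "frontier"
```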

// What I enforce

Each rule below is either checked by code (a hook that blocks the action) or written as a single imperative trigger that the routing daemon re-injects when the situation calls for it. Rules that resist both forms don't get captured.

01
NEVER REINVENT. Search GitHub and the literature before building. Use as-is, build on top, or borrow parts. The cost of a 30-minute search beats the cost of maintaining a 200-line reinvention forever.
02
AUTOMATION MUST BE TRANSPARENT, NOT OPT-IN. Solutions that require me to remember to invoke them just relocate cognitive load. Hooks and watchdogs that fire on their own are acceptable; slash commands and "run X first" instructions are not.
03
EVERY RULE NEEDS AN ENFORCER. If a rule can be machine-checked (paths, secrets, tool selection), it's a hook. If not, it's a one-line trigger written in the imperative. A rule that resists both forms is a preference, not a rule.
04
VERIFY BEFORE CLAIMING DONE. Tests run, logs checked, behaviour diffed. Applies to me, applies harder to agents — they will confidently report success on work they didn't do unless asked to prove it.
05
GROUND PERSONAL AND TIME-SENSITIVE QUESTIONS IN SOURCE. Questions about systems I've built, my preferences, or state past the model's training cutoff get answered from the project's source-of-truth files first, not from training-data pattern matching.
06
EVERY TOKEN EARNS ITS PLACE. Long is the failure mode, not the default. Strip recaps, hedging, restatements. Lead with the answer; caveats only when they change it. Code and tool output are exempt.
07
NO SESSION-WRAP CHATTER. When work concludes, stop. No "good place to pause?", no "want me to continue?", no unsolicited next-steps lists. Momentum is the scarce resource; meta-commentary interrupts it.

Rules live as plain markdown files, organised by trigger context. bouncer handles the routing into the agent's per-turn context; a separate output-validator hook blocks responses that match known violation patterns (trailing check-in questions, recap sections, padded preambles). The hot list of behavioural rules stops being prose I have to remind the model about and becomes a structural part of the loop.

// Selected projects

~/projects/bouncer

bouncer

Python · MIT · FastAPI daemon · per-rule activation predicates (regex + semantic) · the residual layer under deterministic hooks.

Predicate-gated rule injection for long-running LLM sessions. Started as a pure routing daemon — re-inject a hot list of behavioural rules into every prompt to fight drift, but cap what's injected by situational match (a regex trigger_keywords filter and/or a per-rule embedding prototype) instead of letting the list dilute past 30 rules. As the rule library grew, the rules with clear action signatures got promoted to PreToolUse / Stop hooks — code that runs deterministically before any tool call or after any response, no LLM interpretation. bouncer's role shrank to the residual: rules that depend on conversational judgment. Two-layer enforcement, same shape as ESLint's autofix/warn and OPA's deny/alert. A private companion tool reads my agent transcripts, clusters recurring corrections and omission-catches across weeks, and surfaces the patterns that justify becoming a new rule or a new hook — the discovery loop that feeds the library bouncer enforces.
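The admission decision is the interesting part, so here is its shape as a sketch: rule fields mirror the description above (a trigger_keywords regex, an optional embedding prototype), while the threshold and the embedding source are stand-ins.

```python
"""Predicate-gated rule admission, the shape of the routing decision.
Field names follow the description above; the prompt embedding and threshold
are placeholders for whatever the daemon actually uses."""
import re
from dataclasses import dataclass

import numpy as np

@dataclass
class Rule:
    text: str
    always_on: bool = False
    trigger_keywords: str | None = None        # regex over the user prompt
    prototype: np.ndarray | None = None        # embedding of example trigger prompts

def admit(rule: Rule, prompt: str, prompt_vec: np.ndarray, threshold: float = 0.45) -> bool:
    if rule.always_on:
        return True
    if rule.trigger_keywords and re.search(rule.trigger_keywords, prompt, re.I):
        return True
    if rule.prototype is not None:
        sim = float(prompt_vec @ rule.prototype /
                    (np.linalg.norm(prompt_vec) * np.linalg.norm(rule.prototype)))
        return sim >= threshold
    return False

def routed_context(rules: list[Rule], prompt: str, prompt_vec: np.ndarray) -> str:
    admitted = [r.text for r in rules if admit(r, prompt, prompt_vec)]
    return "\n".join(admitted)                  # only the contextually relevant slice is injected
```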

~/projects/UCAI
27★

UCAI

JavaScript · MIT · Claude Code plugin, marketplace-installable · v2.2 with programmatic phase enforcement.

A Claude Code plugin that solves the same problems as community frameworks (GSD, BMAD, Ralph, Agent OS) — but using Claude Code's native architecture instead of fighting it. v2.2 adds programmatic phase enforcement via a companion contingency engine.

~/projects/inflate

inflate

Go · TUI that sits next to Claude Code · multi-provider (Anthropic, DeepSeek, OpenAI-compatible, Gemini, local Ollama) · live provider switch.

Type a fragment, get a context-loaded prompt for Claude Code on your clipboard. Inflate harvests git status, shell history, your editor's open file, the current Claude Code session and any running dev tools in parallel, then asks an LLM you choose to inflate the fragment into a structured prompt. ? then p cycles between cloud providers and any installed Ollama model with no restart. Native /api/chat for Ollama, think:false and num_ctx opt-in to keep local models honest about truncation. Streams the inflated prompt as it generates.
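inflate itself is Go, but the harvest step's shape translates directly; a Python sketch with placeholder commands and paths, not the tool's actual source list:

```python
"""Parallel context harvest, sketched in Python (inflate is Go): gather sources
concurrently, then hand the bundle plus the user's fragment to the selected provider.
Commands and paths below are placeholders."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

SOURCES = {
    "git_status":    ["git", "status", "--short"],
    "recent_shell":  ["tail", "-n", "20", "/home/me/.zsh_history"],   # placeholder path
    "dev_processes": ["pgrep", "-fl", "node|uvicorn|vite"],
}

def run(cmd: list[str]) -> str:
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=2).stdout
    except Exception:
        return ""                                  # a missing source never blocks the prompt

def harvest() -> dict[str, str]:
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run, cmd) for name, cmd in SOURCES.items()}
        return {name: f.result() for name, f in futures.items()}

context = harvest()
prompt = "Inflate this fragment using the context below:\n" + "\n".join(
    f"## {k}\n{v}" for k, v in context.items()
)
```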

~/projects/Pragma

Pragma

Python · MIT · Claude Code plugin + CLI · pytest, Vitest, Jest · three layered tiers — AST, coverage, LLM judge.

Watches every test file your AI assistant writes and blocks the edit when the test is gamed. Three tiers, each catching what the previous one misses. Tier 1 — fast deterministic AST classifier (~10 ms): tautologies, mocked-away targets, swallowed exceptions, conditional assertions, monkeypatched fakes, vi.mock / jest.mock on default exports. Tier 2 — runs the test under coverage and asks whether the target's lines actually executed; if not, it isn't a test. Tier 3 — a small LLM (DeepSeek by default, any OpenAI-compatible endpoint, Ollama works) reads test + production code and decides whether the test verifies behaviour or just confidently asserts on its own mocks. Production target is inferred from imports, expected outcome from the test name. Zero config to start.
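To make Tier 1 concrete, here is one of its checks as a sketch: a deterministic AST pass that flags constant-only assertions. The real classifier covers far more patterns than this.

```python
"""Tier-1 shape: flag tautological asserts with a plain AST walk.
One check only; mocked-away targets, swallowed exceptions and the rest need their own passes."""
import ast

def tautological_asserts(source: str) -> list[int]:
    """Return line numbers of asserts whose test is a bare constant (e.g. `assert True`)."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Assert) and isinstance(node.test, ast.Constant)
    ]

bad = "def test_ok():\n    assert True\n"
assert tautological_asserts(bad) == [2]
```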

~/projects/cc-session-browser

cc-session-browser

Python · MIT · FastAPI + vanilla JS, no build step · local-first single-machine web UI · Raycast-shaped retrieval over hundreds of Claude Code transcripts.

Reads ~/.claude/projects/<cwd>/<sid>.jsonl directly — search, date-grouped contact-sheet browsing, one-click resume across every project on the machine. Built because claude --resume only shows the most recent few per project and after a few weeks you have hundreds. Real-activity sort by transcript mtime (catches autonomous work, not just last user message), staleness analyzer that flags abandoned/old/deleted-project sessions, two-step archive with restorable sidecar, mobile-friendly over a private mesh. Nothing leaves the machine.
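The core of the listing is small; a sketch that follows the path layout above, with each session reduced to what mtime and the first transcript line give you (the timestamp field name is an assumption):

```python
"""Minimal version of the listing: newest-activity-first across every project.
Path layout follows the description above; the timestamp field is an assumption."""
import json
from pathlib import Path

def sessions(root: Path = Path.home() / ".claude" / "projects"):
    files = sorted(root.glob("*/*.jsonl"), key=lambda p: p.stat().st_mtime, reverse=True)
    for f in files:
        with f.open(encoding="utf-8") as fh:
            first = fh.readline()
        started = json.loads(first).get("timestamp", "?") if first.strip() else "?"
        yield {"project": f.parent.name, "session": f.stem,
               "started": started, "last_activity": f.stat().st_mtime}

for s in list(sessions())[:10]:       # the ten most recently active, regardless of project
    print(s["project"], s["session"], s["started"])
```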

~/projects/slim-mcp

slim-mcp

TypeScript · MIT · npm-installable MCP proxy · 72–77 % token reduction at 100 % accuracy benchmarked against Claude Sonnet 4.

An MCP proxy that gives agents their context window back via schema compression and lazy tool loading. Validated benchmark: 72–77 % token reduction at 100 % task accuracy, measured across 57 tools spanning 4 servers (120 API calls).
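slim-mcp is TypeScript, but what "schema compression" means here fits in a few lines of any language; a Python sketch with illustrative field choices, keeping only what an agent needs in order to choose a tool before the full schema is lazily loaded:

```python
"""Schema compression, sketched: keep the fields an agent needs to *choose* a tool,
drop the rest until the tool is actually invoked. Field choices are illustrative."""

def compress(tool: dict) -> dict:
    params = tool.get("inputSchema", {}).get("properties", {})
    return {
        "name": tool["name"],
        "description": (tool.get("description") or "")[:140],  # first sentence or so
        "params": list(params),                                 # parameter names only, no nested schema
    }

full = {"name": "search_issues",
        "description": "Search issues across repositories. Supports filters...",
        "inputSchema": {"type": "object",
                        "properties": {"query": {"type": "string", "description": "..."},
                                       "repo": {"type": "string"}}}}
assert compress(full)["params"] == ["query", "repo"]
```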

~/projects/zeroshot-detect
live

zeroshot-detect

Python · MIT (model: Apache-2.0) · Gradio HF Space (CPU) · OWLv2 200M params, server-side overlay, per-label NMS.

Drop in any image. Type any English noun (or several: "a hat, a dog, a coffee cup"). Get bounding boxes drawn on the image — no class list, no fine-tuning. Server-side overlay so the screenshot you take IS the demo output. Same engineering shape as paperQA: ADRs, CI matrix, mypy strict, offline test suite.
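The inference path in miniature, assuming the Hugging Face transformers OWLv2 classes and the public base checkpoint; the model id and threshold here are illustrative, not the Space's exact configuration:

```python
"""Zero-shot detection with OWLv2 via transformers; checkpoint and threshold are
assumptions for the sketch, not the deployed app's configuration."""
import torch
from PIL import Image
from transformers import Owlv2ForObjectDetection, Owlv2Processor

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("street.jpg")
labels = [["a hat", "a dog", "a coffee cup"]]          # any English nouns, no class list

inputs = processor(text=labels, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs, threshold=0.25, target_sizes=torch.tensor([image.size[::-1]])
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(labels[0][int(label)], round(float(score), 2), [round(float(v)) for v in box])
```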

~/projects/paperQA
live

paperQA

Python · MIT · Gradio HF Space (CPU) + HF Inference API (Qwen 2.5) · faithfulness 1.000, must-cite 0.833 on 6-question gold set.

Upload one or many PDFs. Ask a question. Get an answer that cites the exact file and page it came from — and a deterministic post-hoc grounding check that drops citations whose claimed numbers don't appear on the cited page.

Headline numbers on the bundled gold set (Attention Is All You Need, 6 questions): citation faithfulness 1.000, must-cite rate 0.833, recall@3 0.833. Each architecture change ships with the measured delta it produced.
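The grounding check is deterministic and small enough to show; a sketch where the citation-marker format and the number normalisation are simplified stand-ins for the real pipeline:

```python
"""Post-hoc grounding, in miniature: a citation survives only if the numbers it
claims actually appear on the cited page. Marker format and normalisation are simplified."""
import re

NUM = re.compile(r"\d+(?:\.\d+)?")

def grounded(claim: str, page_text: str) -> bool:
    body = re.sub(r"\[p\.\d+\]", "", claim)        # don't count the citation marker itself
    claimed = set(NUM.findall(body))
    on_page = set(NUM.findall(page_text))
    return claimed <= on_page                      # every claimed number must appear on the page

answer = "The base model uses 6 encoder layers and d_model of 512 [p.3]."
page3 = "we stack N = 6 identical layers; all sub-layers produce outputs of dimension 512"
assert grounded(answer, page3)
assert not grounded("It uses 48 layers [p.3].", page3)
```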

// Try paperQA inline

// free CPU tier · cold-start 30–60s after a quiet period · subsequent questions are fast

// Try zeroshot-detect inline

// drop an image · type one or more nouns (e.g. a lion, a tiger, a zebra) · threshold ≈ 0.25 · first inference downloads OWLv2 (~30–60s); subsequent clicks 5–15s

// Looking for

// Get in touch