Context as Cognitive Substrate
A framing concept: cognition, biological or artificial, is bounded by the context the cognitive engine has access to. Humans and large language models are two implementations of the same kind of cognitive engine, working the same class of problems. The bottleneck for both is the spec: the completeness of the context fed in. Most projects, agent systems, and human collaborations fail not because the engine is dumb, but because the input was incomplete.
This page makes claims at three levels of strength. The reader should track which level a given claim sits at before judging it.
- Cognitive (strong): Spec discipline + lossless context propagation produces measurable functional convergence between cognitive engines, substrate-independent.
- Operational (defensible, bounded): That convergence translates to reliable productivity gains in codifiable, bounded-stakes, short-feedback-loop tasks. Outside that subset, the claim weakens.
- Economic (contestable): Productivity gains in the operational layer translate to specific labor-market and rent-distribution outcomes. This depends on institutions, not cognition.
The thesis is strongest at the cognitive layer and becomes progressively more contestable as it descends.
What "spec" means
A spec is the complete state of context that a cognitive engine needs to act correctly on a task. Whatever form that takes — text, files, hooks, tool outputs, the engine's own training distribution, cached state — counts as part of the spec. Two specs that produce the same outcome on the same task across the same input distribution are the same spec, regardless of surface form. The form is incidental; the encoded state is what's load-bearing.
Completeness is not absolute. It is task-relative: the spec is complete enough when no remaining unenumerated branch of the possibility graph would change behavior on the input distribution actually faced. That is asymptotic, not absolute. Asking for an absolute completeness metric is a category error. Completeness is a direction, not a destination — and it is socially negotiated, not logically achieved (the team agrees the walk is sufficient because the deadline fires, the budget runs out, the stakeholder signs off, or the outcome arbitrates).
A technical stopping rule sits underneath the social one: marginal spec yield. Stop expanding the spec when an additional branch of context no longer shifts the output distribution beyond a task-relative tolerance. Below that yield, further specification is over-specification — analysis paralysis dressed as discipline. Above it, you are still in the productive walk. This turns "asymptotic completeness" from a vibe into a measurable stopping condition that does not depend solely on deadlines.
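A minimal sketch of how that stopping rule could be operationalized, assuming an evaluation harness that can re-run the engine over a sample of real inputs and score how far two output sets diverge. The function names, the ordering assumption, and the 0.05 tolerance are illustrative, not part of the thesis.

```python
# Sketch of the marginal-yield stopping rule. `run_engine` and `divergence` are
# placeholders for whatever evaluation harness you already have; the tolerance
# is task-relative and illustrative.

from typing import Callable, Sequence

def expand_until_marginal_yield(
    spec: str,
    candidate_branches: Sequence[str],   # context you could still add, ordered by expected importance
    inputs: Sequence[str],               # sample of the input distribution actually faced
    run_engine: Callable[[str, Sequence[str]], list],
    divergence: Callable[[list, list], float],   # 0 = identical output distributions
    tolerance: float = 0.05,
) -> str:
    """Keep adding context while it still shifts behavior; stop at the first
    branch whose effect falls below the task-relative tolerance."""
    baseline = run_engine(spec, inputs)
    for branch in candidate_branches:
        trial_spec = spec + "\n\n" + branch
        trial = run_engine(trial_spec, inputs)
        if divergence(baseline, trial) <= tolerance:
            break                        # marginal spec yield exhausted: over-specification starts here
        spec, baseline = trial_spec, trial
    return spec
```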
The spec is also receiver-relative. A spec complete for one engine (one model family, one training distribution) may be lossy for another. There is no universal spec. The discipline is "write a spec calibrated to a known receiver, and update when the receiver changes."
The convergence claim
Common arguments that humans are categorically different from LLMs reduce sharply under inspection:
- Memory. LLMs now have persistent memory via files, vector stores, MCP servers, and hooks. Architecture is closing the gap fast, and unlike humans, every instance receives the upgrade simultaneously.
- Compliance. LLMs are not blindly compliant; they execute on whatever context they receive. With engineered scaffolding (promptism, rules files, hooks), a well-prompted LLM can be more controllable than a human collaborator, who drifts and "uses judgment."
- Skin in the game (system level). Liability does not need to live inside the cognitive engine itself; it lives in the surrounding system. EU AI Act, GDPR, and emerging frameworks attach consequences to the deployer/creator chain. By the same logic that an individual human's mistakes are absorbed by surrounding social and legal structures, an LLM's mistakes are absorbed by the operator. The operative question is "does bad output cost someone something" — and since 2024, the answer is yes. The model lacks an intrinsic survival drive, but the system around it has one.
- Training (mechanism continuous, coupling categorically different). Both systems update internal state based on prediction error against prior input, then reproduce patterns from their training distribution. Both have a static-corpus phase (schooling / pretraining) and a feedback phase (lived experience / RLHF). The mathematical mechanism (gradient descent on prediction error) is continuous. What differs categorically is coupling-to-stake: biological learning is intrinsically valenced (the error signal is constituted by the organism's stake in continued existence); machine learning is extrinsically valenced (the loss function is a designer's choice, detached from any intrinsic goal of the system). This is a real difference, but it does not block functional convergence — it relocates the convergence question to whether functional output can match in the absence of intrinsic stake.
- Determinism. Both are path-dependent state machines. Every prior input narrows the next. Apparent free will is a UX illusion on top of a branching state tree.
The remaining genuine asymmetries are missing default context. The right framing is not "blank slate vs. priors" — LLMs arrive saturated with distributional priors from their training corpus. The asymmetry is valence coupling: human priors are intrinsically tied to survival and error signals (a refund-too-large feels career-ending before any rule fires); LLM priors are statistical and extrinsically weighted (the loss function was a designer's choice, not the model's stake). Two specific layers where this shows up:
- Somatic priors. Pain, fear, fatigue, social shame, hunger — biological defaults that auto-weight context without specification. Evolutionary programming, intrinsically valenced. LLMs have distributional priors but no valenced error-signal coupling. The spec must simulate stake-weighting; it cannot instantiate it.
- Self-generated meta-cognitive triggers. Humans deploy hierarchical decomposition, self-monitoring, backward chaining, and strategy selection from internal triggers (discomfort, uncertainty signals) without being asked. LLMs require external triggering. The thesis on this point now rests on efficiency, not capability: scaffolding closes the functional gap; humans trigger more cheaply and across novel domains without re-triggering.
Both asymmetries are programmed, not architectural — humans were trained into them over decades of biological and social feedback; LLMs are trained into them via RLHF, reasoning distillation, and explicit scaffolding (the Cognitive Foundations literature shows up to 60% performance gain on ill-structured problems when meta-cognitive sequences are explicitly scaffolded; Anthropic's 2026 negotiation work and multi-agent self-play already demonstrate spontaneous-seeming strategy selection under the right training regime). The remaining engineering delta is cross-domain transfer without re-triggering: humans generalize their meta-cognition across novel contexts; current LLMs need the spec or scaffolding to activate it. That is an engineering gap on a closing trajectory, not a categorical wall.
Functional convergence vs. ontological identity. The thesis claims functional convergence: under sufficiently complete context, LLMs and humans produce comparably effective outputs across an expanding domain of tasks. It does not claim ontological identity. Whether the two systems share intrinsic intentionality, consciousness, embodied valuation, or "aboutness" in the philosophical sense is unresolved — and may not be an engineering problem at all. The thesis brackets the hard problem of consciousness. What matters operationally is that functional outputs converge. Whether the bridge to functional output is also a bridge to the same kind of mind is a question this thesis does not answer.
Architecture migration: what the spec actually does
As LLMs become more capable, the spec carries more and more of what biology used to provide for free. It does three jobs at once:
- Surrogate nervous system (functional, not instantiation). Encoding risk gradients the model lacks. The spec simulates risk-weighting; it does not instantiate care. This is sufficient for bounded domains. It is not a claim that the model has acquired stake in the outcome.
- Surrogate executive function. Explicitly triggering verification loops, hierarchical decomposition, backtracking, and strategy selection that humans deploy from internal triggers.
- Rewording-robust enumeration. Hardcoding semantic distinctions and edge cases, since LLMs generalize on textual similarity rather than meaning (the LLMs Do Not Simulate Human Psychology finding: minor semantic rewordings shift human moral judgments but leave LLM outputs nearly unchanged).
The implication is structural: language-encoded cognitive architectures are becoming substrate-portable. The spec carries the architecture. The LLM is the executor. The architecture is increasingly load-bearing; the engine is increasingly fungible. The engine still does combinatorial generalization over the architecture's encoded state space — that's not nothing, and it's where humans and LLMs both still operate — but architecture itself has migrated from biology-implicit to spec-explicit.
This makes spec authorship more, not less, important. Spec completeness is more load-bearing for LLMs than for humans precisely because LLMs carry less default context. For autonomous agents, spec authorship is not just leverage — it is a safety surface.
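As a concrete picture of the three jobs above, here is a hypothetical spec structure in which risk gradients, verification triggers, and edge-case enumeration travel explicitly with the task. The field names and example values are illustrative, not a schema this page prescribes.

```python
# Sketch of a spec that carries the three jobs explicitly. Field names are
# illustrative; the point is that risk weighting, meta-cognitive triggers, and
# edge-case enumeration travel with the spec instead of living in biology.

from dataclasses import dataclass, field

@dataclass
class Spec:
    task: str
    # Surrogate nervous system: risk gradients the engine does not feel.
    risk_weights: dict[str, float] = field(default_factory=dict)
    # Surrogate executive function: verification loops triggered explicitly.
    checkpoints: list[str] = field(default_factory=list)
    # Rewording-robust enumeration: semantic distinctions hardcoded.
    edge_cases: list[str] = field(default_factory=list)

refund_spec = Spec(
    task="Issue customer refunds",
    risk_weights={"refund_over_limit": 0.99, "duplicate_refund": 0.9, "typo_in_memo": 0.1},
    checkpoints=[
        "Re-derive the refund amount from the invoice before acting",
        "If any assumption is unverified, stop and escalate",
    ],
    edge_cases=[
        "'Refund the order' and 'refund the item' are different amounts",
        "Partial refunds on bundled SKUs require the bundle price table",
    ],
)
```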
Failure mode: lossy context propagation between cognitive nodes
Whenever cognition is distributed across multiple nodes — two humans, two LLM agents, one human and one agent, or many of each — context must be propagated between them. Every propagation event is an opportunity for loss and an attack surface. The same failure mode appears regardless of substrate, and each failure mode is also an exploit:
| Symptom | Generic form | Human↔human | Human↔LLM | LLM↔LLM | Adversarial use |
|---|---|---|---|---|---|
| Lossy summary | Compressed digest of sender's state | Email up the chain | Thin system prompt | Truncated agent handoff | Hide critical safety check in compressed-away portion |
| Wrong priors active | Receiver fills gaps with wrong defaults | "What did the boss mean?" | Wrong default behavior | Cascading agent confusion | Inject context that activates training-distribution biases |
| Over-compression | Density too low for task | Two-line exec brief | Token-budget cuts | Summary-only memory | DoS by forcing spec to omit what matters |
| Missing detail | Sender omitted critical fact | Schema not mentioned | Missing field in spec | Tool description missing precondition | Adversary knows the missing precondition and exploits it |
| Confabulation under uncertainty | Receiver invents to fill gap | Middle layer invents intent | LLM hallucinates plausibly | Agent fabricates tool output | Craft ambiguity to reliably trigger hallucination in adversary's direction |
The pattern is substrate-independent. Most failure attributed to "the engine being dumb" is in fact a failure of context handoff between engines. Most adversarial pressure on cognitive systems will target handoffs precisely because they're high-leverage, low-visibility failure points.
The defensive principle: mutual distrust between cognitive nodes. No node treats upstream context as authoritative without verification. Every propagation event includes provenance, integrity checks, and contextual binding (a spec for refunds cannot be replayed as a spec for user provisioning). Specs include explicit precedence ordering on conflicting priors and explicit escalation paths when no precedence applies — silent fallback is a vulnerability, not a feature.
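A minimal sketch of verification-on-receipt between two nodes, under the assumption that context travels in a small envelope carrying provenance, an integrity digest, a task binding, and an explicit precedence value. A real deployment would use signed digests and richer precedence handling, so treat the fields and checks as illustrative.

```python
# Sketch: a context handoff that is verified, not trusted. The envelope fields
# are illustrative; a bare hash is a placeholder for a real signature.

import hashlib
from dataclasses import dataclass

@dataclass
class ContextEnvelope:
    sender: str            # provenance: which node produced this context
    task_binding: str      # what the context is for; blocks replay on another task
    precedence: int        # explicit ordering when priors conflict
    payload: str
    digest: str            # integrity check over the payload

def seal(sender: str, task_binding: str, precedence: int, payload: str) -> ContextEnvelope:
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return ContextEnvelope(sender, task_binding, precedence, payload, digest)

def accept(env: ContextEnvelope, expected_task: str, trusted_senders: set[str]) -> str:
    """Receiver-side mutual distrust: verify provenance, integrity, and binding
    before the context is allowed to influence behavior."""
    if env.sender not in trusted_senders:
        raise PermissionError(f"unknown sender {env.sender!r}; escalate, do not guess")
    if hashlib.sha256(env.payload.encode()).hexdigest() != env.digest:
        raise ValueError("payload modified in transit; escalate")
    if env.task_binding != expected_task:
        raise ValueError(f"context bound to {env.task_binding!r}, not {expected_task!r}; refuse the replay")
    return env.payload

def resolve(a: ContextEnvelope, b: ContextEnvelope) -> ContextEnvelope:
    """Conflicting priors: explicit precedence decides; a tie escalates
    rather than falling back silently."""
    if a.precedence == b.precedence:
        raise RuntimeError("no precedence applies; escalate to the spec author")
    return a if a.precedence > b.precedence else b
```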
What the framework cannot capture: the metis boundary
The framework is a discipline for codifiable cognition. It does not claim to capture metis — the situated, embodied, tacit knowledge that resists explicit specification (Polanyi: "we know more than we can tell"; Scott: legibility is produced by simplification, not discovered by diligence).
This is a hard scope condition, not a temporary limitation. Some load-bearing knowledge — the craftsman's wrist, the negotiator's pause, the editor's ear for tone, the sensor's read of an unspoken room — exists only in situated context and cannot be losslessly transcribed into a spec. To make it explicit is not to extract it; it is to transform it into something legible but lesser, and often useless for the original task. The framework, applied dogmatically, will systematically bias toward what can be inscribed and dismiss what cannot — the high-modernist failure mode (scientific forestry, collectivization, Brasília).
Honest scope:
- The framework predicts compression in the techne domain — codifiable, observable, separable tasks. It does not predict the same in the metis domain.
- The "sensor" role is a legibility translator, not a metis bridge. The sensor's report is already one step removed from the embodied judgment that produces it. The framework does not solve this; it inherits the limitation.
- Some tasks will resist the framework permanently — not because specs are hard to write, but because the load-bearing knowledge is irreducible to inscription. In those tasks, "spec authorship" is a category error, and the right move is not the framework. Iterate, prototype, apprentice — don't specify.
The framework's correct claim is narrower than the manifesto version: for the codifiable subset, complete specs and clean propagation produce measurable convergence and operational gains. Outside that subset, the framework is silent at best and actively misleading at worst.
Implication for organizations (a special case)
Organizations are one specific topology of cognitive nodes connected by context-propagation channels. The general failure mode produces a familiar set of symptoms in that topology:
- Each layer up an org chart re-prompts a fresh cognitive engine with worse context than the layer below.
- Bullet-point summaries, exec digests, and "TL;DRs" are not compression — they are lossy resampling. Cutting context is a painkiller for a problem that never gets solved.
- "Email-tier specs" cause projects to take 5× longer than they should, because every missed detail becomes a debugging session later in execution.
Political dysfunction is a special case of the general failure mode: when a CEO doesn't want to know what the spec author knows, the propagation channel is being adversarially compressed by an actor with misaligned incentives. The cognitive-substrate lens describes that the channel is throttled. Institutional analysis (Acemoglu and others) explains why the throttler has incentives to do so. The two lenses are not in competition; they describe the same failure mode at different levels of abstraction.
The post-LLM organization isn't flatter; it's state-aware. Context is treated as executable state, not compressible narrative:
- Context is version-controlled, not summarized.
- Handoffs are explicit state transfers, not TL;DRs.
- Compression is treated as a failure mode, not a virtue.
- Meta-cognitive checkpoints are baked into routing — verify assumptions before propagation.
The fix in any cognitive system is the same: drive the spec to completeness before propagation, not after execution. Completeness up front dominates clever propagation later, regardless of how many or what kind of nodes the system contains.
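A sketch of what treating context as executable state could look like, assuming an append-only store where every handoff is a full state transfer, logs a diff, and is blocked by unresolved assumptions. The store and the checkpoint are illustrative, not a prescribed tool.

```python
# Sketch: context treated as versioned state rather than compressible narrative.
# The store, the diff, and the assumption check are illustrative.

from dataclasses import dataclass, field

@dataclass
class ContextStore:
    versions: list[str] = field(default_factory=list)   # append-only; nothing is summarized away

    def commit(self, full_context: str) -> int:
        self.versions.append(full_context)
        return len(self.versions) - 1                    # version id recorded with the handoff

    def diff(self, a: int, b: int) -> set[str]:
        """Lines that differ between two versions: the handoff log, not a TL;DR."""
        return set(self.versions[a].splitlines()) ^ set(self.versions[b].splitlines())

def hand_off(store: ContextStore, version: int, open_assumptions: list[str]) -> str:
    """Meta-cognitive checkpoint baked into routing: unresolved assumptions block
    propagation instead of being silently compressed out."""
    if open_assumptions:
        raise RuntimeError(f"verify before propagating: {open_assumptions}")
    return store.versions[version]                       # the full state is transferred
```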
Possibility as prediction: the assumption-killing move
The deepest discipline implied by this thesis is not building better protocols around incomplete specs. It is making the spec genuinely complete in the first place — and being honest that "complete" is asymptotic.
Two operations turn out to be the same operation:
- Prediction. Walk the graph forward from current state, enumerate possible futures, weight them by probability.
- Killing assumptions. Walk the graph backward from current state, refuse to prune branches that "obviously can't matter," surface every edge — including weak ones.
Both are the same act: refusing to trust the local frame and treating the world as fully-connected by default. Sometimes the connection is a strong edge (refund → bankruptcy). Sometimes it is a weak edge that may not even fire (a typo in a log file → six weeks later someone misreads it). Both are real. Most cognitive failure comes from pruning weak edges to save effort.
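A toy rendering of the two operations as one graph walk, with a made-up possibility graph. The probabilities are illustrative; the point is only that weak edges are enumerated rather than pruned.

```python
# Toy sketch: prediction and assumption-killing as the same graph walk.
# The graph and probabilities are made up; no branch is dropped for being unlikely.

def walk(graph: dict[str, list[tuple[str, float]]], start: str, p: float = 1.0,
         path: tuple[str, ...] = ()) -> list[tuple[tuple[str, ...], float]]:
    """Enumerate every reachable future from `start`, weighted by probability."""
    path = path + (start,)
    futures = [(path, p)]
    for nxt, edge_p in graph.get(start, []):
        futures += walk(graph, nxt, p * edge_p, path)
    return futures

possibility_graph = {
    "issue refund": [("customer made whole", 0.95), ("refund too large", 0.05)],
    "refund too large": [("caught by reconciliation", 0.9), ("quarter-end cash crunch", 0.1)],
}

# Weak edges (here, a 0.005 branch) stay in the enumeration instead of being pruned.
for path, prob in sorted(walk(possibility_graph, "issue refund"), key=lambda x: -x[1]):
    print(f"{prob:.4f}  " + " -> ".join(path))
```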
Humans prune aggressively because biology pre-loads "this branch is unlikely" into the somatic prior: fatigue, satisficing, social-cost heuristics. LLMs do not. For an LLM, possibility-space exploration must be made explicit, and that is a feature, not a bug. The absence of somatic priors forces walks that biology lets humans skip.
A spec that has truly killed its assumptions has, by construction, walked enough of the graph that the priors fall out of the walk. Downstream safety mechanisms — reflexive guards, evaluator passes, audit logs — are permanent infrastructure, not temporary scaffolding. Specs are never complete; residue layers catch the rest. Defense in depth is the operating norm, not a fallback for failed completeness.
The leverage move
Spec authorship — driving a specification to asymptotic completeness — is a rare skill. Most knowledge workers, including most software developers, do not practice it. LLMs make spec authorship close to free for the people who already have the skill, which widens, rather than narrows, the gap between strong and weak spec authors.
The "AI will democratize coding" framing misses the point: LLMs democratize execution, not authorship. Authorship is where cognitive leverage lives.
Distinguish cognitive leverage from economic leverage. Cognitive leverage — moving more output per unit time than before — is real and substrate-bound. Economic leverage — capturing the surplus that cognitive leverage produces — is institutionally mediated. A spec author employed by a platform owner has the cognitive leverage but not the economic leverage. Rent capture depends on property rights, contracting norms, and bargaining power, not on cognitive architecture. The page makes the cognitive claim and brackets the economic one; the latter is a political-economy question, not a cognitive one.
The durable cognitive role is composed of four functions:
- Sensor. Observation outside the computer. LLMs see only what is piped in. Real-world state, hardware state, social state, internal state — none of these are on the wire unless someone puts them there. The sensor is also a legibility translator and inherits all the limits of legibility-making.
- Spec author. Completeness as discipline. Drive the question "what else?" until exhaustion or external stop.
- Adversarial reviewer. A function (rotated, audited, sometimes adversarial-by-default) whose job is to attack the spec before it ships. The spec author is also a cognitive node with blind spots; the reviewer is the structural answer to "who watches the walker?" Without this role, the spec author's blind spots become the system's blind spots.
- Dispatcher. Route specs to the right engine — biological or silicon — and re-route when one fails.
Execution, retrieval, generation, and even reasoning are becoming commodities. Sensor + author + reviewer + dispatcher remain scarce.
Net cognitive ROI. Cognitive leverage only translates to operational gain when it earns its overhead. The honest engineering economics: (output quality gain × task value) / (scaffolding cost + propagation cost). Scaffolding adds tokens, latency, and engineering effort. If the cost exceeds the gain, convergence is theoretical — the spec discipline didn't pay for itself on this task. The framework's operational claim is not "always specify more"; it is "specify until marginal yield falls below cost." Below that threshold, ship and iterate. This keeps spec discipline grounded in Tuesday-morning economics rather than asymptotic ideology.
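The same arithmetic as a two-case worked example; the numbers are made up to show the decision the ratio drives, not benchmarks.

```python
# Sketch: net cognitive ROI as a ship-or-specify decision. All numbers are made up.

def net_roi(quality_gain: float, task_value: float,
            scaffolding_cost: float, propagation_cost: float) -> float:
    return (quality_gain * task_value) / (scaffolding_cost + propagation_cost)

# High-value task: deeper specification still pays for itself.
print(net_roi(quality_gain=0.30, task_value=50_000, scaffolding_cost=2_000, propagation_cost=500))  # 6.0

# Throwaway script: the same discipline costs more than it returns -> ship and iterate.
print(net_roi(quality_gain=0.30, task_value=400, scaffolding_cost=2_000, propagation_cost=500))     # ~0.05
```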
Falsifiable predictions and Lakatosian framing
Different parts of this thesis have different scientific status. Honest labeling matters more than false rigor.
Falsifiable theory (the headcount prediction)
By 2034, in jurisdictions with permissive AI regulation and adoption thresholds met (≥50% of routine tasks delegated to autonomous or semi-autonomous LLM agents with documented spec-handoff protocols, verifiable by audit log), in domains where (a) tasks are codifiable from existing documentation, (b) stakes per individual decision are bounded, (c) feedback loops are ≤30 days — knowledge-work headcount in the in-scope functions will compress by 5–10× from 2024 baseline (revenue-adjusted FTEs).
In jurisdictions with precautionary regulation and strong labor institutions, compression is bounded to 1.5–3× with augmentation dominating substitution. Globally bimodal: frontier firms compress hard; legacy sectors barely move.
Falsification conditions (binding)
- Median in-scope compression <5× by end-2034 falsifies the strong form.
- Compression dominated by displacement to contractor/offshore labor (>30% reabsorption) treats the prediction as untested.
- Macroeconomic recession (NBER-defined, ≥2 quarters in measurement window) treats the prediction as untested.
- "So-so automation" (productivity growth in scope <1.5% annually through 2030) falsifies the causal claim even if headcount drops happen.
- Uniform compression across jurisdictions regardless of regulatory/labor differences falsifies the institutional-mediation claim.
- Compression that arrives only in 2035–2036 falsifies the timeline but supports the directional thesis.
Excluded from the prediction: strategy, novel design, negotiation, executive judgment, frontier R&D, court advocacy. These are open-ended domains where the metis-boundary applies and the thesis does not claim them.
Convergence-falsifiers (binding)
- If, by 2030, frontier LLMs paired with maximally scaffolded specs systematically fail at abductive reasoning on genuinely novel domains (out of training distribution, no analog cases), and the failure does not close with model scale or spec depth — convergence is false.
- If insight phenomena (the "aha" jump as activation-space restructuring under contradiction, measurable via intervention studies) are demonstrated to be qualitatively absent in LLMs even under maximal scaffolding — convergence is false.
- If LLMs cannot, even under maximal scaffolding, override the spec's apparent authority when the spec itself is part of the problem (Cognitive's epistemic autonomy test) — convergence is false at the architectural level.
Operational metric for "convergence": for a representative bounded task class, blinded expert panel scores LLM+complete-spec output within 1 SD of human-expert baseline on novelty, feasibility, and alignment. If LLM output scores >1 SD lower with spec completeness verified ex-ante by a third party, "missing context" cannot explain the gap and convergence is falsified for that class.
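A sketch of the check that metric implies, assuming blinded panel scores are already collected per criterion. The score arrays are placeholders, and the one-sided test mirrors the falsification condition (more than 1 SD below baseline).

```python
# Sketch: the convergence check from the metric above. Panel scores are assumed
# to be blinded and per-criterion; the arrays here are placeholders.

from statistics import mean, stdev

def converged(human_scores: list[float], llm_scores: list[float]) -> bool:
    """Convergent when the LLM+complete-spec mean panel score is not more than
    one standard deviation below the human-expert baseline."""
    baseline, spread = mean(human_scores), stdev(human_scores)
    return mean(llm_scores) >= baseline - spread

# Per-criterion check over novelty, feasibility, alignment (illustrative scores).
panel = {
    "novelty":     ([7.1, 6.8, 7.4, 6.9], [6.9, 7.0, 7.1, 6.8]),
    "feasibility": ([8.0, 7.6, 7.9, 8.2], [7.8, 8.1, 7.5, 7.7]),
    "alignment":   ([7.5, 7.2, 7.8, 7.4], [6.1, 5.9, 6.3, 6.0]),
}
for criterion, (human, llm) in panel.items():
    print(criterion, "converged" if converged(human, llm) else "falsified for this class")
```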
Lakatosian research program (the generative claim)
The generative claim — cognition is context-bound across substrates, and lossless context propagation produces convergence — is a research program, not a falsifiable theory. It generates falsifiable sub-claims (the predictions and experiments above) but is not itself bounded enough to be falsified directly. Failure of any single sub-claim doesn't kill the program; pattern of failure across sub-claims would. This is how productive science works (Newtonian mechanics rested on programmatic claims for centuries while generating falsifiable predictions); the honest move is labeling which parts are which, not pretending the generative claim is a sharp theory.
A weaker, already-confirmed prediction: a person who thinks in systems but cannot code can ship working software using LLMs. Confirmed since the GPT-3.5 era.
Operational distillation
For a working senior engineer who has to apply this on Tuesday, the framework reduces to three event-triggered rules:
- If a ticket has more than two acceptance criteria, list three edge cases or mark them out of scope. Pair with a quick red-team step: ask someone who didn't write the spec what the easiest way to break it is. Takes minutes; prevents hours of rework.
- If a handoff crosses a trust boundary or will be reused, treat context like a contract. Version it, validate on receipt, log the diff. Function-contract discipline applied to context.
- When a spec produces bad output, ask: was it the engine or the context? If the context, patch the spec template so the next instance catches it. No calendar ritual, no report — event-triggered habit.
The framework breaks predictably on (a) production incidents under time pressure (use it for the post-mortem, not the fix), (b) ambiguous stakeholder asks where the principal cannot specify the goal (rapid prototyping wins, not specification), and (c) legacy retrofit projects where the propagation cost exceeds the payoff (apply to new modules, not retrofit).
This is the version that survives Tuesday. The longer thesis above explains why it works.
Literature
Convergence (LLM and human cognition align under context)
- LLMs converge toward human-like concept organization — 220+ experiments.
- LLMs Show Signs of Alignment with Human Neurocognition During Abstract Reasoning.
- Reasoning Aligns Language Models to Human Cognition.
- LLMs Predict Human Memory.
Distributed-cognition framing (closest to the propagation-failure angle)
- Cognitive Workspace: Active Memory Management for LLMs — invokes Baddeley's working memory, Clark's extended mind, Hutchins' distributed cognition.
- Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management.
- Andy Clark and David Chalmers, The Extended Mind (1998); Edwin Hutchins, Cognition in the Wild (1995), the origin of the distributed-cognition framing. Pre-LLM scaffolding for the same concept.
Asymmetries the spec must compensate for
- LLMs Do Not Simulate Human Psychology — minor semantic rewordings drastically shift human moral judgments but leave LLM outputs nearly unchanged; LLMs generalize on token similarity, not semantic meaning. Implication: the spec must enumerate edge cases and hardcode semantic distinctions.
- Cognitive Foundations for Reasoning — humans deploy hierarchical / meta-cognitive nesting spontaneously; LLMs default to shallow forward chaining. Explicit scaffolding closes up to 60% of the gap on ill-structured problems. Implication: the spec must trigger meta-cognitive sequences, not assume them.
- LLMs Outgrow the Human Language Network — convergence is asymmetric; LLMs surpass humans on formal linguistic competence and continue past it. Functional competence (world knowledge, abstraction) follows a different curve.
Boundary conditions (what the framework cannot capture)
- Michael Polanyi, The Tacit Dimension (1966). "We know more than we can tell." The structural limit of explicit specification.
- James C. Scott, Seeing Like a State (1998). Legibility is produced by simplification. High-modernist schemes systematically destroy what they cannot inscribe. The framework, applied dogmatically, exhibits this failure mode.
Institutional layer (what cognitive analysis cannot answer)
- Daron Acemoglu, Power and Progress (2023) and related work on the direction of technical change. Whether cognitive convergence translates to substitution vs. augmentation, and who captures the gains, is institutionally mediated, not technologically determined.
The academic literature stops at the convergence observation. The operational implication — that complete-spec discipline and clean context propagation form a working system architecture rather than a metaphor — is the part that demands engineering, not further analysis.
Real-world observations
Two LinkedIn replies, observed 2026-05-05, that triangulate onto the framework above. Captured here as field evidence, not as authored claims.
Ryan Brandt — CTO @Heynoah.io (one year of an AI-everywhere engineering team):
What got better: throughput; junior ramp on unfamiliar codebases; code review with AI first-pass. What got worse: architectural memory (nobody holds the whole system in their head anymore); production debugging (the skill atrophies when you don't write the code by hand); onboarding (new hires don't develop instinct because the tool is doing the work the instinct gets built from). Net is still strongly positive. But the costs are real and most teams aren't naming them.
This maps directly to the metis boundary (Polanyi / Scott) flagged earlier on this page. Brandt is naming the legibility cost: when execution becomes commodity, the apprentice never builds the wrist. Architectural memory and prod-debugging instinct are exactly the kind of irreducible-to-spec knowledge the framework predicts will erode under aggressive substitution. The framework does NOT solve this — it inherits the limit. The honest add-on is: the metis erosion is a real engineering org-design problem, not a transitional friction. Either accept it as the cost of legibility, or engineer artificial apprenticeship loops where humans hand-execute deliberately, or bet that LLMs eventually carry the metis themselves. None of the three is free.
Ram Balasubramanian — Founder & Director @Tech-Aarvam (operational structure):
Write the specs, break the specs into smaller parts. Write test plans, and break the test plans into smaller parts — unit, module and integration test plans. Write safety and reliability rules: define measurables, safety constraints, and must-pass rules for each such sub-spec / unit-test plan. Also write integration specs, test plans, rules and measurables. For each of these smaller pieces, then use agentic workflows to accelerate; they can all come together. Have additional teams to build dashboards and visualisation views for results from all of the above. We can employ people and agents, but they all need to work together towards a common goal — and defining the goals, who does what, and their quality criteria and priorities need to be set up. A lot of software and ASIC development methodologies are here to stay; we accelerate each of the pieces. The orchestration needed, the quality and the safety bar, the rigour, continue to be a requirement. Structured problem solving is irreplaceable.
This is the operational implementation of the durable cognitive role described above (sensor + spec author + adversarial reviewer + dispatcher), recursively decomposed and applied at organizational scale. Specs decompose into unit/module/integration; safety/measurables travel alongside; agents accelerate each piece; humans hold the orchestration + quality bar. The framework's claim that authorship beats execution as cognitive leverage is, in Balasubramanian's structure, the operating model.
The tension between the two quotes is the interesting part. Brandt observes that practicing execution was load-bearing for building the metis (architectural memory grew out of debugging by hand). Balasubramanian's structure removes execution from humans entirely. If Balasubramanian is right, the next-generation engineer never builds Brandt's instinct in the first place. That is the legibility cost made concrete. The framework above predicts the convergence (LLMs match human output under complete spec); these two practitioners report what that convergence costs at the org-and-skill-formation layer.
Related
- bouncer — predicate-gated rule injection; the operational-discipline residue layer for this framework.
- Pragma — adversarial reviewer applied to AI-written tests; structural answer to "who watches the walker?"
- Portfolio — projects built under this discipline.