Context is Infrastructure, Not Instructions

What teams gain when they govern AI context like a software dependency

A team replaces task-specific prompts with a generic "improved" template. Extraction accuracy drops from 100% to 90%. RAG compliance (the degree to which a model's answers stay grounded in retrieved documents rather than generating from its own training data) falls from 93.3% to 80% ^[1]. The model is the same. The new instructions look better on paper. What changed was the context, and nobody tested whether the change was safe before deploying it.

This is context regression, a term borrowed from software engineering, where "regression" means a change that was supposed to improve something but degraded existing behavior instead. It behaves like any other dependency compatibility problem in a software supply chain, and the governance response (production contracts, risk-based test suites, and compatibility gates) is the same one software teams already use for their other dependencies.

"Context is the New Code" established context engineering as a formal discipline with its own taxonomy, maturity levels, and practitioner artifacts, and "The Turn as the Unit of Quality" explored how structured iteration with checklists and selective memory improves turn-level quality. This article picks up a different thread. What happens when context moves from a single team's configuration file to an organizational dependency serving dozens of agents across thousands of daily interactions? Recent research suggests that the teams making the fastest progress are the ones applying familiar software supply chain governance to their context, and the returns are measurable.

What Structured Context Unlocks

A study of 200 documented interactions across four AI tools found that incomplete context was associated with 72% of iteration cycles ^[2]. That number is worth sitting with. Nearly three-quarters of the rework, the back-and-forth where a human corrects, clarifies, and re-prompts, traced not to a bad model or a poorly worded instruction but to missing information that should have been available from the start.

When the same study introduced structured context assembly, a methodology that organizes context into five roles (Authority, Exemplar, Constraint, Rubric, and Metadata), iteration cycles dropped from an average of 3.8 to 2.0 per task, and first-pass acceptance rose from 32% to 55% ^[2]. Authority context establishes what standards govern the task. Exemplar context provides reference outputs that demonstrate the expected quality. Constraint context defines boundaries the output must respect. Rubric context specifies how the output will be evaluated. Metadata context supplies facts, dates, names, and domain-specific details. Having names for these roles is not a minor convenience. It is what separates ad hoc tuning from repeatable engineering, because a team that cannot describe what is missing from its context cannot systematically fix it.
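The five roles can be treated as a checklist that a tool verifies before a prompt is sent. The following is a minimal sketch, not the cited study's implementation: the `Role` enum mirrors the role names above, while `ContextBundle`, its methods, and the sample snippets are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The five context roles named in the text."""
    AUTHORITY = "authority"
    EXEMPLAR = "exemplar"
    CONSTRAINT = "constraint"
    RUBRIC = "rubric"
    METADATA = "metadata"


@dataclass
class ContextBundle:
    """Hypothetical container mapping each role to its context snippets."""
    items: dict[Role, list[str]] = field(default_factory=dict)

    def add(self, role: Role, text: str) -> None:
        self.items.setdefault(role, []).append(text)

    def missing_roles(self) -> list[Role]:
        # Roles with no snippets at all, in enum definition order.
        return [role for role in Role if not self.items.get(role)]

    def render(self) -> str:
        # Emit snippets grouped by role so the assembled context is auditable.
        lines = []
        for role in Role:
            for text in self.items.get(role, []):
                lines.append(f"[{role.name}] {text}")
        return "\n".join(lines)


# Illustrative snippets only; a real bundle would pull these from team artifacts.
bundle = ContextBundle()
bundle.add(Role.AUTHORITY, "Follow the team's API style guide.")
bundle.add(Role.CONSTRAINT, "Output must be valid JSON.")
bundle.add(Role.METADATA, "Service name: billing-api.")

print([role.name for role in bundle.missing_roles()])  # ['EXEMPLAR', 'RUBRIC']
```

Naming the gap ("no Exemplar, no Rubric") before the first turn is the point of the sketch: the missing information becomes something a script can report rather than something discovered through rework.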

Figure: Rows of server equipment in a data center, analogous to how structured context engineering creates organized, reliable infrastructure rather than ad hoc configurations.

Like a well-organized server room where every cable run is labeled and every rack follows a standard layout, structured context gives a team the ability to reason about what the AI is actually working with. The evaluation-driven iteration research reinforces this by showing that context quality is not one-dimensional ^[1]. A change that improves instruction-following can simultaneously degrade extraction accuracy. A prompt that scores better on helpfulness can score worse on format compliance. The minimum viable evaluation suite (MVES) framework proposes tiered evaluation requirements, one set for general applications, another for retrieval-augmented generation systems, and a third for agentic workflows, precisely because quality along one dimension does not guarantee quality along others ^[1]. The practical implication is that quality has multiple dimensions that can trade against each other, and navigating those trade-offs requires measurement infrastructure, not intuition.
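A small sketch shows how an aggregate score can hide a per-dimension regression. The dimension names echo the ones discussed above, but every number here is invented for illustration, and `find_regressions` is a hypothetical helper, not part of the MVES framework.

```python
from statistics import mean


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return dimensions where the candidate scores below the baseline
    by more than `tolerance`, mapped to (baseline, candidate) scores."""
    return {
        dim: (baseline[dim], candidate[dim])
        for dim in baseline
        if candidate[dim] < baseline[dim] - tolerance
    }


# Hypothetical scores (percent) for two prompt variants.
baseline = {"instruction_following": 80, "helpfulness": 70, "format_compliance": 95}
candidate = {"instruction_following": 85, "helpfulness": 90, "format_compliance": 85}

# The average rises, so a single aggregate metric would approve the change.
print(mean(candidate.values()) > mean(baseline.values()))  # True

# The per-dimension view exposes what the average hides.
print(find_regressions(baseline, candidate))  # {'format_compliance': (95, 85)}
```

Whether a ten-point drop in format compliance is acceptable is a product decision. The sketch only makes the trade-off visible so that the decision is made deliberately.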

Governing Context as a Dependency

The clearest articulation of this shift comes from research that frames LLM update management as a software supply chain governance problem ^[3]. Hosted language model services evolve through provider-side updates without explicit version changes, so the API endpoint stays the same while the behavior underneath shifts. Empirical work cited within that framework documents cases where code execution accuracy dropped from 52% to 10% within three months with no version change on the consumer side ^[7]. This is behavioral drift (a gradual, unannounced change in how a model responds to the same inputs), and it affects every piece of context that was tuned against the previous behavior.

The proposed governance framework has three components that map directly to established software engineering practice ^[3]. Production contracts define explicit behavioral rules with measurable thresholds, such as "authentication code must pass security tests" or "JSON outputs must be valid." Risk-category-based testing organizes evaluation around deployment risk areas rather than relying on a single aggregate score, which prevents critical regressions in formatting or safety from being masked by overall performance improvements. Compatibility gates block updates that fail defined thresholds and require review before a model update is adopted into production. None of these ideas is new to software engineering. What is new is recognizing that context (the system prompts, retrieved documents, and configuration files that shape AI behavior) is a dependency that deserves the same governance.
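The three components can be sketched together in a few lines. This is an illustration under stated assumptions rather than the cited framework's implementation: `Contract`, `compatibility_gate`, the contract names, the risk categories, and all thresholds and pass rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """A production contract: a named behavioral rule with a measurable threshold."""
    name: str
    risk_category: str
    min_pass_rate: float


def compatibility_gate(contracts, pass_rates):
    """Return (adopt, violations), with violations grouped by risk category.

    A contract with no recorded result is treated as failing, so an
    untested behavior can never pass the gate by omission.
    """
    violations = {}
    for contract in contracts:
        observed = pass_rates.get(contract.name, 0.0)
        if observed < contract.min_pass_rate:
            violations.setdefault(contract.risk_category, []).append(
                (contract.name, observed, contract.min_pass_rate)
            )
    return (not violations, violations)


# Hypothetical contracts and thresholds.
contracts = [
    Contract("json_outputs_valid", "formatting", 1.0),
    Contract("auth_code_passes_security_tests", "security", 1.0),
    Contract("answers_grounded_in_sources", "groundedness", 0.9),
]

# Hypothetical pass rates measured after a model or context update.
pass_rates = {
    "json_outputs_valid": 0.97,
    "auth_code_passes_security_tests": 1.0,
    "answers_grounded_in_sources": 0.95,
}

adopt, violations = compatibility_gate(contracts, pass_rates)
print(adopt)       # False
print(violations)  # {'formatting': [('json_outputs_valid', 0.97, 1.0)]}
```

Grouping violations by risk category keeps a formatting failure visible even when every other category passes, which is the masking problem that a single aggregate score creates.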

A readiness harness for LLM and RAG applications demonstrates what this looks like in practice ^[4]. The system combines automated benchmarks, OpenTelemetry observability (a standardized way to collect and export telemetry data like traces, metrics, and logs), and CI quality gates (automated checkpoints in the deployment pipeline that block releases if quality checks fail) under a minimal API contract. Rather than reducing readiness to a single metric, it aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and latency into scenario-weighted readiness scores. In ticket-routing experiments, the regression gates consistently rejected unsafe prompt variants before deployment ^[4]. This is a concrete example of the shift from "the model was tested" to "the deployment pipeline tested every context change before it reached production."
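One way to picture a scenario-weighted readiness score is as a weighted average of per-scenario metric means, compared against a baseline in CI. The cited harness's actual scoring formula is not reproduced here. The function names, scenario names, weights, metric values, and the `max_drop` margin below are all assumptions, and metrics are assumed to be normalized to the range 0 to 1 (cost and latency would need such normalization before they could be included).

```python
def readiness_score(scenarios):
    """Weighted average of each scenario's mean metric value (metrics in 0..1)."""
    total_weight = sum(s["weight"] for s in scenarios)
    weighted_sum = sum(
        s["weight"] * sum(s["metrics"].values()) / len(s["metrics"])
        for s in scenarios
    )
    return weighted_sum / total_weight


def passes_gate(candidate_score, baseline_score, max_drop=0.02):
    """CI gate: reject a change whose score falls more than max_drop below baseline."""
    return candidate_score >= baseline_score - max_drop


# Hypothetical scenarios for a ticket-routing application.
scenarios = [
    {
        "name": "routine_routing",
        "weight": 3,
        "metrics": {"workflow_success": 0.9, "policy_compliance": 1.0, "groundedness": 0.8},
    },
    {
        "name": "escalation",
        "weight": 1,
        "metrics": {"workflow_success": 0.5, "policy_compliance": 0.7, "groundedness": 0.6},
    },
]

score = readiness_score(scenarios)
print(round(score, 3))
print(passes_gate(score, baseline_score=0.83))
```

A weighted score of this kind is still an aggregate, so in practice it would sit alongside per-dimension checks rather than replace them.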

One challenge specific to AI systems is that the same configuration can produce different outputs across runs. Traditional binary pass/fail testing struggles with this non-determinism. A regression testing framework designed for this problem replaces binary verdicts with three-valued probabilistic outcomes (Pass, Fail, Inconclusive) backed by confidence intervals and sequential analysis ^[5]. The framework achieves a 78% to 100% cost reduction compared to naive repeated testing while maintaining statistical guarantees, and its behavioral fingerprinting approach achieves 86% detection power on regressions where binary pass/fail testing has 0% ^[5]. The cost reduction matters as much as the accuracy. Testing that is too expensive to run routinely is testing that does not get run, and context changes that do not get tested are the ones that cause production surprises.
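The idea of a three-valued verdict can be illustrated with a standard confidence interval. The sketch below uses a fixed-sample Wilson score interval as a stand-in. It does not implement the sequential analysis or behavioral fingerprinting of the cited framework, and the function names, threshold, and trial counts are assumptions chosen for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (center - half_width, center + half_width)


def verdict(successes, trials, threshold, z=1.96):
    """Three-valued outcome for 'true pass rate is at least threshold'."""
    low, high = wilson_interval(successes, trials, z)
    if low >= threshold:
        return "Pass"          # whole interval is at or above the threshold
    if high < threshold:
        return "Fail"          # whole interval is below the threshold
    return "Inconclusive"      # interval straddles the threshold; run more trials


print(verdict(98, 100, threshold=0.9))  # Pass
print(verdict(60, 100, threshold=0.9))  # Fail
print(verdict(9, 10, threshold=0.9))    # Inconclusive
```

The third case is the instructive one: nine successes in ten runs looks like a 90% pass rate, but ten runs cannot distinguish 90% from 60%, so the honest answer is to keep sampling rather than to declare a result.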

From Files to Living Systems

The governance patterns above treat context as a versioned artifact, something written, tested, and deployed. But a growing body of work suggests that this framing, while useful, captures only part of the picture. In production multi-agent systems, context is not a file. It is a runtime-constructed "View" projected into an agent's context window (the maximum amount of text a model can consider at once) from a pool of global artifacts, and that View changes dynamically based on the task, the step, and the state of the system ^[6].

Research on what the authors call "Loosely-Structured Software" characterizes this as a class of system whose defining property is runtime generation and evolution under uncertainty ^[6]. Classic software architecture assumes build-time decomposition and slow-changing boundaries. Multi-agent AI systems violate those assumptions in three ways. First, an agent's effective program is determined not by compiled code but by a View assembled at runtime from system prompts, skills, plans, tools, and memories. Second, the connections between components form dynamically through semantic understanding rather than fixed function signatures. Third, the system's own executable substrate, the artifacts that mediate its behavior, can be rewritten by the system itself.

To make this governable, the research proposes a three-layer engineering framework ^[6]. View/Context Engineering manages the execution environment and maintains task-relevant Views. This is the layer where the static context files that teams already write (the CLAUDE.md and AGENTS.md files examined in "Context is the New Code") get assembled, filtered, and delivered at runtime. Structure Engineering organizes the dynamic bindings between agents and artifacts, governing how components find and connect to each other. Evolution Engineering manages the lifecycle of self-rewriting artifacts, ensuring that when the system modifies its own context (a capability that "The Edge of the Underdefined" documents self-improving agents already demonstrating), those modifications remain within governed bounds.
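The View/Context Engineering layer can be pictured as a function that projects a task-relevant subset of a global artifact pool into a bounded window. The sketch below is an illustration only, not the cited paper's design: `Artifact`, `build_view`, the tag-overlap relevance rule, and the character budget (a crude proxy for a token budget) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A hypothetical entry in the global artifact pool."""
    name: str
    kind: str          # e.g. "system_prompt", "skill", "plan", "tool", "memory"
    tags: frozenset
    text: str


def build_view(pool, task_tags, budget_chars):
    """Project a View: system prompts plus artifacts sharing a tag with the task,
    most relevant first, admitted until the character budget is exhausted."""
    relevant = [a for a in pool if a.kind == "system_prompt" or a.tags & task_tags]
    # System prompts sort first, then artifacts by descending tag overlap.
    relevant.sort(key=lambda a: (a.kind != "system_prompt", -len(a.tags & task_tags)))
    view, used = [], 0
    for artifact in relevant:
        if used + len(artifact.text) <= budget_chars:
            view.append(artifact)
            used += len(artifact.text)
    return view


# Hypothetical pool for a customer-support agent.
pool = [
    Artifact("shipping_skill", "skill", frozenset({"shipping"}), "How to trace a parcel."),
    Artifact("refund_skill", "skill", frozenset({"refund"}), "How to issue a refund."),
    Artifact("base_prompt", "system_prompt", frozenset(), "You are a support agent."),
    Artifact("customer_memory", "memory", frozenset({"refund", "customer"}),
             "Customer asked about a refund last week."),
]

view = build_view(pool, task_tags={"refund", "customer"}, budget_chars=500)
print([a.name for a in view])  # ['base_prompt', 'customer_memory', 'refund_skill']
```

The same pool yields a different View for a shipping task, which is the sense in which the agent's effective program is assembled at runtime rather than fixed at build time.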

This is where context infrastructure becomes genuinely adaptive. Instead of choosing between static configuration files (reliable but rigid) and autonomous self-modification (flexible but ungoverned), the three-layer framework offers a middle path. Context can evolve in response to operational feedback, while infrastructure constraints prevent that evolution from drifting outside acceptable bounds. The combination of governance patterns from the supply chain framing with the runtime adaptivity from the loosely-structured software framing points toward a more complete picture of what production context infrastructure might look like.

The Maturity Opportunity

The infrastructure patterns described here (production contracts, multi-dimensional evaluation, CI gates, statistical regression testing, and runtime View management) each have working implementations backed by empirical evidence. The gap between what the research demonstrates and what most teams have actually built is mostly one of adoption, not of available tools.

Survey data suggests that prompt usage in software engineering remains largely ad hoc, with prompts refined through trial-and-error and rarely reused. As "Context is the New Code" noted, only about 5% of surveyed open-source repositories have adopted any context file format at all. The parallel to early unit testing adoption or early version control adoption is hard to miss. A practice that starts as optional among a skilled minority tends to become standard once enough teams experience the cost of not doing it.

What distinguishes this moment is that the infrastructure does not need to be invented from scratch. Supply chain governance, production testing methodology, continuous deployment practice, and statistical experiment design all have established patterns that transfer directly to context management. Treating context as infrastructure is largely a matter of applying existing engineering discipline to a new class of artifact, one that happens to shape every decision an AI system makes.

The teams moving fastest appear to be the ones that recognized this early. They built the infrastructure to measure, test, and govern the context their models consume, and that investment compounded over time. For teams still tuning prompts by hand and evaluating by feel, the patterns are available to adopt directly, without rediscovering the hard lessons from scratch.

References

[1] D. Commey, "When 'Better' Prompts Hurt: Evaluation-Driven Iteration for LLM Applications," arXiv, 2026, [Online]
[2] E. Calboreanu, "Context Engineering: A Practitioner Methodology for Structured Human-AI Collaboration," arXiv, 2026, [Online]
[3] M. S. Chishti et al., "Test Before You Deploy: Governing Updates in the LLM Supply Chain," arXiv, 2026, [Online]
[4] A. C. Maiorano, "LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications," arXiv, 2026, [Online]
[5] V. P. Bhardwaj, "AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows," arXiv, 2026, [Online]
[6] W. Zhang et al., "Loosely-Structured Software: Engineering Context, Structure, and Evolution Entropy in Runtime-Rewired Multi-Agent Systems," arXiv, 2026, [Online]
[7] L. Chen et al., "How Is ChatGPT's Behavior Changing over Time?," arXiv, 2023, [Online]
