Mark Williams
Apr 10, 2026
AI Engineering

This is the second article in "The Meta-Engineer," a three-part series examining how AI is reshaping the identity and skill set of software engineers. The first article is "Context is the New Code."

The first article in this series described a new category of software artifact: configuration files that tell AI coding agents how to behave within a particular codebase. Those files have a measurable impact on agent efficiency and output quality. But they immediately raise a deeper question. If structured context is the foundation of effective agent use, who creates it, and what does the rest of the work actually look like?

The popular narrative about coding agents splits into two contradictory claims. One holds that agents are replacing developers, writing code at a pace no human can match. The other insists they are merely fancier autocomplete, useful for boilerplate but incapable of real engineering. A growing body of field research, large-scale repository analysis, and detailed practitioner case studies supports neither version. Professional developers are using agents extensively, but in a mode that looks nothing like "vibe coding," the practice of trusting AI output without careful review. They plan, supervise, validate, and increasingly build elaborate infrastructure to keep agents effective across complex, long-running projects. The work has not disappeared. It has changed shape.

What Professionals Actually Do

A field study combining 13 in-depth observations with a qualitative survey of 99 experienced developers found a consistent pattern [1]. Professional developers value agents as a productivity boost, but they retain authority over software design and implementation. They plan before implementing, validate all agent outputs, and insist on fundamental quality attributes like maintainability, test coverage, and architectural coherence. Developers found agents well-suited to straightforward, well-described tasks but not to complex ones involving architectural judgment or unfamiliar domains. The relationship resembles less a pair-programming partnership than a delegation arrangement in which the human sets the specification and reviews the results.

"The role is more… if you think of it like a conductor of sorts as opposed to the actual instrument player."

Practitioner quoted in Chang et al., 2025

A separate qualitative analysis of 57 practitioner videos published between late 2024 and October 2025 painted a complementary picture [2]. Developers consistently describe their evolving role using the metaphor of a conductor, someone who directs rather than plays. The cognitive load has not decreased so much as shifted. Instead of grappling with syntax, APIs, and repetitive implementation details, developers devote greater attention to domain modeling, architectural decisions, and system integration. Natural language has become the primary medium of software composition, but the reasoning behind that language, the judgment about what to build and why, remains firmly human. The study also raised a specific warning about junior engineers who accept AI output without understanding it, creating what practitioners described as "house of cards" solutions that compile and pass tests but rest on foundations no one in the room actually understands.

The scale of adoption is already substantial and growing fast. A study of over 129,000 GitHub projects found that between 15.8% and 22.6% show traces of coding agent use, a remarkably high figure for tools that have existed in their current form for less than a year [3]. Agent-assisted commits tend to be larger than purely human commits and focus disproportionately on features and bug fixes, suggesting developers use agents for substantive production work rather than experimentation. A complementary dataset of over 456,000 agent-generated pull requests (proposed code changes submitted to a repository for review) across 61,000 repositories reinforced the trend [4]. OpenAI Codex alone produced more than 400,000 pull requests within two months of its release. Developers appear to work in two distinct modes, using agents for "acceleration" on familiar tasks where the goal is speed, and for "exploration" of unfamiliar design spaces where the goal is learning. The relevant productivity question, one that frameworks like SPACE address by measuring satisfaction, collaboration, and efficiency alongside raw throughput, is not how fast agents generate code but how effectively the combined human-agent system produces correct, maintainable software.

What these studies collectively describe is neither replacement nor mere assistance. The developer's contribution has shifted from producing code to producing specifications, constraints, and quality judgments, a transition that turns out to demand more expertise rather than less.

When a Config File Isn't Enough

The configuration files described in the first article, CLAUDE.md and AGENTS.md, work well for modest-sized projects. A few hundred lines of instructions can orient an agent to a codebase's conventions, testing expectations, and architectural patterns. But what happens when a project reaches 108,000 lines of code, spans 45 subsystems, and defines 35 network message types? A single file no longer suffices.


Three Tiers of Machine Memory

Just as a large library organizes its holdings into different levels of accessibility, with reference materials on open shelves, specialized texts in reserve, and archival documents retrieved on request, a sufficiently complex software project needs layered knowledge infrastructure for its AI agents. A detailed case study documented exactly what this looks like [5]. A researcher built a 108,000-line C# distributed system using Claude Code as the sole code-generation tool, developing a three-tier context architecture across 283 development sessions. The first tier, a "hot memory" constitution of roughly 660 lines, loaded into every agent session automatically. It encoded naming conventions, build commands, and orchestration protocols. The second tier comprised 19 specialized domain-expert agents, each responsible for a specific subsystem like networking, physics, or UI, totaling around 9,300 lines. The third tier was a cold-memory knowledge base of 34 on-demand specification documents served through a retrieval tool only when relevant. The total context infrastructure amounted to about 26,000 lines, roughly 24% of the codebase it supported.
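
The mechanics are easier to see in miniature. The sketch below is a hypothetical Python illustration of the three-tier pattern, not the case study's actual tooling; the file paths, subsystem names, and keyword-based retrieval are all assumptions made for the example.

    from pathlib import Path

    # Tier 1: "hot memory" constitution, injected into every agent session.
    HOT_MEMORY = Path("context/CONSTITUTION.md")

    # Tier 2: domain-expert briefs, one per subsystem, loaded only for relevant tasks.
    DOMAIN_EXPERTS = {
        "networking": Path("context/experts/networking.md"),
        "physics": Path("context/experts/physics.md"),
        "ui": Path("context/experts/ui.md"),
    }

    # Tier 3: cold-memory specification documents, retrieved only on demand.
    COLD_SPECS_DIR = Path("context/specs")

    def build_context(task: str, subsystems: list[str]) -> str:
        """Assemble the context for one agent session (illustrative only)."""
        parts = [HOT_MEMORY.read_text()]                      # always loaded
        for name in subsystems:                               # selected per task
            parts.append(DOMAIN_EXPERTS[name].read_text())
        for spec in COLD_SPECS_DIR.glob("*.md"):              # naive keyword retrieval
            if any(word in spec.stem for word in task.lower().split()):
                parts.append(spec.read_text())
        return "\n\n---\n\n".join(parts)

    # A networking task pulls the constitution, the networking brief, and any
    # specification documents whose file names match words in the task description.
    context = build_context("add heartbeat message type", ["networking"])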

One detail inverts a common assumption about who can do this kind of work: the researcher's primary background is in chemistry, not software engineering. Building complex software with agents may depend less on traditional coding skill and more on the ability to design knowledge architectures, to decompose a problem domain into structured components and write clear specifications for each. That is an architectural competency, but not necessarily a programming one. The context infrastructure itself was AI-generated under human architectural direction; the human's role was to decide what knowledge to capture and how to organize it.

Similar infrastructure patterns appear in other systems. A technical report on the OpenDev terminal agent described five-stage progressive context compaction that activates at increasing token pressure thresholds, from 70% to 99% of the model's context window capacity (the maximum amount of text it can consider at once) [6]. To counteract "instruction fade-out," the phenomenon where agents gradually stop following their original instructions as a conversation grows longer, the system injects event-driven reminders at key decision points rather than relying solely on the initial prompt. A three-tier Skills hierarchy, spanning built-in, project-level, and user-defined instructions, manages reusable templates through lazy loading, injecting only what each specific task requires. These are infrastructure-level solutions to a problem that anyone running a long agent session has encountered.
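
In outline, the two mechanisms are simple. The fragment below sketches staged compaction and event-driven reminders in Python; the intermediate thresholds, stage actions, and reminder messages are assumptions for illustration, not OpenDev's implementation.

    # Each stage fires at a higher fraction of the context window and discards
    # progressively more, from old tool output up to whole early exchanges.
    COMPACTION_STAGES = [
        (0.70, "truncate_old_tool_output"),
        (0.80, "summarize_completed_subtasks"),
        (0.90, "drop_resolved_file_snapshots"),
        (0.95, "summarize_conversation_prefix"),
        (0.99, "keep_only_task_and_recent_turns"),
    ]

    def select_compaction(used_tokens: int, window: int) -> str | None:
        """Return the most aggressive stage whose threshold has been crossed."""
        pressure = used_tokens / window
        action = None
        for threshold, stage in COMPACTION_STAGES:
            if pressure >= threshold:
                action = stage
        return action

    # Event-driven reminders: instead of trusting the initial prompt to survive a
    # long session, re-inject key rules when the agent reaches a decision point.
    REMINDERS = {
        "before_file_write": "Re-read the coding conventions before editing.",
        "before_commit": "Run the test suite and report failures verbatim.",
    }

    def maybe_remind(event: str, messages: list[dict]) -> None:
        if event in REMINDERS:
            messages.append({"role": "system", "content": REMINDERS[event]})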

Multi-agent approaches add another dimension of complexity and capability. A study of context engineering for coordinated coding systems found that retrieving both external knowledge (research papers and documentation) and internal codebase context (project files and conventions) substantially improved task resolution on the SWE-Bench Lite benchmark, a widely used test suite for evaluating whether agents can resolve real GitHub issues [7]. The multi-agent approach yielded higher single-shot success rates than single-agent baselines, at the cost of roughly 3 to 5 times more tokens per task. Dividing work among specialized sub-agents, each operating within a focused context window, reduced hallucinations (plausible but incorrect AI-generated content) and improved adherence to project conventions. But orchestrating multiple agents introduced its own complexity. Someone had to design the task decomposition, define agent roles, and ensure shared state remained consistent. For now, that someone is a human.
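
A minimal orchestration sketch makes the trade-off concrete. The assumed pipeline below (a planner, per-subsystem implementers, a final reviewer) is a simplification rather than the study's system, and call_model is a stub standing in for a real LLM call.

    from dataclasses import dataclass

    @dataclass
    class SubTask:
        role: str            # e.g. "planner", "implementer", "reviewer"
        instructions: str
        context: str         # only the files and docs this sub-agent needs

    def call_model(prompt: str) -> str:
        """Placeholder for a real LLM call."""
        raise NotImplementedError

    def run_multi_agent(issue: str, codebase_slices: dict[str, str], docs: str) -> str:
        # A planner sees the issue plus retrieved external knowledge.
        plan = call_model(f"Plan a fix for:\n{issue}\n\nRelevant docs:\n{docs}")
        # Each implementer gets a focused context window (one subsystem's files),
        # which is what limits hallucination and convention drift.
        patches = []
        for subsystem, files in codebase_slices.items():
            sub = SubTask(
                role="implementer",
                instructions=f"Apply this plan to the {subsystem} subsystem:\n{plan}",
                context=files,
            )
            patches.append(call_model(sub.instructions + "\n\n" + sub.context))
        # A reviewer reconciles the patches without ever seeing the whole repository.
        return call_model("Reconcile these patches into one change:\n" + "\n".join(patches))

The extra planner, implementer, and reviewer calls are where the 3 to 5 times token overhead comes from; the payoff is that no single context window has to hold the entire problem.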

The Orchestration Paradox

The orchestration patterns that professionals develop (decomposing tasks, routing work to specialized agents, maintaining shared memory across sessions) represent genuine engineering skill. They also represent the next thing likely to be automated.

The Darwin Gödel Machine demonstrated this directly [8]. Rather than relying on a fixed, human-designed coordinator to direct improvements, the system iteratively modified its own codebase, including its own orchestration logic, and empirically validated each change against coding benchmarks. On SWE-bench, it improved performance from 20% to 50%. On the Polyglot benchmark, which tests across six programming languages, it improved from 14.2% to 30.7%. The key architectural insight is that this is a single system that both solves coding problems and refines its own implementation, removing the need for a separate, hand-crafted meta-agent. The better tools and workflows it discovered were not anticipated by its designers.
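
The core loop is simple to state, even if the engineering around it is not. The sketch below is a schematic reconstruction of empirically validated self-modification with an archive of variants, not the Darwin Gödel Machine's actual code; every function in it is a placeholder.

    import random

    def propose_modification(agent_source: str) -> str:
        """Ask the agent to rewrite part of its own source (stubbed)."""
        raise NotImplementedError

    def benchmark_score(agent_source: str) -> float:
        """Run the candidate agent against a coding benchmark (stubbed)."""
        raise NotImplementedError

    def evolve(initial_source: str, generations: int) -> list[tuple[str, float]]:
        # Keep an archive of all scored variants, not just the current best,
        # so later generations can branch from earlier, different designs.
        archive = [(initial_source, benchmark_score(initial_source))]
        for _ in range(generations):
            parent, _ = random.choice(archive)        # open-ended: branch from any ancestor
            child = propose_modification(parent)      # the agent edits its own code
            score = benchmark_score(child)            # empirical validation, not proof
            archive.append((child, score))
        return archive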

When the Wrapper Becomes Redundant

Just as a machine tool capable of manufacturing other machine tools represents a fundamentally different category than one that merely stamps out parts, a coding agent that can edit its own source code occupies a different position than one that simply follows instructions. The SICA system (Self-Improving Coding Agent) demonstrated this by autonomously modifying its own Python codebase, improving from 17% to 53% on a subset of SWE-Bench Verified [9]. One finding proved particularly telling. When a reasoning model was provided as a sub-component, crude reasoning scaffolds that SICA had built for itself actually hurt performance, because the model's native reasoning was better than the agent's self-designed wrapper. This is a concrete instance of a recurring compression pattern, where a layer that was necessary at one capability level becomes counterproductive when the underlying system matures.


Meanwhile, trajectory-informed memory generation already automates the extraction of structured lessons from agent execution histories [10]. Rather than relying on humans to document what worked and what failed after each session, the system analyzes completed task trajectories, identifies which decisions led to successes or failures through causal attribution, and generates categorized guidance for future runs, including strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient successes. On the AppWorld benchmark, this approach improved task completion by up to 14.3 percentage points, with the strongest gains on the most complex tasks. This is essentially an automated version of the "lessons learned" process that the researcher in the codified-context case study [5] performed manually across 283 development sessions.
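
In outline, the mechanism resembles the sketch below, an assumed simplification rather than the paper's implementation: completed trajectories are sorted into the three tip categories, with a stubbed summarization call standing in for the causal-attribution step.

    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        task: str
        steps: list[str]
        succeeded: bool
        recovered_from_error: bool
        step_budget: int          # how many steps a typical solution needs

    def summarize(trajectory: Trajectory, focus: str) -> str:
        """Placeholder for an LLM call that attributes the outcome to decisions."""
        raise NotImplementedError

    def generate_memory(trajectories: list[Trajectory]) -> dict[str, list[str]]:
        memory = {"strategy": [], "recovery": [], "optimization": []}
        for t in trajectories:
            if t.succeeded and not t.recovered_from_error:
                memory["strategy"].append(summarize(t, "what made this work"))
            if t.recovered_from_error:
                memory["recovery"].append(summarize(t, "how the failure was handled"))
            if t.succeeded and len(t.steps) > t.step_budget:
                memory["optimization"].append(summarize(t, "where steps were wasted"))
        return memory

    # The resulting tips are injected into future sessions the same way a
    # human-written "lessons learned" document would be.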

The pattern is consistent across these studies. The conductor role that practitioners are developing right now is structurally similar to what self-improving systems are learning to do autonomously. Decomposing tasks, routing to specialists, and refining strategies based on execution feedback are exactly the capabilities that agent systems are acquiring through their own operation. Code completion automated the first layer of developer effort. Context configuration is being formalized now. Orchestration appears to be next.

The final article in this series will take up the question this observation raises. If the orchestration layer compresses too, what remains durably human? The answer, the evidence across these studies suggests, has less to do with any particular abstraction level and more to do with wherever meaning is still underdefined.

References

  1. R. Huang et al., "Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025," arXiv, 2025, [Online]
  2. H.-F. Chang et al., "Coding With AI: From a Reflection on Industrial Practices to Future Computer Science and Software Engineering Education," arXiv, 2025, [Online]
  3. R. Robbes et al., "Agentic Much? Adoption of Coding Agents on GitHub," arXiv, 2026, [Online]
  4. H. Li et al., "The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering," arXiv, 2025, [Online]
  5. A. Vasilopoulos, "Codified Context: Infrastructure for AI Agents in a Complex Codebase," arXiv, 2026, [Online]
  6. N. D. Q. Bui, "Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned," arXiv, 2026, [Online]
  7. M. Haseeb, "Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code," arXiv, 2025, [Online]
  8. J. Zhang et al., "Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents," in Proc. International Conference on Learning Representations (ICLR), 2026, [Online]
  9. M. Robeyns et al., "A Self-Improving Coding Agent," arXiv, 2025, [Online]
  10. G. Fang et al., "Trajectory-Informed Memory Generation for Self-Improving Agent Systems," arXiv, 2026, [Online]
