The Turn as the Unit of Quality

What makes iterative refinement productive, and when it starts to hurt

Iterative refinement is one of the defining features of how language models are used in practice. Rather than producing a final result in a single pass, users and autonomous agents refine outputs across multiple turns of interaction. Early work on self-feedback and verbal reflection established that this approach reliably outperforms single-pass generation. But how reliably, and for how long?

A controlled study that ran 12-turn refinement conversations across ideation, code generation, and mathematical reasoning found that the answer depends almost entirely on what kind of feedback each turn provides ^[1]. In code and ideation tasks, gains arrived early and then plateaued. In math, late turns could still help, but only when the feedback was specific. Across all domains, vague instructions like "improve it" or "make it better" produced rapid saturation and output bloat. Targeted feedback addressing explicit quality dimensions, things like "check whether the function handles the empty-list case" or "verify that the conclusion follows from the data in section two," sustained productive iteration further ^[1]. The difference between useful iteration and destructive iteration had little to do with the number of turns taken. It had everything to do with what happened inside each one.

This finding connects three ideas that keep appearing across recent AI systems research. Structured checklists decompose quality into individually verifiable criteria, formalizing what "targeted feedback" actually means. Selective memory architectures decide what to retain and what to forget between turns, preventing the context window from becoming a graveyard of stale instructions. Deterministic validation layers enforce constraints that probabilistic models cannot guarantee on their own. Each imposes structure on what would otherwise be an open-ended, drift-prone process.

Why Turns Go Wrong

Understanding why unstructured iteration degrades output requires looking at what happens inside a model's context window (the maximum amount of text a model can consider at once) as turns accumulate. Research on the "lost in the middle" phenomenon showed that language model performance is highest when relevant information appears at the beginning or end of the input, and drops significantly when the model must access information positioned in the middle of long contexts ^[2]. As a conversation grows, instructions from earlier turns are not just diluted by newer content. They drift toward the middle of the window, the region where models use information least reliably. A survey covering over 1,400 research papers formalized this challenge by decomposing context engineering into three stages (retrieval, processing, and management), each introducing its own failure modes ^[3]. The default mode of iterative interaction, appending each turn's output to a growing window without structured curation, therefore works against sustained quality from the start.

Checklists That Steer

[Image: close-up of an audio mixing console with rows of individual channel faders]

A sound engineer at a mixing console adjusts each channel independently, setting levels for bass, treble, reverb, and compression on separate faders rather than turning a single "make it sound better" knob. Structured quality evaluation works the same way. The TICK framework demonstrated that decomposing quality into checklist-based yes/no questions is more reliable for both humans and language models than holistic scoring ^[4]. Answering "Does the response address the user's budget constraint?" is a simpler cognitive task than assigning an overall quality rating on a 10-point scale. The decomposition reduces the inconsistency that plagues open-ended judgments, and composable pipelines like AutoChecklist can now generate such criteria automatically from a task description ^[5].
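The contrast between yes/no checklist items and a single holistic score can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TICK or AutoChecklist implementation: `ChecklistItem`, `checklist_score`, and `keyword_judge` are hypothetical names, the second question is invented for the example, and the keyword lookup stands in for what would normally be a model call that answers each question.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChecklistItem:
    question: str  # phrased so that "yes" means the criterion is met


def checklist_score(
    response: str,
    items: list[ChecklistItem],
    judge: Callable[[str, str], bool],
) -> tuple[float, list[str]]:
    """Return the fraction of items answered 'yes' and the questions that failed."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    failed = [item.question for item in items if not judge(response, item.question)]
    return (len(items) - len(failed)) / len(items), failed


def keyword_judge(response: str, question: str) -> bool:
    # Hypothetical stand-in for a model call: one keyword per question.
    keywords = {
        "Does the response address the user's budget constraint?": "budget",
        "Does the response name a specific product?": "model x",
    }
    return keywords[question] in response.lower()


items = [
    ChecklistItem("Does the response address the user's budget constraint?"),
    ChecklistItem("Does the response name a specific product?"),
]
score, failed = checklist_score(
    "Within your budget, there are several options.", items, keyword_judge
)
print(score, failed)
```

The useful output is the list of failed questions rather than the number: each failed question can be handed back to the model as targeted feedback for the next turn.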

This connects directly to the 12-turn study's central finding. When Javaji et al. compared vague "improve it" feedback against prompts targeting specific quality dimensions, the targeted version sustained improvement over more turns precisely because it functioned as a single-item checklist ^[1]. A multi-item checklist extends this logic by ordering quality dimensions by importance. Each turn addresses the highest-priority unsatisfied criterion, and the checklist records what has already been verified so that subsequent turns do not undo earlier gains. The model is no longer guessing what "better" means. The checklist tells it.
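One way to picture the per-turn logic is a small planning function. The sketch below assumes each criterion has already been checked this turn and that the results arrive as a name-to-boolean mapping. `Criterion`, `plan_turn`, and the wording of the generated prompt are all hypothetical and do not come from the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    name: str
    priority: int      # lower number = more important
    instruction: str   # targeted feedback to send when the criterion is unmet


def plan_turn(
    criteria: list[Criterion], results: dict[str, bool]
) -> Optional[tuple[str, str]]:
    """Pick the highest-priority unmet criterion and build the next prompt.

    Returns (criterion name, prompt), or None when every criterion passes.
    """
    ordered = sorted(criteria, key=lambda c: c.priority)
    verified = [c.name for c in ordered if results[c.name]]
    unmet = [c for c in ordered if not results[c.name]]
    if not unmet:
        return None  # nothing left to fix: stop iterating
    target = unmet[0]
    prompt = target.instruction
    if verified:
        # Restate what already passes so the next turn does not undo it.
        prompt += " Do not change what already passes: " + ", ".join(verified) + "."
    return target.name, prompt


criteria = [
    Criterion("style", 3, "Tighten the docstring."),
    Criterion("empty-list", 1, "Check whether the function handles the empty-list case."),
    Criterion("types", 2, "Add type hints."),
]
print(plan_turn(criteria, {"style": False, "empty-list": True, "types": False}))
```

Returning `None` when nothing is unmet gives the loop an explicit stopping condition, which is what a bare "improve it" prompt lacks.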

This pattern appears in practitioner tools as well. The Codified Context framework, developed during construction of a 108,000-line C# distributed system, included a "constitution" file that functioned as a prioritized checklist ^[6]. Naming conventions came first, build commands second, orchestration protocols third. The ordering was not arbitrary. It reflected which violations were most costly to fix if left uncaught. Across 283 development sessions, this structure prevented repeated failures by ensuring each session validated high-priority constraints before moving to less critical ones ^[6]. The criteria themselves can be generated by a model, but the prioritization, the decision about which quality dimension matters most, still required human judgment about costs and consequences.

Remembering What Matters

[Image: old books densely packed on a library shelf]

A library that never removes a book eventually buries its most valuable references under sheer accumulation. AI memory faces a similar problem. A checklist that structures each turn is only useful if the system remembers what was checked and what was found, but retaining everything introduces its own degradation.

The Agentic Context Engineering (ACE) framework named two failure modes that make this concrete ^[7]. Brevity bias is the tendency for iterative optimization to compress rich context into short, generic summaries that strip away the domain-specific knowledge that actually made previous turns successful. A detailed playbook that says "when the build fails on the orchestration layer, check the gRPC timeout before restarting the container" gets summarized into "handle build failures appropriately," and the specific knowledge that prevented a two-hour debugging session disappears. Context collapse is the complementary failure. Successive rewrites gradually erode important details, each individual edit seeming reasonable in isolation but the cumulative effect hollowing out the context's value ^[7].

ACE addressed both by treating context as an evolving playbook updated through structured, incremental additions rather than wholesale rewrites, achieving a 10.6% improvement over strong baselines on agent benchmarks ^[7]. One counterintuitive finding from this work is that language models appear to perform better with long, detailed contexts than with tight summaries. Unlike humans, who benefit from concise briefings, LLMs can extract relevance from comprehensive inputs autonomously. Stripping context down for brevity's sake may sacrifice exactly the edge-case knowledge that separates correct output from output that merely compiles.
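The difference between incremental additions and wholesale rewrites can be made concrete with a toy playbook. This is a deliberately simplified, append-only sketch, not ACE's actual update mechanism: `Playbook`, `apply_delta`, and the entry id are hypothetical, and a real system would also need a way to revise or retire entries.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    entries: dict[str, str] = field(default_factory=dict)

    def apply_delta(self, additions: dict[str, str]) -> list[str]:
        """Add new entries and return their ids; existing entries are never rewritten."""
        added = []
        for entry_id, text in additions.items():
            if entry_id not in self.entries:
                self.entries[entry_id] = text
                added.append(entry_id)
        return added

    def render(self) -> str:
        """Serialize the playbook for inclusion in the next turn's context."""
        return "\n".join(f"[{entry_id}] {text}" for entry_id, text in self.entries.items())


playbook = Playbook()
playbook.apply_delta({
    "build-grpc": "When the build fails on the orchestration layer, "
                  "check the gRPC timeout before restarting the container.",
})
# A later "summarizing" pass tries to replace the entry with a generic line.
playbook.apply_delta({"build-grpc": "Handle build failures appropriately."})
print(playbook.render())  # the specific gRPC advice is still there
```

Because an existing id cannot be overwritten, the generic summary is dropped and the detailed entry survives, which is the property that guards against brevity bias in this toy version.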

The Dynamic Cheatsheet (DC) framework demonstrates what effective curation looks like in practice ^[8]. DC equips a language model with a persistent, self-curating external memory. After each query, the system explicitly decides which problem-solving strategies deserve to be kept, which should be discarded, and which existing entries should be updated. On math competition problems, one model's accuracy more than doubled (from 23% to 50%) by retaining algebraic insights across problems. On the Game of 24 puzzle, another model went from 10% to 99% by accumulating and reusing solution templates ^[8]. These gains did not come from better prompting or a larger model. They came from the system learning what was worth remembering, and what was not, across successive encounters with similar problems. Meta Context Engineering takes this one step further by having a separate agent optimize the curation procedures themselves, meaning even the format and structure of what gets remembered becomes subject to improvement ^[9].
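The keep, discard, and update decisions described above can be sketched as operations on a small key-value memory. In DC the model itself produces these decisions. Here they are passed in by hand, and `Decision`, `curate`, and the memory keys are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Decision:
    action: Literal["keep", "discard", "update"]
    key: str
    text: str = ""


def curate(memory: dict[str, str], decisions: list[Decision]) -> dict[str, str]:
    """Return a new memory with the decisions applied; untouched entries persist."""
    updated = dict(memory)
    for d in decisions:
        if d.action == "keep":
            # Store a newly learned strategy unless one already exists under this key.
            updated.setdefault(d.key, d.text)
        elif d.action == "discard":
            updated.pop(d.key, None)
        elif d.action == "update":
            if d.key not in updated:
                raise KeyError(f"cannot update missing entry: {d.key}")
            updated[d.key] = d.text
        else:
            raise ValueError(f"unknown action: {d.action}")
    return updated


memory = {
    "game24-template": "Try pairing numbers into factors of 24 first.",
    "stale-hint": "Always start with addition.",
}
memory = curate(memory, [
    Decision("discard", "stale-hint"),
    Decision("update", "game24-template", "Try factor pairs of 24 first, then 25 - 1 and 27 - 3."),
    Decision("keep", "algebra-substitution", "Substitute u = x + 1/x for symmetric polynomials."),
])
print(sorted(memory))
```

Returning a new dictionary instead of mutating in place keeps each curation step easy to log and compare, which matters when the curation procedure itself is something to be tuned.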

Hard Constraints for Soft Outputs

Checklists and selective memory both improve iteration quality, but they share a limitation. Both rely on the language model itself, or a similar model, to make evaluative judgments. A model asked to evaluate its own output against a checklist can exhibit the same biases and inconsistencies that it exhibits in generation. For constraints that must hold without exception, a different mechanism is needed, one that removes the model from the decision entirely.

The general principle is to separate what the model does well (natural language understanding, flexible reasoning, tolerant interpretation of ambiguous input) from what it does poorly (logical guarantees, strict constraint enforcement). VERUS-LM demonstrates this by splitting reasoning into two responsibilities ^[10]. The language model translates a task description into a formal representation. A symbolic reasoning engine then performs logically sound inference over that representation. On logical reasoning benchmarks, the advantage of this hybrid approach grew as task complexity increased ^[10]. The model is good at understanding what the problem is. The symbolic engine is good at solving it correctly. Neither works as well alone.
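The division of labor can be illustrated with a toy symbolic engine. The sketch below assumes the language model has already translated a task into facts and if-then rules. That translation step is not shown, and the forward-chaining loop is a generic textbook technique rather than the reasoner VERUS-LM uses. The facts, rules, and `forward_chain` are all hypothetical.

```python
def forward_chain(
    facts: set[str], rules: list[tuple[frozenset[str], str]]
) -> set[str]:
    """Derive every conclusion that follows from the facts under the given rules.

    Each rule is (premises, conclusion): if all premises hold, so does the conclusion.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived


# Formal representation that a model is assumed to have produced from prose.
facts = {"order_over_limit", "customer_new"}
rules = [
    (frozenset({"order_over_limit", "customer_new"}), "needs_review"),
    (frozenset({"needs_review"}), "hold_shipment"),
    (frozenset({"customer_vip"}), "skip_review"),
]
print(sorted(forward_chain(facts, rules)))
```

Given the same facts and rules, the engine always derives the same conclusions, so any remaining error can be traced to the translation step rather than to the inference.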

An application of this division of labor uses the Lean 4 theorem prover as a verification layer for financial compliance ^[11]. Every proposed action by the agent is translated into a formal logical proposition and verified by the Lean 4 proof kernel before execution. If the proof does not check, the action does not execute. There is no probability threshold, no confidence score, no "this looks right." A compliance rule under this architecture becomes a constraint enforced with mathematical certainty, independent of whatever the model's next-token distribution might prefer ^[11]. From a systems perspective, this is the kind of guarantee that makes the difference between a prototype and a production deployment in regulated industries.
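The gating pattern, in which an action runs only if every hard rule holds, can be sketched without a theorem prover. The example below uses plain Python predicates in place of the Lean 4 proof kernel, so it gives a deterministic check but no formal proof. `Transfer`, the three rules, and the daily limit are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Transfer:
    amount_cents: int
    daily_total_cents: int    # already sent today, before this transfer
    recipient_verified: bool


# Hypothetical hard constraints; each must hold for the action to run.
RULES: dict[str, Callable[[Transfer], bool]] = {
    "positive_amount": lambda t: t.amount_cents > 0,
    "daily_limit": lambda t: t.daily_total_cents + t.amount_cents <= 1_000_000,
    "verified_recipient": lambda t: t.recipient_verified,
}


def violations(action: Transfer) -> list[str]:
    """Return the names of all rules the proposed action breaks."""
    return [name for name, rule in RULES.items() if not rule(action)]


def execute_if_valid(
    action: Transfer, execute: Callable[[Transfer], None]
) -> tuple[bool, list[str]]:
    """Run `execute` only when no rule is violated; never on a partial pass."""
    failed = violations(action)
    if failed:
        return False, failed
    execute(action)
    return True, []


ledger: list[Transfer] = []
print(execute_if_valid(Transfer(50_000, 0, True), ledger.append))
print(execute_if_valid(Transfer(600_000, 500_000, False), ledger.append))
print(len(ledger))
```

The model can propose any action it likes, but the decision to execute is made by code whose result does not vary between runs. The returned rule names can be fed back as targeted feedback for the next turn.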

What This Suggests

The three mechanisms operate at different stages of the refinement cycle and address distinct failure modes. A checklist defines what "better" means for the current turn. Selective memory decides what to carry forward. Deterministic validation enforces constraints that must hold regardless of the model's probabilistic output.

Any one of these in isolation appears to be insufficient. A checklist without selective memory will eventually be overwhelmed by accumulated context. Selective memory without structured criteria risks curating toward the wrong quality dimensions. Deterministic validation without good memory and good criteria will enforce hard constraints on output that is otherwise drifting.

For teams building iterative workflows, whether for code generation, research, writing, or any domain where quality develops through successive passes, the practical takeaway is that the turn is the unit of design. The effort spent deciding what each turn evaluates, remembers, and enforces may matter at least as much as the effort spent on the initial prompt. Whether the structuring of turns will itself be automated, as early work on meta-level skill evolution tentatively suggests ^[9], or whether it will remain a domain where human judgment about priorities and consequences provides durable value, is a question the field has not yet answered.

References

[1] S. R. Javaji et al., "Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting," in Proc. NeurIPS 2025 Workshop on Multi-Turn Interactions with LLMs, 2025.
[2] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024.
[3] L. Mei et al., "A Survey of Context Engineering for Large Language Models," arXiv, 2025.
[4] J. Cook et al., "TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation," arXiv, 2024.
[5] K. Zhou and C. Tan, "AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge," arXiv, 2026.
[6] A. Vasilopoulos, "Codified Context: Infrastructure for AI Agents in a Complex Codebase," arXiv, 2026.
[7] Q. Zhang et al., "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models," in Proc. International Conference on Learning Representations (ICLR), 2026.
[8] M. Suzgun et al., "Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory," in Proc. European Chapter of the Association for Computational Linguistics (EACL), 2026.
[9] H. Ye et al., "Meta Context Engineering via Agentic Skill Evolution," arXiv, 2026.
[10] B. Callewaert, S. Vandevelde, and J. Vennekens, "VERUS-LM: A Versatile Framework for Combining LLMs with Symbolic Reasoning," arXiv, 2025.
[11] D. Rashie and V. Rashi, "Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving," arXiv, 2026.
