Guess First, Check Later

Speculative Execution as an Architectural Pattern Across Serving, Reasoning, and Agents

Close-up of a circuit board with intricate interconnected pathways, analogous to the processor technique of speculative execution that language model serving has borrowed

A processor that reaches a branch in a program does not wait to learn which way the branch goes. It predicts the likely path, runs ahead on that guess, and keeps the work if the guess was right or discards it if the guess was wrong. The technique is called speculative execution, and it has been part of computer architecture for decades. It pays off because speculating is cheap relative to waiting, and checking a guess is cheap relative to producing the answer from scratch. That asymmetry in cost is the whole reason the trick is worth the bookkeeping.

Large language models turn out to have the same asymmetry in several places, which may be why the same idea keeps getting rediscovered under different names. A pattern worth naming is one that shows up in more than one corner of a system, and the draft-then-verify shape now appears in token generation, code production, retrieval, and agent planning. Treating these as one pattern, rather than four unrelated tricks, suggests where the shared design problems and shared mistakes are likely to live.

One Pattern, Several Names

The clearest instance is speculative decoding, introduced for transformer inference in 2022. The observation behind it is that generating text one token at a time is slow not because the arithmetic is heavy but because the hardware spends most of its time moving the model's parameters from memory for each single token. A small, fast draft model proposes several tokens. The large target model then checks all of those proposed tokens in a single pass, which costs about the same as producing one token on its own. An acceptance rule keeps the longest prefix of the draft that is consistent with the large model's own distribution, and the large model supplies the next token itself, so every pass yields at least one token. The work demonstrated a two to three times speedup on a large model with identical outputs, no retraining, and no architecture change ^[1]. A parallel effort at DeepMind arrived at the same core method independently, reporting roughly a two to two and a half times speedup on a 70 billion parameter model while preserving the target model's output distribution exactly through a modified rejection sampling scheme ^[2].
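The accept-or-resample step can be made concrete with a toy example. What follows is a minimal sketch of the modified rejection sampling rule as it is usually described, not the code from either paper. The function names and the dictionary-based distributions are illustrative assumptions, and a real system would read these probabilities from one batched forward pass of the target model rather than receive them as arguments.

```python
import random


def sample_from(dist, rng):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def verify_draft(drafted, draft_probs, target_probs, rng):
    """Return the tokens kept after one draft-then-verify step.

    drafted:      tokens proposed by the draft model
    draft_probs:  draft_probs[i] is the draft distribution at position i
    target_probs: target_probs[i] is the target distribution at position i,
                  plus one extra entry for the position after the full draft
    (All names here are hypothetical, chosen for illustration.)
    """
    kept = []
    for i, token in enumerate(drafted):
        q = draft_probs[i][token]            # > 0, since the drafter sampled it
        p = target_probs[i].get(token, 0.0)
        # Accept the drafted token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            kept.append(token)
            continue
        # Rejected: resample from the leftover mass max(0, p - q), then stop.
        residual = {
            t: max(0.0, p_t - draft_probs[i].get(t, 0.0))
            for t, p_t in target_probs[i].items()
        }
        kept.append(sample_from(residual, rng))
        return kept
    # Every drafted token survived, so the target adds one more token for free.
    kept.append(sample_from(target_probs[len(drafted)], rng))
    return kept
```

When the draft and target distributions agree exactly, every drafted token is accepted and the step returns the full draft plus one extra token. When the target assigns zero probability to a drafted token, the step returns a single token drawn from the target's leftover mass, which is the worst case of one token per target pass.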

A 2024 survey of the area makes the lineage explicit. It describes speculative decoding as an adaptation of speculative execution from computer architecture, the same optimization where tasks are performed in advance and then verified for whether they were needed ^[3]. The survey also names the two design questions that govern whether the pattern helps. The first is how to build a drafter that balances speculation accuracy against drafting cost. The second is whether the verification step can stay parallel while still guaranteeing output quality. Both questions reappear, in slightly different clothing, everywhere else the pattern shows up.

Two later variants are worth noting because they show how much room the drafter side has. Medusa drops the separate draft model entirely and adds small extra prediction heads to the existing model, which propose several future tokens that a tree-based attention step verifies together, reporting roughly two to three and a half times speedup without a second model to maintain ^[4]. EAGLE moves the drafting down to the model's internal feature representations rather than its output tokens and reports a 2.7 to 3.5 times latency improvement while keeping the generated distribution unchanged ^[5]. The verification half stays constant across these variants. What changes is how cheaply and accurately the guess gets made.

The Same Shape Outside Token Generation

The reason to treat this as a pattern rather than an inference trick is that the draft-then-verify structure is not specific to tokens. It appears wherever cheap generation under uncertainty can be paired with a more trustworthy and relatively cheap check.

Code generation is the most familiar case. A model proposes an implementation, and a deterministic tool decides whether the proposal is acceptable. The tool might be a compiler, a type checker, or a test suite. The generator does not need to be right on the first attempt. It needs to be right often enough that the combined cost of generating and checking beats the cost of a slow, careful, single pass. The verifier here has a quality that the token-level case has to work hard to approximate, which is that a compiler or a passing test is an external and largely objective judgment rather than another opinion from the same family of model.
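The generate-and-check loop for code can be sketched in a few lines. This is an illustration under stated assumptions rather than a production harness: the list of candidate sources stands in for repeated calls to a model, the function names are hypothetical, and running generated code with exec is only acceptable inside a sandbox.

```python
def verify_candidate(source, tests):
    """Return True if the source compiles, runs, and passes every test.

    tests is a list of callables that take the executed namespace and
    raise (for example via assert) on failure.
    """
    try:
        code = compile(source, "<candidate>", "exec")  # syntax check
        namespace = {}
        exec(code, namespace)  # assumption: caller runs this in a sandbox
        for test in tests:
            test(namespace)
    except Exception:
        # SyntaxError, AssertionError, and runtime errors all mean "reject".
        return False
    return True


def draft_then_verify(candidates, tests):
    """Try candidates in order; return (accepted_source, attempts_used).

    candidates is a list of source strings standing in for model drafts.
    Returns (None, attempts_used) if nothing passes.
    """
    attempts = 0
    for source in candidates:
        attempts += 1
        if verify_candidate(source, tests):
            return source, attempts
    return None, attempts
```

The attempt count returned alongside the accepted source is the raw material for a pass rate, which is the number that decides whether cheap drafting is beating a slower single pass.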

A metal ruler laid across printed text on a page, analogous to a verifier checking a cheaply produced draft against a fixed and external standard

A ruler laid across a printed page does not write the text. It measures it against a fixed standard, and that division of labor is the heart of the pattern. Retrieval-augmented generation follows the same division. A fast similarity search over a vector index guesses which documents are likely to be relevant, and a reader model checks those candidates and uses the useful ones. The original retrieval-augmented generation work combined a pretrained generator with a dense vector index of Wikipedia accessed by a neural retriever, and found the combination produced more specific and more factual output than the generator alone ^[6]. The retriever is speculating about relevance. The reader corrects that speculation. The asymmetry holds, since the lookup is fast and the reader's pass over a handful of candidates is far cheaper than reasoning without any retrieval at all.
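The retriever-then-reader split can be shown with a toy index. This sketch assumes documents have already been embedded as plain lists of floats, and it replaces the reader model with a hypothetical predicate named is_useful. None of these names come from the original retrieval-augmented generation work.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, k):
    """Cheap guess: the k documents most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda doc: cosine(query_vec, doc["vec"]),
        reverse=True,
    )
    return ranked[:k]


def read(candidates, is_useful):
    """Check: keep only the candidates the reader judges useful.

    is_useful is a stand-in for a reader model's relevance judgment.
    """
    return [doc for doc in candidates if is_useful(doc)]
```

The fraction of retrieved candidates that survive the read step plays the same role here that acceptance rate plays in token-level drafting.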

Reasoning shows the pattern too, and it is where the cost balance gets most interesting. An early and influential result on grade-school math problems trained a separate verifier to judge the correctness of candidate solutions, generated many candidates at test time, and selected the one the verifier ranked highest. Verification improved accuracy and scaled better with more data than simply fine-tuning the generator harder ^[7]. Generating several cheap candidate solutions and spending the expensive judgment on selection is the same move as drafting several cheap tokens and spending the expensive forward pass on acceptance.
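Verifier-ranked selection has a very small core. The sketch below assumes two hypothetical callables, generate and score, standing in for the generator and the trained verifier. It shows only the selection logic, not how either model is trained.

```python
def select_best(question, generate, score, n):
    """Draft n candidate solutions and return the one the verifier ranks highest.

    generate(question) -> candidate solution (cheap, possibly wrong)
    score(question, candidate) -> float, higher means more likely correct
    Both callables are placeholders for real models.
    """
    candidates = [generate(question) for _ in range(n)]
    scored = [(score(question, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

The cost structure is visible in the signature: n cheap generations and n verifier calls buy one selected answer, so the arrangement only pays off when the verifier's ranking is better than picking a candidate at random.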

What Makes the Pattern Pay Off

Every instance of draft-then-verify lives or dies on one number, which is how often the verifier accepts the draft. If the drafter is poorly matched to the verifier, the verifier rejects nearly everything, and the system pays for two models while getting the output of one. The speculative decoding survey frames this as the central tension of drafter design, the trade between how accurate the speculation is and how cheap it is to produce ^[3]. A better drafter raises the acceptance rate but costs more to run, which narrows the very advantage the pattern exists to capture. There is an operating point that depends on the acceptance rate, the cost of the verifier, and the cost ratio between drafter and verifier, and from a systems perspective it is striking how often that operating point is chosen by intuition rather than measured.
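The operating point can be estimated rather than guessed. The sketch below uses the closed-form expectation that, to my understanding, matches the analysis in the original speculative decoding paper ^[1]. It rests on a simplifying assumption that each drafted token is accepted independently with the same probability. The names alpha (acceptance rate), gamma (tokens drafted per step), and cost_ratio (cost of one draft step relative to one target pass) are notation chosen for this example.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha; the target always contributes one token.
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected speedup over plain decoding with the target model.

    One step costs gamma draft passes plus one target pass, measured
    in units of a single target pass.
    """
    step_cost = gamma * cost_ratio + 1
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_draft_length(alpha, cost_ratio, max_gamma=16):
    """Pick the draft length in 1..max_gamma with the highest expected speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_speedup(alpha, g, cost_ratio),
    )
```

With an acceptance rate of zero the formula gives a speedup below one, which is the "paying for two models and getting the output of one" case in numeric form. Measuring alpha on real traffic and feeding it into a calculation like this is one way to replace intuition with a number.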

The useful consequence of seeing these cases as one pattern is that the calibration lessons transfer. Acceptance rate in speculative decoding, compilation pass rate in code generation, and verifier selection rate in reasoning are the same quantity wearing different labels. A team that has learned how sensitive token-level speedup is to draft-target alignment already knows something about why a code agent that drafts with one model and verifies with mismatched tests will stall. The drafter and the checker have to agree often enough, on the right things, for the arrangement to be worth its overhead.

The Verifier Is the Weak Point

The pattern is only as trustworthy as its verifier, and verifiers are not all equally trustworthy. A deterministic check is the strongest kind. A compiler, a type system, a test suite, or a rejection sampling rule that provably preserves a distribution gives a hard signal that is external to the model doing the guessing. The speculative decoding results are reassuring precisely because their verification step is a mathematical guarantee about the output distribution rather than a judgment call ^[2].

The trouble starts when the verifier is itself a language model. Using a strong model as a judge can approximate human preference well, reaching over 80 percent agreement with human raters in one widely cited study, but the same work documents the failure modes that come with it, including position bias, verbosity bias, and a self-enhancement bias where a model tends to favor outputs that resemble its own ^[8]. A verifier that prefers answers shaped like its own guesses is a weak check on a drafter from the same model family, since the two share blind spots. The risk compounds when the verifier and the drafter are the same model asked to grade itself. Research on reasoning found that models often fail to correct their own answers without external feedback, and that performance sometimes degrades after a self-correction pass ^[9]. The draft-then-verify pattern inherits that finding directly. If the verification step is just the generator in a more skeptical voice, it may not be catching much.
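One commonly used mitigation for position bias is to ask the judge twice with the answers swapped and to trust only verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that returns "first" or "second". It addresses position bias only and does nothing about verbosity or self-enhancement bias.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Return "a", "b", or "inconsistent" after judging in both orders.

    judge(prompt, first, second) -> "first" or "second"
    The judge is a placeholder for a call to a language model.
    """
    verdict_ab = judge(prompt, answer_a, answer_b)
    verdict_ba = judge(prompt, answer_b, answer_a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    # The verdict flipped with the ordering, so it is not trustworthy.
    return "inconsistent"
```

A judge that always prefers whichever answer comes first is exposed immediately, since both calls return "first". A judge that always prefers the longer answer passes this check untouched, which is a reminder that each bias needs its own test.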

A person reviewing and signing layered documents, analogous to the verifier acting as the gate that decides whether a cheaply produced draft is accepted

A signature on a reviewed document is a gate, and the value of the gate depends entirely on whether the reviewer can actually see the errors. The same is true for any system built on this pattern. The choice of verifier may be the most consequential design decision in the whole arrangement, more consequential than the choice of drafter, because a fast drafter paired with a weak verifier produces fast output that no one should trust, while a modest drafter paired with a hard external check produces output a team can stand behind.

What This Suggests

Beyond the choice of drafter and the choice of verifier, there is a third decision, which is what to do when a draft is rejected, and it is usually made implicitly. The options include discarding and regenerating, regenerating with the rejection as feedback, falling back to the expensive path for that one request, or escalating to a person. Each carries a different cost and quality profile, and each is the kind of routing decision worth logging and tuning rather than hardcoding, in the same way other branch points in a model serving stack get instrumented.
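A rejection policy can be made explicit and countable with very little code. The sketch below is one possible arrangement under assumptions of my own: the action names, the retry limit, and the high_stakes flag are all hypothetical, and a real system would tune them from logged outcomes.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    REGENERATE = "regenerate"
    REGENERATE_WITH_FEEDBACK = "regenerate_with_feedback"
    FALLBACK = "fallback_to_expensive_path"
    ESCALATE = "escalate_to_person"


def choose_action(attempt, has_feedback, high_stakes, max_retries=2):
    """Pick what to do after the attempt-th draft (1-based) was rejected."""
    if attempt <= max_retries:
        # Still within the retry budget: reuse the rejection if it says why.
        if has_feedback:
            return Action.REGENERATE_WITH_FEEDBACK
        return Action.REGENERATE
    # Retry budget spent: stop drafting for this request.
    return Action.ESCALATE if high_stakes else Action.FALLBACK


def handle_rejection(attempt, has_feedback, high_stakes, log):
    """Choose an action and record it so the policy can be tuned later."""
    action = choose_action(attempt, has_feedback, high_stakes)
    log[action.value] += 1
    return action
```

Passing a Counter as the log gives a running tally of how often each branch fires, which is the minimum instrumentation needed to tell whether the retry limit is set sensibly.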

The honest summary is that many teams are already running this pattern in more than one place without recognizing it as one pattern. A serving team tunes speculative decoding acceptance rates. A coding-agent team tunes how often generated code passes its tests. A retrieval team tunes how many candidates the reader has to sift. These are the same problem, which means the calibration tooling, the verifier-quality cautions, and the rejection-handling policies could be shared rather than rebuilt three times. The pattern is simple to state, which is to guess cheaply, check with something more trustworthy, and keep the work only if it survives the check. The engineering judgment lives almost entirely in how cheap the guess really is and how much the check can actually be trusted.

References

[1] Y. Leviathan, M. Kalman, and Y. Matias, "Fast Inference from Transformers via Speculative Decoding," in Proc. International Conference on Machine Learning (ICML), 2023.
[2] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating Large Language Model Decoding with Speculative Sampling," arXiv, 2023.
[3] H. Xia et al., "Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding," in Findings of the Association for Computational Linguistics (ACL), 2024.
[4] T. Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," arXiv, 2024.
[5] Y. Li, F. Wei, C. Zhang, and H. Zhang, "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty," arXiv, 2024.
[6] P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Proc. 34th Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 2020.
[7] K. Cobbe et al., "Training Verifiers to Solve Math Word Problems," arXiv, 2021.
[8] L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Proc. NeurIPS Datasets and Benchmarks Track, 2023.
[9] J. Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet," in Proc. International Conference on Learning Representations (ICLR), 2024.
