Mark Williams
Jun 4, 2026
AI Evaluation
[Image: An antique brass balance scale in a museum display case]

Two scales calibrated against the same reference weight will agree with each other almost perfectly. If that reference is off, they will also be wrong by the same amount, in the same direction, every time. That agreement is not evidence of accuracy. It is evidence of shared calibration. Something structurally similar happens when one language model is used to grade another. The judge and the model under evaluation are usually trained on overlapping slices of the same public text, share architectural lineage and training objectives, and often descend from the same base model.

The arrangement is now standard practice. A second model reads the first model's output and scores it for quality, relevance, safety, or correctness, and that score gates a release or fills a dashboard. The appeal is plain, since human review is slow and expensive, and a model that agrees with human raters most of the time reads like a cheap stand-in. One widely cited study found that a strong model used as a judge reaches over 80 percent agreement with human annotators, roughly the rate at which humans agree with one another [1]. Surveys now treat the practice, usually called LLM-as-a-judge, as a routine evaluation tool rather than an experiment [2].

The question worth sitting with is what that agreement number actually establishes. When two systems learn from the same data, their errors are not independent. They tend to break in the same places, on the same kinds of inputs, for the same reasons. An evaluation built on that arrangement measures something real, but it may be measuring how well two correlated models agree rather than the quality a user would perceive. A judge that shares a blind spot with the model it scores cannot see into that blind spot any better than the model can.
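The effect of a shared blind spot on an agreement number can be illustrated with a toy simulation. The sketch below is an illustration under invented assumptions, not a model of any real judge: `simulate`, `shared_error`, and `private_error` are hypothetical names, and the error rates are made up for the example.

```python
import random


def simulate(n_items=10_000, shared_error=0.15, private_error=0.05, seed=0):
    """Simulate two binary graders that inherit the same blind spot.

    Assumption: with probability `shared_error` an item falls inside a blind
    spot both graders share, and both get it wrong. Outside the blind spot,
    each grader errs independently with probability `private_error`.
    """
    rng = random.Random(seed)
    agree = a_correct = b_correct = 0
    for _ in range(n_items):
        truth = rng.random() < 0.5
        if rng.random() < shared_error:
            # Shared blind spot: both graders make the same mistake.
            a = b = not truth
        else:
            # Independent mistakes outside the blind spot.
            a = truth if rng.random() >= private_error else not truth
            b = truth if rng.random() >= private_error else not truth
        agree += a == b
        a_correct += a == truth
        b_correct += b == truth
    return {
        "agreement": agree / n_items,
        "accuracy_a": a_correct / n_items,
        "accuracy_b": b_correct / n_items,
    }


if __name__ == "__main__":
    print(simulate())
```

With these made-up rates, the two graders agree on roughly 92 percent of items in expectation while each is correct on only about 81 percent. Two graders with the same 81 percent accuracy but fully independent errors would agree about 69 percent of the time, so the extra agreement comes from the shared blind spot rather than from accuracy.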

What Agreement Does and Does Not Establish

High agreement with human raters is necessary for a judge to be useful, but it is not sufficient to trust the judge as a measurement instrument. The same study that reported strong agreement also documented the failure modes that come with model judges: position bias, where the judge favors an answer because of where it appears, typically the first position; verbosity bias, where longer answers score higher regardless of content; and self-enhancement bias, where a model rates outputs that resemble its own more generously than a human would [1]. A judge can hit a high agreement rate on average while still being systematically wrong on the cases that matter most.
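Position bias is one of the easier failure modes to measure directly: show the judge the same pair twice with the order swapped and count how often the verdict changes. The sketch below assumes a hypothetical pairwise judge callable that returns "first" or "second"; that interface, and the two toy judges used to exercise it, are illustrative and not any particular library's API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, items: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, answer_a, answer_b) items whose winner changes
    when the two answers are presented in the opposite order."""
    if not items:
        raise ValueError("items must be non-empty")
    flips = 0
    for prompt, answer_a, answer_b in items:
        forward = judge(prompt, answer_a, answer_b)   # A shown first
        backward = judge(prompt, answer_b, answer_a)  # B shown first
        winner_forward = "A" if forward == "first" else "B"
        winner_backward = "B" if backward == "first" else "A"
        flips += winner_forward != winner_backward
    return flips / len(items)


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"
```

A flip rate near zero only rules out position bias. The `longer_wins` toy judge is perfectly consistent under swapping when the answers differ in length, yet it is still grading on length, so verbosity and self-enhancement bias need their own checks against trusted labels.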

A 2025 analysis put a formal boundary around this concern. When the judge is no more accurate than the model it evaluates, no debiasing method can reduce the number of ground truth labels needed by more than half, and a high rate of agreement does not, on its own, limit how far a biased judge can distort a comparison between models [5]. Ground truth here means labels a team actually trusts, typically careful human judgments. The result speaks directly to the case where the goal is to evaluate a model that is as strong as or stronger than the judge, which is exactly the situation when a frontier system is being assessed. The headline agreement statistic and the trustworthiness of the ranking are not the same quantity, and one can be high while the other is low.
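To make "debiasing" concrete, one common pattern is a difference estimator in the style of prediction-powered inference: score a large unlabeled pool with the judge, then correct the average using the gap between judge and human scores on a small labeled sample. The sketch below is a minimal illustration of that general idea, with hypothetical function and parameter names; it is not claimed to be the exact procedure analyzed in [5].

```python
from statistics import mean
from typing import Sequence


def debiased_mean(
    judge_scores_unlabeled: Sequence[float],
    judge_scores_labeled: Sequence[float],
    human_labels: Sequence[float],
) -> float:
    """Estimate the true mean quality from many judge scores plus a small
    human-labeled sample.

    The correction term is the average (human - judge) gap measured on the
    labeled sample. It removes the judge's average bias, but its own noise
    shrinks only as the labeled sample grows.
    """
    if len(judge_scores_labeled) != len(human_labels):
        raise ValueError("labeled judge scores and human labels must align")
    if not judge_scores_unlabeled or not human_labels:
        raise ValueError("both pools must be non-empty")
    correction = mean([h - j for h, j in zip(human_labels, judge_scores_labeled)])
    return mean(judge_scores_unlabeled) + correction
```

The design point is that the trusted labels do the correcting. The judge only supplies cheap volume, which is why the size of the labeled pool still sets the floor on how precise the final estimate can be.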

Why the Errors Line Up

The reason the errors correlate is that the judge and the evaluated model are drawing on the same priors. Research on self-evaluation found that models such as GPT-4 and Llama 2 can recognize their own outputs out of the box at non-trivial accuracy, and that the strength of a model's preference for its own outputs is linearly correlated with how well it can recognize them [3]. A judge that can tell which text it would have written, and rewards that text, is not grading on the merits. It is grading on familiarity.

[Image: A close-up of a guitar headstock with tuning pegs]

A tuner does not have an opinion about whether a note sounds pleasant. It reports one fixed reference frequency, and the string is turned until it matches. Tune one guitar to a second guitar that happens to sit slightly flat, and the two will sound in tune with each other while both sit below true pitch. The reference has to come from outside the pair, and that independence is what a model judge lacks.

A study probing the mechanism behind self-preference offers a cleaner statement of the problem. Compared with human raters, model judges assign higher scores to text with lower perplexity, regardless of who wrote that text [4]. Perplexity is a measure of how surprised a model is by a passage, or put another way, how likely the model would have been to produce it. Low perplexity means the text sits comfortably inside what the model already expects. So a judge favors the answers it finds familiar, and the answers it finds familiar are the ones shaped like its own training distribution, which is the distribution the evaluated model was largely trained on too. The bias is not a quirk of one model recognizing itself. It is a pull toward the shared center of the data both models came from.
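For readers who want the definition behind the word, perplexity is the exponential of the average negative log-probability a model assigns to each token of a passage. The sketch below computes it from a list of per-token natural-log probabilities. How those log-probabilities are obtained depends on the model API and is left as an assumption here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a passage given the natural-log probability the model
    assigned to each of its tokens: exp(-(1/N) * sum(log p_i))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    familiar = [math.log(0.5)] * 4    # model gave every token probability 0.5
    surprising = [math.log(0.1)] * 4  # model gave every token probability 0.1
    print(perplexity(familiar), perplexity(surprising))
```

A passage whose tokens each received probability 0.5 has a perplexity of about 2, while one whose tokens each received 0.1 has a perplexity of about 10. The first is the kind of text the finding in [4] suggests a judge will tend to score more generously than a human would.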

There is a related finding that sharpens the worry. Work on reasoning showed that models often fail to correct their own answers without external feedback, and that accuracy sometimes drops after a self-correction pass [7]. Asking a model to grade another model from the same family is close to asking it to self-correct. If the generator missed something because its priors pointed the wrong way, a judge built on the same priors is likely to miss it for the same reason.

Golden Sets Have a Shelf Life

Many teams anchor their evaluation in a golden test set, a fixed collection of inputs paired with answers a human curated and trusts. The assumption is that the set is a stable yardstick. That assumption weakens over time, and data contamination is the main reason. Contamination is the presence of test examples in a model's pre-training data, which lets the model score well by partial recall rather than by the capability the test was meant to probe. A method for detecting it, built on prompting a model to complete withheld portions of known examples, found that GPT-4 had been exposed to several standard datasets, including AG News, WNLI, and XSum, and reported detection accuracy between 92 and 100 percent against expert review [6].
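The completion-based idea can be sketched in a few lines: show the model the first part of a known test example, ask it to continue, and measure how closely the continuation matches the withheld part. The version below is a simplified illustration only. The `complete` callable is a hypothetical stand-in for a model call, and the word-level overlap from Python's `difflib` is a swapped-in similarity measure, not the scoring used in [6].

```python
from difflib import SequenceMatcher
from typing import Callable


def completion_overlap(
    complete: Callable[[str], str],
    example: str,
    split_ratio: float = 0.5,
) -> float:
    """Return a 0..1 word-overlap score between the withheld tail of a known
    example and the model's continuation of its head.

    Scores near 1 on many examples from the same dataset are a hint that the
    model may have seen the dataset, not proof of it.
    """
    words = example.split()
    if len(words) < 2:
        raise ValueError("example needs at least two words to split")
    # Keep at least one word on each side of the cut.
    cut = max(1, min(len(words) - 1, int(len(words) * split_ratio)))
    prefix = " ".join(words[:cut])
    withheld = words[cut:]
    predicted = complete(prefix).split()
    return SequenceMatcher(None, withheld, predicted).ratio()
```

In practice the score would be compared against a baseline, for example the overlap obtained on freshly written examples the model cannot have seen, because some overlap arises from ordinary fluency alone.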

The practical consequence is that a golden set has a shelf life tied to model release cycles. A set that was clean when it was written can quietly become contaminated once it has been published long enough to be swept into the next pre-training run. When the base model version under the judge changes, the evaluator's priors shift with it, and scores on a familiar set drift for reasons that have nothing to do with the system being tested. This is why it helps to hold part of the golden data outside the model's training window, using freshly authored examples that have never been posted publicly, so at least one slice of the evaluation is measuring capability rather than memory.
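One lightweight way to act on this is to track, for every golden example, whether and when it first became public, and to split the set against an assumed training cutoff. The sketch below uses hypothetical names (`GoldenExample`, `first_public_on`, `split_by_exposure`), and the cutoff date is an assumption a team would have to supply, since exact training windows are often not disclosed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    first_public_on: Optional[date]  # None means never posted publicly


def split_by_exposure(
    examples: Sequence[GoldenExample],
    assumed_training_cutoff: date,
) -> Tuple[List[GoldenExample], List[GoldenExample]]:
    """Split a golden set into (held_out, possibly_exposed).

    An example counts as possibly exposed if it was public on or before the
    assumed cutoff. Examples published later are held out for this model
    version only and may be swept into a later pre-training run.
    """
    held_out: List[GoldenExample] = []
    possibly_exposed: List[GoldenExample] = []
    for ex in examples:
        seen = ex.first_public_on is not None and ex.first_public_on <= assumed_training_cutoff
        (possibly_exposed if seen else held_out).append(ex)
    return held_out, possibly_exposed
```

Reporting scores separately for the two slices makes it visible when a gain shows up only on the possibly exposed examples.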

What a Production Eval Architecture Looks Like

Treating the model judge as one layer rather than the whole evaluation is what keeps these failure modes contained. A workable architecture tends to have three layers, ordered from the cheapest and most mechanical to the most expensive and most dependent on human judgment.

[Image: Flour being sifted through a fine sieve into a bowl]

Sifting works because the coarse screen runs first and catches the obvious lumps cheaply, so only what passes through reaches the finer, slower stages. An evaluation harness benefits from the same ordering.

The first layer is deterministic. Schema validation, exact-match checks, unit tests, regular expressions, and other hard rules catch a large share of failures at almost no cost, and they carry no correlated-error risk because a passing test is an external fact rather than another model's opinion.

The second layer is model-based scoring, reserved for the dimensions that resist hard rules, such as tone, helpfulness, or faithfulness to a source. This is where the safeguards matter. Drawing the judge from a different model family than the one under test reduces shared blind spots, randomizing answer order blunts position bias, and hiding which system produced an output limits self-recognition effects [2].

The third layer is a periodic human-labeled sample, small but regular, that serves as the ground truth the other two layers are calibrated against.
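The three layers can be sketched as a small pipeline. Everything named here is hypothetical: the JSON schema with an "answer" field, the template-placeholder rule, the pairwise judge interface returning "first" or "second", and the function names are assumptions chosen to keep the example short, not a prescribed design.

```python
import json
import random
import re
from typing import Callable, Dict, List, Optional

Judge = Callable[[str, str, str], str]  # hypothetical: returns "first" or "second"


def deterministic_checks(raw_output: str) -> Optional[str]:
    """Layer 1: hard rules. Return a failure reason, or None if all pass."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return "missing string field 'answer'"
    if not payload["answer"].strip():
        return "empty answer"
    if re.search(r"\{\{.*?\}\}", payload["answer"]):
        return "unfilled template placeholder"
    return None


def blind_pairwise(judge: Judge, prompt: str, candidate: str, baseline: str,
                   rng: random.Random) -> bool:
    """Layer 2: the judge sees two unlabeled answers in random order.
    Returns True if the candidate is preferred."""
    candidate_first = rng.random() < 0.5
    first, second = (candidate, baseline) if candidate_first else (baseline, candidate)
    verdict = judge(prompt, first, second)
    return (verdict == "first") == candidate_first


def evaluate(raw_output: str, prompt: str, baseline: str, judge: Judge,
             rng: random.Random) -> Dict[str, object]:
    """Run the cheap layer first and call the judge only on survivors."""
    failure = deterministic_checks(raw_output)
    if failure is not None:
        return {"passed": False, "stage": "deterministic", "reason": failure}
    answer = json.loads(raw_output)["answer"]
    preferred = blind_pairwise(judge, prompt, answer, baseline, rng)
    return {"passed": preferred, "stage": "judge", "reason": None}


def sample_for_human_review(records: List[dict], k: int, rng: random.Random) -> List[dict]:
    """Layer 3: draw a small random sample of scored records for human labeling."""
    return rng.sample(records, min(k, len(records)))
```

Because the deterministic layer returns before the judge is called, malformed outputs never consume judge calls, and the judge never learns which answer came from the system under test.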

Detecting evaluator drift is the part teams most often skip. A simple practice is to keep a frozen reference set of outputs with settled scores and re-run the judge against it whenever the judge model or its version changes. If the scores move while the outputs have not, the judge has drifted, and any trend measured across that boundary is suspect. The same frozen set surfaces the slow contamination problem, since a judge that suddenly finds familiar examples easier is telling on itself. None of this removes the value of a model judge. The formal result on frontier evaluation is a reminder of the ceiling rather than a reason to abandon the tool, since debiasing against a modest pool of trusted labels still helps, just not without limit [5].
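The frozen-reference check is simple enough to automate. The sketch below assumes a hypothetical `judge_score` callable that maps an output text to a numeric score, and a tolerance value that is a placeholder a team would set from the judge's normal run-to-run variation.

```python
from typing import Callable, Dict, Sequence, Tuple


def judge_drift(
    frozen_set: Sequence[Tuple[str, str, float]],
    judge_score: Callable[[str], float],
    tolerance: float = 0.05,
) -> Dict[str, object]:
    """Re-score a frozen set of (output_id, text, settled_score) items and
    report how far the current judge has moved from the settled scores."""
    if not frozen_set:
        raise ValueError("frozen_set must be non-empty")
    # Positive delta: the current judge is more generous than the settled score.
    deltas = {oid: judge_score(text) - settled for oid, text, settled in frozen_set}
    mean_shift = sum(deltas.values()) / len(deltas)
    worst_id = max(deltas, key=lambda oid: abs(deltas[oid]))
    max_abs_shift = abs(deltas[worst_id])
    return {
        "mean_shift": mean_shift,
        "max_abs_shift": max_abs_shift,
        "worst_id": worst_id,
        # The outputs are frozen, so any movement beyond tolerance is the judge's.
        "drifted": max_abs_shift > tolerance,
    }
```

Running this whenever the judge model or its version changes, and storing the report next to the scores, makes it possible to tell afterward which trend lines cross a drift boundary.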

What This Suggests

The evidence points toward a modest reframing rather than a rejection of model judges. An LLM judge is a useful, scalable signal, but it is not the same kind of instrument as a compiler, a passing test, or a held-out human label, because it shares its priors with the thing it measures. The cases where that matters most are the ones a team most wants to get right: the novel inputs, the stronger model, and the subtle failure that sits inside a blind spot both models inherited from the same data. Layering the judge behind deterministic checks, anchoring it to a human sample, watching it for drift, and holding some evaluation data outside the training window are not heavy additions. They are what turns a mirror back into a measurement.

References

  1. L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," in Proc. NeurIPS Datasets and Benchmarks Track, 2023.
  2. J. Gu et al., "A Survey on LLM-as-a-Judge," arXiv, 2024.
  3. A. Panickssery, S. R. Bowman, and S. Feng, "LLM Evaluators Recognize and Favor Their Own Generations," in Proc. 38th Conf. Neural Inf. Process. Syst. (NeurIPS), 2024.
  4. K. Wataoka, T. Takahashi, and R. Ri, "Self-Preference Bias in LLM-as-a-Judge," arXiv, 2024.
  5. F. Dorner, V. Nastl, and M. Hardt, "Limits to Scalable Evaluation at the Frontier: LLM as Judge Won't Beat Twice the Data," in Proc. International Conference on Learning Representations (ICLR), 2025.
  6. S. Golchin and M. Surdeanu, "Time Travel in LLMs: Tracing Data Contamination in Large Language Models," in Proc. International Conference on Learning Representations (ICLR), 2024.
  7. J. Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet," in Proc. International Conference on Learning Representations (ICLR), 2024.
