A team ships a model update. The evaluation suite runs, every check comes back green, the automated quality score clears its threshold. A week later the support queue is filling and the usage curve is bending the wrong way. The usual explanations for this gap are by now well documented, and most of them blame the test. The grader might share the model's blind spots, the problem traced in Grading on a Shared Curve. The single score might hide how much the system wobbles from one run to the next, the subject of The Capability-Reliability Split in Agent Systems. Both are stories about a flawed instrument.
Set them aside for a moment. Suppose the eval is clean, stable, free of contamination, and scored honestly. A green dashboard can still sit above unhappy users for a reason that has nothing to do with the test being broken. The dashboard measures the model, by itself, on a fixed set of inputs, at one point in time. The user lives with the whole system, in back-and-forth, over weeks. The model, the human-AI team, and the running experience are three different things, and an offline eval only ever sees the first. Three fields have been living with this exact gap for years: recommender systems, human-computer interaction, and online experimentation. They offer both evidence and vocabulary worth borrowing.
The Soloist and the Duet
A clean recording of a soloist says very little about how the duet will sound. The thing a user experiences is the duet, the model and the person working through a task together, and that joint performance is what an eval of the model alone leaves out. Bansal and colleagues studied this directly with an AI whose accuracy was comparable to the humans it assisted, asking whether the pair could reach complementary performance, where the team is more accurate than either the person or the model working solo [1]. Adding the AI helped. Adding explanations of the AI's reasoning, the feature most often assumed to help further, did not increase complementarity. Instead, explanations raised the chance that people accepted the AI's recommendation regardless of whether it was correct.
That last detail is the uncomfortable one. A model can post a higher solo score and leave the team no better off, and a change meant to improve the experience can make the team worse by encouraging misplaced trust. The dashboard grades the soloist. The user hears the duet, and the duet has its own failure modes that the solo score cannot register.
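One way to put the duet on the dashboard is to log four answers per task and compare them. The sketch below is an illustration, not the protocol Bansal and colleagues used: the `Trial` record, its field names, and the six sample rows are all invented for the example. It computes solo and team accuracy, checks for complementary performance, and separates how often people followed the AI when it was right from how often they followed it when it was wrong.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Trial:
    """One task instance. Hypothetical log schema, invented for this sketch."""
    truth: str        # ground-truth answer
    human_alone: str  # the person's answer without the AI
    ai_alone: str     # the model's recommendation
    team: str         # the person's final answer after seeing the AI


def match_rate(pairs: Iterable[tuple]) -> Optional[float]:
    """Fraction of (x, y) pairs where x == y, or None if there are no pairs."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


def team_report(trials: list) -> dict:
    if not trials:
        raise ValueError("need at least one trial")
    human = match_rate((t.human_alone, t.truth) for t in trials)
    ai = match_rate((t.ai_alone, t.truth) for t in trials)
    team = match_rate((t.team, t.truth) for t in trials)
    ai_right = [t for t in trials if t.ai_alone == t.truth]
    ai_wrong = [t for t in trials if t.ai_alone != t.truth]
    return {
        "human": human,
        "ai": ai,
        "team": team,
        # Complementary performance: the pair beats both soloists.
        "complementary": team > max(human, ai),
        # Reliance, split by whether the AI deserved it.
        "follow_when_ai_right": match_rate((t.team, t.ai_alone) for t in ai_right),
        "follow_when_ai_wrong": match_rate((t.team, t.ai_alone) for t in ai_wrong),
    }


# Made-up sample data: (truth, human alone, AI alone, team).
trials = [
    Trial("a", "a", "a", "a"),
    Trial("a", "b", "a", "a"),
    Trial("b", "b", "a", "b"),
    Trial("b", "b", "b", "b"),
    Trial("a", "a", "b", "b"),  # the person abandoned a correct answer
    Trial("b", "a", "b", "b"),
]
report = team_report(trials)
print(report)
```

A high follow rate when the AI is wrong is the misplaced trust described above, and it can rise even while the model's solo score improves.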
Offline Wins Do Not Always Survive Contact
Recommender systems have run this comparison for real, with users instead of held-out data. Rossetti, Stella, and Zanker used a within-users design, the same people in both settings, to compare how algorithms ranked on offline accuracy with how they ranked in an online, user-centric study [2]. The two rankings contradicted each other. The algorithm that looked best on the offline metric was significantly worse online at producing recommendations users actually found useful, meaning both relevant and novel. The authors put it plainly: the external validity of the most common offline evaluation method is not guaranteed.
That is roughly a decade of green dashboards over unhappy users, documented in a field that depends on getting recommendations right. The practical reading is modest and useful. An offline eval is a screen, a cheap way to decide which candidates deserve a more expensive test. It is not the verdict. Treating the offline ranking as the answer is how a system clears every check and still disappoints the people it was built for.
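A simple check for whether the offline screen and the online test agree is to compare the two rankings directly. The sketch below computes a Kendall rank correlation by hand over candidates scored in both settings. The candidate names and all scores are invented for illustration and are not taken from the Rossetti study; ties are counted as neither agreement nor disagreement, which is a simplifying assumption.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall tau-a over the candidates present in both score tables.

    +1 means the two rankings agree on every pair, -1 means they
    disagree on every pair. Tied pairs add nothing to either side.
    """
    keys = sorted(scores_a.keys() & scores_b.keys())
    if len(keys) < 2:
        raise ValueError("need at least two shared candidates")
    concordant = discordant = 0
    for x, y in combinations(keys, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(keys) * (len(keys) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical numbers: an offline accuracy metric and an online
# "users found this useful" rate for three candidate algorithms.
offline = {"algo_a": 0.42, "algo_b": 0.38, "algo_c": 0.31}
online = {"algo_a": 0.20, "algo_b": 0.35, "algo_c": 0.30}

tau = kendall_tau(offline, online)
offline_best = max(offline, key=offline.get)
online_best = max(online, key=online.get)
print(f"tau={tau:.2f}, offline pick={offline_best}, online pick={online_best}")
```

A correlation near zero or below it says the offline ranking is not predicting the online one for this set of candidates, which is the case for treating it as a screen and letting the online result decide.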
The Proxy You Reach For Online Bites Back
The obvious response is to stop measuring proxies and measure users. The catch is that online measurement is also a proxy. What gets logged is behavior, clicks, session length, a thumbs up, whether someone came back, and behavior is not the same as value. Kleinberg, Mullainathan, and Raghavan modeled what happens when a well-meaning platform optimizes engagement while users hold inconsistent preferences, wanting one thing in the moment and another on reflection [3]. Engagement and genuine utility can pull apart. Their image for it is chips and salad. For some content, the fact that people consume more of it tracks real value, the way the most-watched calculus tutorial may simply be the best one. For other content, consuming more signals a pull people would not endorse on reflection, the way a bag of chips disappears. Users can spend long sessions and get little from them, and a change can lift engagement for a while before they quit abruptly.
Time makes this harder to read. Drawing from online experiments, Kohavi and colleagues catalog why short experiments mislead, including novelty and primacy effects, where a change looks like a win for the first couple of weeks mainly because it is new, then fades once the novelty wears off [4]. Their central advice is to choose an Overall Evaluation Criterion, a metric deliberately tied to long-term value rather than a short-term bump, and to run experiments long enough to see past the early effect. A green two-week A/B test and a satisfied user a quarter later are different claims, and only one of them is on the dashboard.
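A rough way to look for a novelty effect is to break the experiment's lift out by week instead of reporting one pooled number. The sketch below does only that. The weekly rates are invented, the "late lift is under half the early lift" rule is an arbitrary threshold chosen for the example, and nothing here is a significance test, so treat the flag as a prompt to keep the experiment running and not as a conclusion.

```python
from statistics import mean


def weekly_lift(control: list, treatment: list) -> list:
    """Relative lift of treatment over control, one value per week."""
    if len(control) != len(treatment):
        raise ValueError("control and treatment need the same number of weeks")
    return [(t - c) / c for c, t in zip(control, treatment)]


# Hypothetical weekly success rates for a six-week A/B test.
control = [0.100, 0.100, 0.100, 0.100, 0.100, 0.100]
treatment = [0.112, 0.108, 0.104, 0.101, 0.100, 0.100]

lifts = weekly_lift(control, treatment)
early = mean(lifts[:2])   # first two weeks
late = mean(lifts[-2:])   # last two weeks

# Assumed heuristic: a positive early lift that has mostly faded.
novelty_suspected = early > 0 and late < early / 2

print([round(x, 3) for x in lifts])
print(f"early={early:.3f} late={late:.3f} novelty_suspected={novelty_suspected}")
```

Read at week two, this made-up experiment looks like a clear win. Read at week six, the lift is gone, which is the difference between the two claims above.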
Some of "Works" Was Never Going Into a Test
Underneath all of this sits a limit that no amount of better instrumentation removes. Ackerman, in a foundational paper on computer-supported work, named the socio-technical gap, the divide between what a system needs to support socially and what it can actually be built to support technically, because human activity is flexible, nuanced, and context-dependent while technical mechanisms stay rigid [5]. For an AI product, part of whether it works for a particular person lives in context an eval cannot enumerate, the user's unstated goal, how much weight they place on a confidently wrong answer, the stakes of the moment, the social setting the output lands in. Some of that is not in the test set, and some of it could not be put there.
An empty diner sharpens the point. The bare booths record that people left, not why they left. Dissatisfied users mostly do not file tickets; they simply stop coming back, so the feedback that would correct the dashboard is filtered by who chose to stay. The dashboard hears only from the survivors, the population least likely to report the problem that drove everyone else away, and infers contentment from their presence.
Measuring the Team and the Experience
None of this argues against evals, and none of it asks for a single new number. It argues for widening what the dashboard is allowed to count as evidence. Scoring the human-AI team rather than the model alone, by tracking whether the pair outperforms either side solo and whether reliance on the model is appropriate rather than automatic, brings the duet back into view [1]. Treating an offline eval as a screen that earns a candidate the right to an online test, and letting the online result rather than the offline ranking make the call, respects what recommender systems learned the hard way [2]. Choosing a long-term evaluation criterion, watching for behavioral proxies that drift away from it, and running experiments past the novelty window keep the online dashboard from telling a flattering short story [3][4]. And instrumenting the interaction itself, recovery after a bad turn, escalation rate, repeat use, numbers broken out by user segment rather than averaged into one cheerful figure, narrows the part of the gap that can be narrowed, while a human channel stays open for the part that cannot [5].
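The point about breaking numbers out by segment is easy to see with a toy calculation. In the sketch below, the segment names, the repeat-use flag, and the counts are all made up for illustration. The function computes one rate per segment so it can be set beside the pooled average.

```python
from collections import defaultdict


def rate_by_segment(rows: list) -> dict:
    """Map each segment to the fraction of its rows where the flag is True."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [hits, count]
    for segment, flag in rows:
        totals[segment][0] += int(flag)
        totals[segment][1] += 1
    return {segment: hits / count for segment, (hits, count) in totals.items()}


# Hypothetical log rows: (user segment, did the user come back?).
rows = (
    [("power_user", True)] * 45
    + [("power_user", False)] * 5
    + [("new_user", True)] * 3
    + [("new_user", False)] * 7
)

overall = sum(flag for _, flag in rows) / len(rows)
by_segment = rate_by_segment(rows)

print(f"overall repeat use: {overall:.2f}")
for segment, rate in by_segment.items():
    print(f"{segment}: {rate:.2f}")
```

With these invented counts the pooled figure is 0.80, while new users return at 0.30. The average is dominated by the segment that was already happy, so the group most at risk disappears from the single number.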
What the Green Actually Certifies
A green offline dashboard certifies something narrow, and stating it plainly is half the cure. The model did well on a fixed set of inputs, by itself, at one moment, measured by a quantity that may or may not be the one users feel. It does not certify that the human-AI team did well, that the result holds up against live traffic, that the proxy being optimized is the thing users actually value, or that the unmeasurable remainder is being handled at all. The distance between scoring a model and serving a person is where the unhappy users are, and that distance does not show up on a dashboard built to watch the model. Closing it starts with measuring the second thing, not assuming the first one stands in for it.
References
[1] G. Bansal et al., "Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance," in Proc. 2021 CHI Conference on Human Factors in Computing Systems (CHI '21), 2021.
[2] M. Rossetti, F. Stella, and M. Zanker, "Contrasting Offline and Online Results when Evaluating Recommendation Algorithms," in Proc. 10th ACM Conference on Recommender Systems (RecSys '16), 2016, pp. 31–34.
[3] J. Kleinberg, S. Mullainathan, and M. Raghavan, "The Challenge of Understanding What Users Want: Inconsistent Preferences and Engagement Optimization," Management Science, vol. 70, no. 9, pp. 6336–6355, 2024.
[4] R. Kohavi et al., "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained," in Proc. 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), 2012, pp. 786–794.
[5] M. S. Ackerman, "The Intellectual Challenge of CSCW: The Gap Between Social Requirements and Technical Feasibility," Human-Computer Interaction, vol. 15, no. 2-3, pp. 179–203, 2000.