Grading on a Shared Curve
Most teams now use one language model to score another, and a judge that agrees with human raters most of the time looks like a cheap substitute for review. The harder question is what that agreement establishes when the judge and the model it scores learned from the same data and tend to fail in the same places.
06/04/2026