ICML 2026 · Oral

Benchmarking at the Edge of Comprehension

Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

University of Oxford · Institute for Decentralized AI · Tsinghua University · KAUST · Microsoft


Models now solve problems faster than humans can verify them.

As a human, you can't check the whole proof, but you can check a claim about a specific part.

The benchmarker creates a question, the answerer solves it, and the benchmarker looks for errors. Any claimed error is sent to the judge.

Success rates determine the model's Elo.

Benchmarks saturate faster than we can build them

Benchmarking underpins nearly every claim of progress in large language models. Yet frontier models now acquire new capabilities faster than humans can author verified, discriminative problems. In a few short years, mathematical evaluation climbed from grade-school arithmetic to research-level questions. Each generation took more expertise to create, more effort to verify, and saturated sooner than the last.

Frontier-model accuracy on successive math benchmarks. Newer suites start harder, yet the time from release to saturation keeps shrinking. Points reflect reported frontier scores; AIME 2025 was effectively solved on arrival.

At some point, humans can no longer reliably generate frontier questions, supply trusted answers, evaluate complex solutions, or even assign meaningful difficulty. We call this the post-comprehension regime. If benchmarking becomes infeasible there, our ability to measure progress at all is at stake.

Falsification is cheaper than verification

Confirming that a fifty-page proof is entirely correct may be infeasible. But checking whether one specific alleged error is real remains tractable, because it requires only a local excerpt rather than the whole argument. We build evaluation on that asymmetry.

An answer is critique-resilient if no adversary can produce a verified witness of failure within a fixed budget. Correctness is redefined as resistance to falsification rather than agreement with a trusted reference. Humans remain essential, but in a narrower role: as bounded verifiers who adjudicate specific claims, such as “does this counterexample actually refute the lemma?”, rather than certifying entire solutions.
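The definition above can be read as a simple predicate. The following is a minimal sketch, not the paper's implementation: it assumes the budget is counted as a number of critique attempts and that the verifier is a callable returning whether one localized claim is upheld. The names `is_critique_resilient`, `verify`, and `budget` are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable


def is_critique_resilient(
    answer: str,
    critiques: Iterable[str],
    verify: Callable[[str, str], bool],
    budget: int,
) -> bool:
    """Return True if no critique within the budget is upheld.

    Assumption: the budget is the number of critiques the adversary may
    submit. The source only says "a fixed budget".
    """
    for critique in islice(critiques, budget):
        # The verifier only rules on this one localized claim,
        # never on the correctness of the whole answer.
        if verify(answer, critique):
            return False  # a verified witness of failure exists
    return True


# Toy verifier (illustrative only): a critique is upheld if the quoted
# faulty step literally appears in the answer.
toy_verify = lambda answer, critique: critique in answer
print(is_critique_resilient("1 + 1 = 2, so 2 + 2 = 5", ["2 + 2 = 5"], toy_verify, budget=1))
```

Note that resilience is relative to the adversary and the budget: the same answer can pass under a small budget and fail under a larger one.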

Evaluation as an adversarial game

A benchmarker authors a question and must answer it too, so that feasibility gating can filter ill-posed or unsolvable tasks. An answerer responds. The benchmarker then hunts for a flaw, and any localized claim of error is sent to adjudication, where a panel of models, escalating to a human on disagreement, rules on that single claim.
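One episode of this game can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: every function name and outcome label is hypothetical, feasibility gating is simplified to "the benchmarker must return an answer to its own question", and "disagreement" is taken to mean a non-unanimous panel.

```python
from typing import Callable, Optional, Sequence

Judge = Callable[[str], bool]  # True means "the claimed error is real"


def adjudicate(claim: str, panel: Sequence[Judge], human: Judge) -> bool:
    """Rule on a single localized claim of error."""
    votes = [judge(claim) for judge in panel]
    if votes and all(v == votes[0] for v in votes):
        return votes[0]   # unanimous panel: no escalation needed
    return human(claim)   # disagreement (or empty panel): escalate to a human


def run_episode(
    pose: Callable[[], str],
    benchmarker_solve: Callable[[str], Optional[str]],
    answerer_solve: Callable[[str], str],
    find_flaw: Callable[[str, str], Optional[str]],
    panel: Sequence[Judge],
    human: Judge,
) -> str:
    question = pose()
    # Simplified feasibility gate (assumption): drop the question if the
    # benchmarker cannot answer it itself.
    if benchmarker_solve(question) is None:
        return "discarded"
    answer = answerer_solve(question)
    claim = find_flaw(question, answer)
    if claim is None:
        return "answerer_wins"  # no error was alleged
    upheld = adjudicate(claim, panel, human)
    return "benchmarker_wins" if upheld else "answerer_wins"


# Toy usage with stand-in callables instead of real models.
outcome = run_episode(
    pose=lambda: "What is 2 + 2?",
    benchmarker_solve=lambda q: "4",
    answerer_solve=lambda q: "5",
    find_flaw=lambda q, a: None if a == "4" else f"'{a}' is not 2 + 2",
    panel=[lambda c: True, lambda c: True],
    human=lambda c: False,
)
print(outcome)  # benchmarker_wins
```

The human is consulted only inside `adjudicate`, and only on one claim at a time, which mirrors the bounded-verifier role described above.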

One episode of the protocol: a question is posed, an answer is given, a specific error is marked on it, and the resulting claim is adjudicated as upheld or rejected.

Because the benchmarker both poses questions and surfaces errors, benchmark design itself becomes a measurable capability, subject to explicit guardrails that keep questions well-posed.

Two capabilities, measured from outcomes

An itemized bipartite Bradley-Terry model turns win/loss episodes into two scores per model: answerer strength, which captures producing critique-resilient solutions, and benchmarker strength, which captures posing hard-but-solvable questions and catching errors. No human difficulty labels are required; difficulty is inferred from outcomes.
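To make the idea of two scores per model concrete, here is a minimal sketch of a plain bipartite Bradley-Terry fit. It is an assumption-laden simplification: it models the probability that answerer j survives benchmarker i as sigmoid(β_j − α_i), fits by gradient ascent with a small L2 penalty, and does not attempt to reproduce the itemized structure of the paper's model. The function names, hyperparameters, and the 1500 Elo offset are all hypothetical.

```python
import math


def fit_bipartite_bt(episodes, n_models, lr=0.1, steps=2000, l2=0.01):
    """Fit benchmarker strengths (alpha) and answerer strengths (beta).

    episodes: list of (benchmarker_idx, answerer_idx, answerer_won).
    Assumed model: P(answerer j wins against benchmarker i)
                   = sigmoid(beta[j] - alpha[i]).
    """
    alpha = [0.0] * n_models
    beta = [0.0] * n_models
    for _ in range(steps):
        # Gradient of the L2-penalized log-likelihood.
        g_alpha = [-l2 * a for a in alpha]
        g_beta = [-l2 * b for b in beta]
        for i, j, won in episodes:
            p = 1.0 / (1.0 + math.exp(-(beta[j] - alpha[i])))
            err = (1.0 if won else 0.0) - p
            g_beta[j] += err   # answerer wins push beta[j] up
            g_alpha[i] -= err  # answerer wins push alpha[i] down
        alpha = [a + lr * g for a, g in zip(alpha, g_alpha)]
        beta = [b + lr * g for b, g in zip(beta, g_beta)]
    return alpha, beta


def to_elo(theta, base=1500.0):
    """Map a logistic-scale strength to the conventional Elo scale."""
    return base + 400.0 / math.log(10.0) * theta


# Made-up outcomes for two models (not data from the paper):
# model 0 usually survives model 1's questions, and model 1 usually
# fails model 0's questions.
episodes = (
    [(1, 0, True)] * 8 + [(1, 0, False)] * 2
    + [(0, 1, True)] * 2 + [(0, 1, False)] * 8
)
alpha, beta = fit_bipartite_bt(episodes, n_models=2)
print([round(to_elo(b)) for b in beta], [round(to_elo(a)) for a in alpha])
```

No difficulty labels enter the fit: a question is "hard" only in the sense that answerers tend to lose on it, which is what the benchmarker parameter absorbs.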

Each model placed by Benchmarker Elo (α) and Answerer Elo (β). The two are correlated but distinct (Spearman ρ = 0.69).

Answerer strength behaves like a general capability measure. Benchmarker strength is meaningful only inside the protocol: a model that poses harder questions is not necessarily a better model in any broader sense.

Stable, externally valid, robust to weak judges

Answerer Elo vs. external benchmarks
Benchmark	Spearman ρ	Kendall τ
AIME 2025	0.851	0.706
BRUMO 2025	0.830	0.673
HMMT Feb 2025	0.819	0.662

Rank correlation between answerer Elo and mean score on each suite (bootstrapped).
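For readers unfamiliar with the two statistics in the table, the sketch below computes both from scratch. It is illustrative only: the input numbers are made up rather than taken from the paper, the Kendall variant shown is tau-a (which assumes no ties), and the bootstrapping mentioned in the caption is not reproduced. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            out[k] = avg
        pos = end + 1
    return out


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs; assumes no ties."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
    return score / len(pairs)


# Made-up example: five models, with the top two swapped on the benchmark.
elo = [1400, 1450, 1500, 1550, 1600]
benchmark_score = [0.10, 0.20, 0.30, 0.50, 0.40]
print(round(spearman_rho(elo, benchmark_score), 3))   # 0.9
print(round(kendall_tau_a(elo, benchmark_score), 3))  # 0.8
```

Both statistics depend only on the ordering of the models, so they are insensitive to the scale on which Elo or benchmark scores are reported.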

Answerer scores are stable under resampling (internal 5-fold accuracy of 0.922) and correlate with established human-designed benchmarks.

The most striking result concerns the verifier. When weak models such as GPT-3.5 and GPT-4o, far less capable than the systems they evaluate, replace humans as final adjudicators, the rankings barely change: rank correlations (ρ) with the original rankings fall between 0.98 and 1. Within the protocol, recognizing a specific error is far easier than producing a correct solution, which lets bounded verification scale as the gap between evaluator and evaluated widens.

The informal implication is: if GPT-3.5 can correctly benchmark GPT-5.2, human mathematicians can benchmark next-generation models.

BibTeX

@inproceedings{marro2026crb,
  title     = {Benchmarking at the Edge of Comprehension},
  author    = {Marro, Samuele and Yu, Jialin and La Malfa, Emanuele and
               Deb, Oishi and Li, Jiawei and Yang, Yibo and Abraham, Ebey and
               Sengupta, Sunando and Sommerlade, Eric and Wooldridge, Michael
               and Torr, Philip},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}