Benchmarking underpins nearly every claim of progress in large language models. Yet frontier models now acquire new capabilities faster than humans can author verified, discriminative problems. In a few short years, mathematical evaluation climbed from grade-school arithmetic to research-level questions. Each generation took more expertise to create, more effort to verify, and saturated sooner than the last.
We use the term post-comprehension regime for the point at which humans can no longer reliably generate frontier questions, supply trusted answers, evaluate complex solutions, or even assign meaningful difficulty. If benchmarking becomes infeasible there, our ability to measure progress at all is at stake.
Confirming that a fifty-page proof is entirely correct may be infeasible. But checking whether one specific alleged error is real remains tractable, because it requires only a local excerpt rather than the whole argument. We build evaluation on that asymmetry.
An answer is critique-resilient if no adversary can produce a verified witness of failure within a fixed budget. Correctness is redefined as resistance to falsification rather than agreement with a trusted reference. Humans remain essential, but in a narrower role: as bounded verifiers who adjudicate specific claims, such as “does this counterexample actually refute the lemma?”, rather than certifying entire solutions.
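The definition above can be made concrete with a small sketch. Everything named here (`find_witness`, `is_critique_resilient`, the budget counted as one unit per critique attempt, and the toy claim) is an illustrative assumption, not the protocol's actual implementation. The point it tries to show is that resilience is relative to the budget: the same answer can survive a small budget and fail a larger one.

```python
import math
from typing import Callable, Iterable, Optional, TypeVar

A = TypeVar("A")  # answer type
C = TypeVar("C")  # critique type


def find_witness(answer: A, critiques: Iterable[C],
                 verify: Callable[[A, C], bool], budget: int) -> Optional[C]:
    """Return the first critique the verifier accepts, or None.

    Assumption: the budget is counted as one unit per critique attempt.
    """
    for spent, critique in enumerate(critiques):
        if spent >= budget:
            break
        if verify(answer, critique):  # local check of one alleged error
            return critique
    return None


def is_critique_resilient(answer, critiques, verify, budget) -> bool:
    """An answer is resilient if no verified witness turns up within budget."""
    return find_witness(answer, critiques, verify, budget) is None


def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


# Toy answer: the claim "n*n + n + 41 is prime for every n >= 0".
def claim(n: int) -> int:
    return n * n + n + 41


# Bounded verifier: checks a single proposed counterexample n.
def refutes(f: Callable[[int], int], n: int) -> bool:
    return not is_prime(f(n))


small_budget = is_critique_resilient(claim, range(100), refutes, budget=10)
witness = find_witness(claim, range(100), refutes, budget=50)
print(small_budget, witness)
```

In this toy setting the verifier never has to survey the whole claim; it only checks one proposed counterexample at a time, which mirrors the asymmetry the evaluation relies on.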
A benchmarker authors a question and must answer it too, so that feasibility gating can filter ill-posed or unsolvable tasks. An answerer responds. The benchmarker then hunts for a flaw, and any localized claim of error is sent to adjudication, where a panel of models, escalating to a human on disagreement, rules on that single claim.
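One way to picture a single episode is the sketch below. The names `Benchmarker`, `run_episode`, and `adjudicate` are hypothetical, and two details are assumptions rather than facts from the text: that feasibility gating discards a question when the benchmarker cannot answer it, and that "disagreement" means any non-unanimous panel vote.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Judge = Callable[[str], bool]  # True means "the claimed error is real"


def adjudicate(claim: str, panel: List[Judge], human: Judge) -> bool:
    """Rule on one localized claim; escalate to the human if the panel splits."""
    votes = [judge(claim) for judge in panel]
    if votes and len(set(votes)) == 1:  # unanimous panel (assumed rule)
        return votes[0]
    return human(claim)


@dataclass
class Benchmarker:
    # Returns (question, own_answer); own_answer is None if it cannot solve it.
    author: Callable[[], Tuple[str, Optional[str]]]
    # Returns a localized claim of error, or None if it finds no flaw.
    critique: Callable[[str, str], Optional[str]]


def run_episode(benchmarker: Benchmarker, answerer: Callable[[str], str],
                panel: List[Judge], human: Judge) -> str:
    question, own_answer = benchmarker.author()
    if own_answer is None:
        return "discarded"  # feasibility gating (assumed form)
    answer = answerer(question)
    claim = benchmarker.critique(question, answer)
    if claim is None:
        return "answerer"  # no flaw alleged: the answer stands
    upheld = adjudicate(claim, panel, human)
    return "benchmarker" if upheld else "answerer"
```

A call might look like `run_episode(bm, model_fn, [judge_a, judge_b], human_fn)`, where each argument is a plain Python callable standing in for a model or a person. The human is consulted only when the panel splits, which keeps the human role bounded to single claims.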
Because the benchmarker both poses questions and surfaces errors, benchmark design itself becomes a measurable capability, subject to explicit guardrails that keep questions well-posed.
An itemized bipartite Bradley-Terry model turns win/loss episodes into two scores per model: answerer strength, which captures producing critique-resilient solutions, and benchmarker strength, which captures posing hard-but-solvable questions and catching errors. No human difficulty labels are required; difficulty is inferred from outcomes.
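To illustrate how win/loss episodes can become two scores, here is a minimal sketch of a plain bipartite Bradley-Terry fit. It is a simplification, not the model described above: it assumes the answerer wins with probability sigmoid(a_i - b_j) and omits any per-question (itemized) terms, whose exact parameterization is not given here. The function name, learning rate, step count, and L2 penalty are all illustrative choices.

```python
import math
from typing import List, Tuple

Episode = Tuple[int, int, int]  # (answerer index, benchmarker index, 1 if answerer won)


def fit_bipartite_bt(episodes: List[Episode], n_answerers: int,
                     n_benchmarkers: int, lr: float = 0.1,
                     steps: int = 2000, l2: float = 0.01):
    """Gradient ascent on the penalized log-likelihood of
    P(answerer i beats benchmarker j) = sigmoid(a[i] - b[j])."""
    a = [0.0] * n_answerers      # answerer strengths
    b = [0.0] * n_benchmarkers   # benchmarker strengths
    for _ in range(steps):
        # The L2 penalty keeps scores finite and fixes the shared offset.
        grad_a = [-l2 * x for x in a]
        grad_b = [-l2 * x for x in b]
        for i, j, win in episodes:
            p = 1.0 / (1.0 + math.exp(-(a[i] - b[j])))
            grad_a[i] += win - p
            grad_b[j] -= win - p
        a = [x + lr * g for x, g in zip(a, grad_a)]
        b = [x + lr * g for x, g in zip(b, grad_b)]
    return a, b


# Made-up episodes: answerer 0 wins 3 of 4, answerer 1 wins 1 of 4;
# benchmarker 1 defeats answerers more often than benchmarker 0.
episodes = [(0, 0, 1), (0, 0, 1), (0, 1, 1), (0, 1, 0),
            (1, 0, 1), (1, 0, 0), (1, 1, 0), (1, 1, 0)]
answerer_scores, benchmarker_scores = fit_bipartite_bt(episodes, 2, 2)
print(answerer_scores, benchmarker_scores)
```

No difficulty labels enter the fit: a benchmarker's score rises only because answerers lose against it, which is the sense in which difficulty is inferred from outcomes.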
Answerer strength behaves like a general capability measure. Benchmarker strength is meaningful only inside the protocol: a model that poses harder questions is not necessarily a better model in any broader sense.
| Benchmark | Spearman ρ | Kendall τ |
|---|---|---|
| AIME 2025 | 0.851 | 0.706 |
| BRUMO 2025 | 0.830 | 0.673 |
| HMMT Feb 2025 | 0.819 | 0.662 |
Answerer scores are stable under resampling (0.922 internal 5-fold accuracy) and correlate with established human-designed benchmarks, as the rank correlations in the table above show.
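For readers unfamiliar with the two rank statistics reported above, the sketch below computes Spearman ρ and Kendall τ from two score lists using only the standard library. It assumes there are no tied scores, and the input numbers are invented for illustration; they are not results from this work.

```python
from itertools import combinations
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented example: five models' answerer scores vs. a benchmark accuracy (%).
answerer_strength = [1.2, 0.4, -0.3, 2.0, 0.9]
benchmark_accuracy = [70, 55, 40, 90, 50]
rho = spearman_rho(answerer_strength, benchmark_accuracy)
tau = kendall_tau(answerer_strength, benchmark_accuracy)
print(rho, tau)
```

Both statistics depend only on the ordering of the models, so they measure whether the protocol ranks models the same way a benchmark does, not whether the scores sit on the same scale.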
The most striking result concerns the verifier. When weak models such as GPT-3.5 and GPT-4o, far less capable than the systems they evaluate, replace humans as final adjudicators, the rankings barely change, with rank correlations of ρ between 0.98 and 1. Within the protocol, recognizing a specific error is far easier than producing a correct solution, which lets bounded verification scale as the gap between evaluator and evaluated widens.
Informally, if GPT-3.5 can correctly benchmark GPT-5.2, then human mathematicians can benchmark next-generation models.
Each model authors one question per mathematical area.
We're building a larger benchmark for mathematics, based on CRB. Have a model you'd like to contribute? Email us at [email protected].
@inproceedings{marro2026crb,
  title     = {Benchmarking at the Edge of Comprehension},
  author    = {Marro, Samuele and Yu, Jialin and La Malfa, Emanuele and
               Deb, Oishi and Li, Jiawei and Yang, Yibo and Abraham, Ebey and
               Sengupta, Sunando and Sommerlade, Eric and Wooldridge, Michael
               and Torr, Philip},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}