arxiv:2606.13473

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Published on Jun 11 · Submitted by taesiri on Jun 12

Abstract

MaxProof is a test-time scaling framework that enhances mathematical proof generation by combining multiple proof-oriented capabilities and using population-level search with tournament selection to achieve competitive performance on high-level mathematical competitions.

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
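The test-time loop described in the abstract (generate a population, verify, repair from critiques, then pick one proof by tournament) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the single-elimination bracket, the acceptance threshold of 1.0, and the toy stand-ins for the model (which count `?` characters as unjustified steps) are all assumptions made for illustration. In MaxProof the four roles would be played by the single M3 model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    proof: str
    score: float = 0.0  # hypothetical verifier score in [0, 1]; 1.0 means accepted
    critique: str = ""


def tournament(pool: List[Candidate],
               better: Callable[[Candidate, Candidate], Candidate]) -> Candidate:
    """Single-elimination bracket: pair candidates, keep each winner, repeat."""
    pool = list(pool)
    while len(pool) > 1:
        winners = [better(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # the unpaired candidate gets a bye
        pool = winners
    return pool[0]


def population_search(generate: Callable[[random.Random], str],
                      verify: Callable[[str], Tuple[float, str]],
                      repair: Callable[[str, str], str],
                      better: Callable[[Candidate, Candidate], Candidate],
                      size: int = 8, rounds: int = 3, seed: int = 0) -> Candidate:
    """Generate a population, run verify/repair rounds, return the tournament winner."""
    rng = random.Random(seed)
    population = [Candidate(generate(rng)) for _ in range(size)]
    for _ in range(rounds):
        for cand in population:
            cand.score, cand.critique = verify(cand.proof)
        # Accepted proofs are kept; rejected ones are rewritten from their critique.
        population = [
            c if c.score >= 1.0 else Candidate(repair(c.proof, c.critique))
            for c in population
        ]
    for cand in population:  # re-score repaired proofs before ranking
        cand.score, cand.critique = verify(cand.proof)
    return tournament(population, better)


# Toy stand-ins for the model's roles (assumptions, not the M3 model).
def toy_generate(rng: random.Random) -> str:
    return "proof" + "?" * rng.randint(0, 3)  # each "?" is an unjustified step


def toy_verify(proof: str) -> Tuple[float, str]:
    gaps = proof.count("?")
    return 1.0 / (1 + gaps), f"{gaps} unjustified step(s)"


def toy_repair(proof: str, critique: str) -> str:
    return proof.replace("?", "", 1)  # fixes one gap; the toy ignores the critique text


def toy_better(a: Candidate, b: Candidate) -> Candidate:
    return a if a.score >= b.score else b  # ranker: higher verifier score wins


winner = population_search(toy_generate, toy_verify, toy_repair, toy_better)
print(winner.proof, winner.score)
```

In this sketch the ranker simply compares verifier scores. A pairwise comparison by the model itself, which is what "ranker" in the abstract suggests, would slot into `better` without changing the rest of the loop.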

Community

Neat paper. The idea of using a single model as a generator, verifier, refiner, and ranker to handle competition-level math is quite a shift from the usual multi-model setups. Achieving gold-medal level performance on both IMO 2025 and USAMO 2026 via test-time scaling is an impressive jump.

How much does the performance drop when you remove the critique-conditioned repair step compared to just using the generator and verifier alone?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/8145e1f6-9806-4b50-beec-198cf656f46a

Get this paper in your agent:

hf papers read 2606.13473
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
