arxiv:2606.13473

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Published on Jun 11

Submitted by taesiri on Jun 12

MiniMax

Authors: Jiacheng Chen, Yanmohan Wang

Abstract

MaxProof is a test-time scaling framework for mathematical proof generation. It combines proof generation, proof verification, and critique-conditioned proof repair in a single model, then searches over a population of candidate proofs and picks the final one by tournament selection, reaching gold-medal-level scores on IMO 2025 and USAMO 2026.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
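The abstract names the four roles (generator, verifier, refiner, ranker) but does not spell out the search procedure, so the following is only a minimal sketch of one way such a loop could be organized. Everything in it is an assumption made for illustration: the function names (`population_search`, `tournament`, the `toy_*` stubs), the single-elimination tournament format, the population size, and the round count are hypothetical and are not taken from the paper. The stubs replace model calls with a toy task in which a "proof" is just the set of lemmas it establishes.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    proof: frozenset      # toy stand-in for a proof: the lemmas it establishes
    score: float = 0.0    # verifier score in [0, 1]


def tournament(pool: List[Candidate],
               better: Callable[[Candidate, Candidate], Candidate]) -> Candidate:
    """Single-elimination tournament (an assumed format); an unpaired candidate gets a bye."""
    pool = list(pool)
    while len(pool) > 1:
        winners = [better(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]


def population_search(generate: Callable[[], frozenset],
                      verify: Callable[[frozenset], Tuple[float, Optional[str]]],
                      repair: Callable[[frozenset, str], frozenset],
                      better: Callable[[Candidate, Candidate], Candidate],
                      pop_size: int = 4,
                      rounds: int = 3) -> Candidate:
    """Generate a population, alternate verify/repair, then rank by tournament."""
    population = [Candidate(generate()) for _ in range(pop_size)]
    for _ in range(rounds):
        for cand in population:
            cand.score, critique = verify(cand.proof)
            if critique is not None:
                # Critique-conditioned repair: the refiner sees the verifier's critique.
                cand.proof = repair(cand.proof, critique)
    for cand in population:
        # Re-score after the last repair so the ranker sees up-to-date scores.
        cand.score, _ = verify(cand.proof)
    return tournament(population, better)


# --- Hypothetical toy stubs standing in for the four model roles ---
REQUIRED = ("lemma_1", "lemma_2", "lemma_3")
rng = random.Random(0)


def toy_generate() -> frozenset:
    # Each required lemma is included with probability 0.5.
    return frozenset(lemma for lemma in REQUIRED if rng.random() < 0.5)


def toy_verify(proof: frozenset) -> Tuple[float, Optional[str]]:
    # Score is the fraction of required lemmas present; critique names one gap.
    missing = [lemma for lemma in REQUIRED if lemma not in proof]
    score = 1 - len(missing) / len(REQUIRED)
    return score, (missing[0] if missing else None)


def toy_repair(proof: frozenset, critique: str) -> frozenset:
    # Patch exactly the gap the critique points at.
    return proof | {critique}


def toy_better(a: Candidate, b: Candidate) -> Candidate:
    # Pairwise ranker: prefer the higher verifier score, ties go to the first.
    return a if a.score >= b.score else b


best = population_search(toy_generate, toy_verify, toy_repair, toy_better)
print(best.score, sorted(best.proof))
```

In this toy setting each repair call closes one gap, so three rounds are enough for every candidate to become complete. With a real model, neither verification nor repair is reliable, which is presumably why a population and a final ranking step are used rather than a single repair chain.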

Community

noahml

about 15 hours ago

Neat paper. The idea of using a single model as a generator, verifier, refiner, and ranker to handle competition-level math is quite a shift from the usual multi-model setups. Achieving gold-medal level performance on both IMO 2025 and USAMO 2026 via test-time scaling is an impressive jump.

How much does performance drop when you remove the critique-conditioned repair step and use only the generator and verifier?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/8145e1f6-9806-4b50-beec-198cf656f46a