OpenAI Succeeds at Research-Level Mathematical Proof: AI Enters 'Problem-Solver' Territory
Five of Ten Problems in 'First Proof' Challenge Likely Correct — Long-Form Reasoning and Ambiguity Handling Verified

Artificial intelligence is moving beyond simple computation and problem-solving to take on research-level mathematical problems.

OpenAI has demonstrated new advances in AI reasoning by releasing the results of its internal model's proof attempts on the high-difficulty mathematics challenge "First Proof."

Unlike competition problems with clearly defined answers, "First Proof" consists of research-level problems whose solutions even domain experts need time to verify. Some of the problems have remained unsolved for years, and they are pitched at a difficulty that takes more than a week to work through even with department-level resources. The challenge is significant because it tested core competencies required in actual research, including sustained long-form reasoning, choice of abstraction, and handling of ambiguity, rather than simply competing on benchmark scores.

OpenAI attempted proofs for all 10 problems, and expert feedback indicated that the proofs for at least 5 of them (problems 4, 5, 6, 9, and 10) are likely correct. The initial model solved only 2 problems, but iterative training and improved strategy raised that number to 5. Notably, on problems with familiar mathematical structures, the model maintained rigorous logic and carried its proofs through to completion.

However, limitations also emerged. Some proofs initially judged correct were found to contain errors during community analysis and official review. This shows that AI-generated mathematical proofs still need human experts for final, rigorous verification.

This result came from a collaborative process with human oversight, not from fully automated proving. The research team used strategic prompting to steer the AI back to promising lines of reasoning, and had it expand or revise its arguments in response to expert feedback. Internal reasoning models and conversational models also cross-checked each other's work to refine logical structure and wording. This suggests that AI is evolving not into an independent researcher, but into a "research partner" that collaborates with humans.

OpenAI explained that this achievement is not a one-off experiment but part of a long-running line of reasoning research. It builds on the company's 2025 results on International Mathematical Olympiad (IMO)-level problems and on scientific research cases using next-generation models. The company also recently disclosed that internal models are working to prove candidate physics formulas proposed by GPT-5.2 series models, and that researchers are verifying the results.

The goal going forward is to sustain continuous reasoning for hours or longer while maintaining a high degree of confidence in its rigor. The research team said it plans to gradually integrate this "research-level reasoning capability" into next-generation public models.

This announcement shows that AI could establish itself as a partner in complex scientific and mathematical research rather than remaining a simple information-generation tool. It has also made clear that new challenges such as rigorous verification, accountability, and research ethics standards must be addressed in parallel. AI's participation in research is now moving from possibility to actual application.