
DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model That Scored 118/120 on Putnam 2024

How can an AI system solve complicated olympiad-level math problems in clear natural language while also checking that its own reasoning is actually correct? DeepSeek AI has released DeepSeekMath-V2, an open-weights large language model optimized for natural-language theorem proving with self-verification. The model is built on DeepSeek-V3.2-Exp-Base, runs as a 685B-parameter mixture of experts, and is available on Hugging Face under an Apache 2.0 license.

In evaluations, DeepSeekMath-V2 reaches gold-level scores on IMO 2025 and CMO 2024, and achieves 118 of 120 points on Putnam 2024 when used with scaled test-time compute.

Why Aren't Final-Answer Rewards Enough?

Most recent math reasoning models use reinforcement learning that rewards only the final answer on benchmarks such as AIME and HMMT. This approach pushed models from weak baselines to near saturation on short-answer contests in about one year.

However, the DeepSeek research team points out two structural problems:

  1. A correct numeric answer does not guarantee correct reasoning. The model may reach the right number through algebraic errors that cancel out.
  2. Many tasks, such as olympiad proofs and theorem proving, require a complete argument in natural language. These tasks do not have a single final numeric answer, so standard answer-based rewards do not apply.

DeepSeekMath-V2 therefore optimizes proof quality instead of pure answer accuracy. The system evaluates whether a proof is complete and logically sound, and uses that evaluation as the main learning signal.

Training a Verifier before the Generator

The core design is verifier-first. The DeepSeek research team trains an LLM-based verifier that reads a problem and a candidate proof, then outputs both a natural-language analysis and a discrete quality score in the set {0, 0.5, 1}.

The initial reinforcement learning data comes from Art of Problem Solving contests. The research team crawled 17,503 proof-style problems from olympiads, team selection tests, and post-2010 problems that explicitly require proofs. These problems form the base set for cold-start RL. Candidate proofs come from a DeepSeek-V3.2 reasoning model that is prompted to iteratively refine its own solutions, which increases detail but also creates many imperfect proofs. Human experts label these proofs using the 0, 0.5, 1 rubric, based on rigor and completeness.

The verifier is trained with Group Relative Policy Optimization (GRPO). The reward has two components:

  • A format reward, which checks that the verifier output follows a fixed template, including an analysis section and a final score in a box.
  • A score reward, which penalizes the absolute difference between the predicted score and the expert score.

This stage produces a verifier that can grade olympiad-style proofs in a consistent way.
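The two-part verifier reward can be sketched as follows. The template markers, the regex for the boxed score, and the multiplicative combination of the two terms are illustrative assumptions, not DeepSeek's published implementation:

```python
import re

def verifier_reward(output: str, expert_score: float) -> float:
    """Sketch of the verifier's GRPO reward: a format reward gating a
    score reward. Template details are assumptions for illustration."""
    # Format reward: the output must contain an analysis section and a
    # final score in a box, drawn from the discrete set {0, 0.5, 1}.
    box = re.search(r"\\boxed\{(0\.5|0|1)\}", output)
    has_analysis = "Analysis:" in output  # assumed template marker
    if box is None or not has_analysis:
        return 0.0  # malformed outputs earn nothing
    predicted = float(box.group(1))
    # Score reward: penalize the absolute gap to the expert label.
    score_reward = 1.0 - abs(predicted - expert_score)
    return score_reward

graded = "Analysis: the induction step is incomplete.\nFinal score: \\boxed{0.5}"
print(verifier_reward(graded, 0.5))  # 1.0, predicted score matches the expert
print(verifier_reward(graded, 1.0))  # 0.5, penalized by the score gap
```

The gating structure means a verifier cannot collect partial reward for a correct score hidden in a malformed response, which keeps outputs parseable for the later pipeline stages.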

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

Meta Verification to Control Hallucinated Critiques

A verifier can still game the reward. It can output the correct final score while inventing fake issues in the analysis. This would satisfy the numeric objective but make the explanations unreliable.

To address this, the research team introduces a meta-verifier. The meta-verifier reads the original problem, the proof, and the verifier's analysis, and then evaluates whether the analysis is faithful. It scores aspects such as restatement of steps, identification of real defects, and consistency between the narrative and the final score.

The meta-verifier is also trained with GRPO, with its own format and score rewards. Its output, a meta-quality score, is then used as an extra reward term for the base verifier. Analyses that hallucinate issues get low meta scores, even when the final proof score is correct. In experiments, this raises the average meta-evaluated quality of analyses from around 0.85 to 0.96 on a validation split, while keeping proof-score accuracy stable.
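The effect of the extra reward term can be sketched as a simple blend. The mixing weight below is a placeholder; the article does not state the exact coefficient DeepSeek uses:

```python
def verifier_reward_with_meta(score_reward: float, meta_quality: float,
                              meta_weight: float = 0.5) -> float:
    """Blend the verifier's score reward with the meta-verifier's rating
    of its analysis. The 0.5 blend weight is an assumed placeholder."""
    return (1.0 - meta_weight) * score_reward + meta_weight * meta_quality

# A verifier that predicts the right score but fabricates issues in its
# analysis (low meta quality) now earns less than a faithful one.
print(verifier_reward_with_meta(1.0, 0.2))   # low: hallucinated critique
print(verifier_reward_with_meta(1.0, 0.96))  # high: faithful critique
```

This is what closes the loophole above: the numeric score alone no longer determines the verifier's reward.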

Self-Verifying Proof Generator and Sequential Refinement

Once the verifier is strong, the DeepSeek research team trains the proof generator. The generator takes a problem and outputs both a solution and a self-analysis that follows the same rubric as the verifier.

The reward for the generator combines three signals:

  1. The verifier score on the generated proof.
  2. The agreement between the self-reported score and the verifier score.
  3. The meta-verification score of the self-analysis.

Formally, the main reward uses weights α = 0.76 for the proof score and β = 0.24 for the self-analysis component, multiplied by a format term that enforces the output structure. This pushes the generator to write proofs that the verifier accepts, and to be honest about remaining issues. If it claims that a flawed proof is perfect, it loses reward through disagreement and low meta scores.
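The weighting can be sketched as below. The α and β values come from the article; how the self-analysis component internally combines the agreement and meta-verification signals is an assumption (an even split here), not the paper's stated formula:

```python
def generator_reward(proof_score: float, self_score: float,
                     meta_score: float, format_ok: bool,
                     alpha: float = 0.76, beta: float = 0.24) -> float:
    """Sketch of the generator reward with alpha = 0.76 (proof score)
    and beta = 0.24 (self-analysis part), gated by a format term.
    The 50/50 split inside the self-analysis part is an assumption."""
    if not format_ok:
        return 0.0  # the format term gates the entire reward
    agreement = 1.0 - abs(self_score - proof_score)
    self_analysis_part = 0.5 * (agreement + meta_score)
    return alpha * proof_score + beta * self_analysis_part

# Claiming a flawed proof (verifier score 0.5) is perfect (self score 1.0)
# costs reward through the disagreement term.
honest = generator_reward(0.5, self_score=0.5, meta_score=0.9, format_ok=True)
inflated = generator_reward(0.5, self_score=1.0, meta_score=0.9, format_ok=True)
print(honest > inflated)  # True
```

Because α dominates, improving the proof itself is always worth more than gaming the self-analysis, while β keeps the self-reports calibrated.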

DeepSeek also exploits the 128K-token context limit of the base model. For hard problems, the generator often cannot repair all issues in a single pass, because the refined proof plus analysis would exceed the context. In that case, the system runs sequential refinement. It generates a proof and self-analysis, feeds them back as context, and asks the model to produce a new proof that fixes the previously detected issues. This loop can repeat several times, subject to the context budget.
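The refinement loop can be sketched as follows. Here `generate` stands in for a call to the model, and the stopping condition, prompt wording, and iteration cap are all illustrative assumptions:

```python
def sequential_refine(problem: str, generate, count_tokens,
                      max_iters: int = 4, context_budget: int = 128_000):
    """Sketch of sequential refinement: generate a proof plus
    self-analysis, feed both back, and ask for a proof that fixes the
    detected issues, staying within the context budget.
    `generate(prompt) -> (proof, analysis)` is a placeholder model call."""
    prompt = problem
    proof, analysis = None, None
    for _ in range(max_iters):
        proof, analysis = generate(prompt)
        if "no remaining issues" in analysis.lower():  # assumed stop signal
            break
        next_prompt = (f"{problem}\n\nPrevious proof:\n{proof}\n\n"
                       f"Detected issues:\n{analysis}\n\n"
                       "Write a new proof that fixes these issues.")
        if count_tokens(next_prompt) > context_budget:
            break  # the refined prompt would exceed the 128K context
        prompt = next_prompt
    return proof, analysis
```

Each iteration conditions on the full previous attempt and critique, which is why the context budget, not the model's patience, is the binding constraint on hard problems.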

https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main

Scaling Verification and Auto Labeling

As the generator improves, it produces harder proofs, which are costly to label by hand. To keep training data fresh, the research team introduces an automatic labeling pipeline based on scaled verification.

For each candidate proof, the system samples multiple independent verifier analyses, then evaluates each analysis using the meta-verifier. If multiple high-quality analyses converge on the same serious issues, the proof is labeled as incorrect. If no valid issues survive meta-checking, the proof is labeled as correct. In the final training iterations this pipeline replaces human labels, with spot checks confirming good agreement with experts.
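The consensus rule can be sketched as below. The quality threshold, the agreement count, and exact-string matching of issues are assumptions standing in for whatever matching the pipeline actually uses:

```python
from collections import Counter

def auto_label(analyses, meta_check, quality_threshold=0.9, min_agreeing=2):
    """Sketch of the auto-labeling rule: keep only analyses the
    meta-verifier rates as high quality, then label by consensus on the
    surviving issues. `meta_check(analysis) -> (quality, issues)` is a
    placeholder for the meta-verifier; thresholds are assumptions."""
    surviving_issues = []
    for analysis in analyses:
        quality, issues = meta_check(analysis)
        if quality >= quality_threshold:  # discard hallucinated critiques
            surviving_issues.extend(issues)
    if not surviving_issues:
        return "correct"  # no valid issue survived meta-checking
    # Label incorrect only when several high-quality analyses agree.
    _, count = Counter(surviving_issues).most_common(1)[0]
    return "incorrect" if count >= min_agreeing else "unlabeled"
```

A single low-quality critique cannot flip a label, which is what makes the pipeline trustworthy enough to replace human annotation in the final iterations.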

Competition and Benchmark Results

The research team evaluated DeepSeekMath-V2 on several fronts:

On an internal set of 91 CNML-level problems covering algebra, geometry, number theory, combinatorics, and inequalities, DeepSeekMath-V2 achieves the highest mean proof score in every category, ahead of Gemini 2.5 Pro and GPT-5 Thinking High, as measured by their verifier.


On the IMO 2024 Shortlist, sequential refinement with self-verification improves both pass@1 and best-of-32 quality metrics as the maximum number of refinement iterations increases.


On IMO-ProofBench, expert evaluation shows that DeepSeekMath-V2 outperforms DeepMind's DeepThink IMO Gold on the Basic subset and stays competitive on the Advanced subset, while clearly beating other large models.


For full competitions, it reports:

  • IMO 2025: 5 of 6 problems solved, gold-medal level.
  • CMO 2024: 4 problems fully solved plus partial credit on 1 more, gold-medal level.
  • Putnam 2024: 11 of 12 problems solved completely and the remaining problem solved with minor errors, for 118 of 120 points, above the top human score of 90.

Key Takeaways

  • DeepSeekMath-V2 is a 685B-parameter model built on DeepSeek-V3.2-Exp-Base, designed for natural-language theorem proving with self-verification, and released as open weights under the Apache 2.0 license.
  • The main innovation is a verifier-first training pipeline with a GRPO-trained verifier and meta-verifier that score proofs on rigor, not only final answers, which directly addresses the gap between correct answers and correct reasoning.
  • A proof generator is then trained against this verifier and meta-verifier, using rewards that combine proof quality, agreement with self-evaluation, and analysis faithfulness, plus sequential refinement under the 128K context to iteratively repair proofs.
  • With scaled test-time compute and large verification budgets, DeepSeekMath-V2 reaches gold-level performance on IMO 2025 and CMO 2024 and scores 118 of 120 on Putnam 2024, surpassing the top human score that year.

Editorial Notes

DeepSeekMath-V2 is an important step toward self-verifiable mathematical reasoning because it directly tackles the gap between correct final answers and correct reasoning. It uses a verifier, a meta-verifier, and a proof generator, all trained with GRPO on olympiad-style proofs and deployed at 685B scale, to reach gold-level performance on IMO 2025 and CMO 2024 and a near-perfect 118 of 120 on Putnam 2024. Overall, this release shows that self-verifiable mathematical reasoning with open weights is now practically achievable for competition-level problems.



The post DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model That Scored 118/120 on Putnam 2024 appeared first on MarkTechPost.
