
Mistral AI Releases Leanstral 1.5: An Apache-2.0 Lean 4 Code Agent Model Solving 587 of 672 PutnamBench Problems

Today, Mistral AI launched Leanstral 1.5. It is a code agent model built for Lean 4. The release targets automated theorem proving and proof engineering. Weights are open under Apache 2.0. A free API endpoint, leanstral-1-5, is now live.

Leanstral 1.5 updates the earlier Leanstral-2603 model. It belongs to the Mistral Small 4 family.

What is Leanstral 1.5

Leanstral 1.5 is a code agent model for Lean 4, a proof assistant. A proof assistant checks each logical step mechanically. Lean 4 can express objects like perfectoid spaces and properties of Rust fragments.
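For readers new to Lean, here is a minimal illustration of what "checked mechanically" means. The theorem name `add_comm_example` is made up for this sketch; `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- A small theorem: addition on natural numbers is commutative.
-- Lean accepts it only because the cited lemma matches the goal exactly.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

-- A true arithmetic fact that Lean verifies by computation.
example : 2 + 2 = 4 := rfl

-- A false claim does not compile. Uncommenting the next line produces an error.
-- example : 2 + 2 = 5 := rfl
```

The compiler errors that a failed proof produces are the kind of feedback a proving agent reads and reacts to.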

The architecture is a mixture-of-experts, or MoE. An MoE routes each token to a few specialized sub-networks. This keeps compute low while total capacity stays large. Leanstral uses 128 experts, with 4 active per token.
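The routing idea can be sketched in a few lines. This is a generic top-k router written for illustration, not Mistral's implementation. The function name `route_top_k` and the router scores are assumptions. Only the 128 and 4 come from the article.

```python
import math

def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest router scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only, shifted by the max for stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 128 experts with made-up scores; only 4 are selected for this token.
logits = [math.sin(i) for i in range(128)]
selected = route_top_k(logits, k=4)
for expert_id, weight in selected:
    print(f"expert {expert_id}: weight {weight:.3f}")
```

The other 124 experts do no work for this token, which is where the compute saving comes from.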

Total size is 119B parameters, with 6.5B activated per token. Context length is 256k tokens. Input is multimodal, accepting text and images. Output is text only.
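A quick calculation puts those figures in proportion. The inputs are the numbers quoted above. The variable names are only for this sketch.

```python
TOTAL_PARAMS_B = 119.0   # total parameters, in billions
ACTIVE_PARAMS_B = 6.5    # parameters activated per token, in billions
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 4

# Share of the model that does work on any single token.
active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
# Share of the experts selected for any single token.
active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"active parameters: {active_param_fraction:.2%}")
print(f"active experts:    {active_expert_fraction:.3%}")
```

Roughly 5.5% of the parameters are used per token, while exactly 3.125% of the experts are selected.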

How Mistral Trained Leanstral 1.5

Training runs in three stages. These are mid-training, supervised fine-tuning, then reinforcement learning with CISPO. Two reinforcement-learning environments shaped the model’s agentic behavior.

In the multiturn environment, the model receives a theorem statement. It must prove or disprove it. It submits a proof, then reads Lean compiler feedback. It refines across attempts until it succeeds or exhausts its budget.
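The loop described above can be sketched as follows. This is a schematic, not Mistral's environment. `prove_with_feedback`, `propose`, and `check` are hypothetical names, and the toy stand-ins replace the model and the Lean compiler. The sketch covers only the proving path, not disproving.

```python
from typing import Callable, Optional, Tuple

def prove_with_feedback(
    statement: str,
    propose: Callable[[str, str], str],          # hypothetical: model call
    check: Callable[[str], Tuple[bool, str]],    # hypothetical: compiler call
    max_attempts: int = 8,                       # assumed attempt budget
) -> Optional[str]:
    """Submit proofs until one checks or the attempt budget runs out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate
    return None  # budget exhausted

# Toy stand-ins so the loop can run without a model or a Lean install.
def toy_propose(statement: str, feedback: str) -> str:
    return "by simp" if "sorry" in feedback else "by sorry"

def toy_check(proof: str) -> Tuple[bool, str]:
    if "sorry" in proof:
        return False, "error: proof contains sorry"
    return True, ""

result = prove_with_feedback("theorem t : 1 + 1 = 2", toy_propose, toy_check)
print(result)
```

The key design point is that each new attempt sees the previous attempt's error message, not just the original statement.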

In the code agent environment, Leanstral works inside a raw filesystem. It edits files, runs bash commands, and uses the Lean language server. That server exposes goals, errors, and type information in real time.

This lets it complete partial proofs, build auxiliary lemmas, and persist through context compaction. Compaction compresses earlier context so long tasks still fit the window. Correctness is verified by Mistral’s fork of SafeVerify against target theorems.
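The article does not say how compaction is implemented. One simple possibility, shown purely as an assumption, is to collapse older turns into a summary once the history grows past a limit. The function `compact`, the character-based limit, and the placeholder summary are all hypothetical.

```python
def compact(messages, max_chars, keep_recent=2, summarize=None):
    """Collapse all but the newest turns into one summary entry when over budget.

    Uses character counts as a stand-in for tokens (an assumption).
    """
    total = sum(len(m) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return list(messages)  # still fits; nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize is None:
        # Placeholder; a real agent would presumably ask a model to summarize.
        summarize = lambda msgs: f"[summary of {len(msgs)} earlier steps]"
    return [summarize(old)] + list(recent)

history = ["a" * 100, "b" * 100, "c" * 10, "d" * 10]
print(compact(history, max_chars=50))
```

Whatever the real mechanism, the trade-off is the same: older detail is lost in exchange for room to keep working.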

Benchmarks and Performance

The Mistral team reports that Leanstral 1.5 saturates miniF2F. It reaches 100% on both the validation and test sets. It solves 587 of 672 PutnamBench problems.

The model sets a new state of the art on the FATE-H and FATE-X algebra benchmarks. Mistral lists 87% on FATE-H and 34% on FATE-X. On FLTEval, pass@1 rises from 21.9 to 28.9. Pass@8 rises from 31.9 to 43.2.

FLTEval is built from real pull requests to the Fermat’s Last Theorem repository. On it, Leanstral surpasses Opus 4.6’s 39.6 at one-seventh the cost. It also widens its lead over open-source models three to ten times larger. Pass@8 means eight attempts are allowed per problem.
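To make the pass@k definition concrete, here is a minimal sketch that counts a problem as solved if any of its first k attempts succeeds. The function name and the toy results are made up for illustration and are not FLTEval data. Benchmark harnesses may compute the metric differently.

```python
def pass_at_k(attempts_per_problem, k):
    """Percent of problems where at least one of the first k attempts succeeded."""
    solved = sum(1 for attempts in attempts_per_problem if any(attempts[:k]))
    return 100.0 * solved / len(attempts_per_problem)

# Made-up outcomes: one row per problem, one boolean per attempt.
toy_results = [
    [True, False],   # solved on the first try
    [False, True],   # solved only on the second try
    [False, False],  # never solved
    [False, False],  # never solved
]

print(pass_at_k(toy_results, k=1))  # 25.0
print(pass_at_k(toy_results, k=2))  # 50.0
```

Pass@k can only stay flat or rise as k grows, which is why pass@8 scores sit above pass@1 scores.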

| Benchmark | Leanstral 1.5 | Detail |
| --- | --- | --- |
| miniF2F (val + test) | 100% | Saturated, per Mistral |
| PutnamBench | 587 / 672 | ~$4 per problem |
| FATE-H | 87% | New state of the art |
| FATE-X | 34% | New state of the art |
| FLTEval pass@1 | 28.9 | Up from 21.9 |
| FLTEval pass@8 | 43.2 | Beats Opus 4.6’s 39.6 |

On PutnamBench, Leanstral edges out Seed-Prover 1.5’s high setting by 7 problems. It does so at about $4 per problem. Mistral estimates Seed-Prover’s high setting at around $300 or more per problem.

That setting runs on a budget of 10 H20-days per problem. Mistral also compares against Goedel-Architect and AxProverBase. It notes Aleph Prover costs roughly $54 to $68 per problem.
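The cost gap is easier to see as ratios. The sketch below uses only the per-problem figures quoted above. Since the Seed-Prover estimate is "around $300 or more", its ratio is best read as a floor. The implied Seed-Prover solve count and the implied price per H20-day are derived here, not stated by Mistral.

```python
LEANSTRAL_COST = 4.0  # dollars per problem, approximate

# (low, high) dollar estimates per problem, as quoted in the article.
other_costs = {
    "Seed-Prover 1.5 (high)": (300.0, 300.0),  # "around $300 or more"
    "Aleph Prover": (54.0, 68.0),
}

# How many times more expensive each system is than Leanstral.
cost_ratios = {
    name: (low / LEANSTRAL_COST, high / LEANSTRAL_COST)
    for name, (low, high) in other_costs.items()
}
for name, (low, high) in cost_ratios.items():
    print(f"{name}: {low:g}x to {high:g}x Leanstral's cost")

# Derived, not stated: a 7-problem lead over 587 implies this count.
seed_prover_solved = 587 - 7
# Derived, not stated: $300 spread over 10 H20-days.
implied_dollars_per_h20_day = 300.0 / 10
print(seed_prover_solved, implied_dollars_per_h20_day)
```

On these numbers, Seed-Prover's high setting costs at least 75 times as much per problem, and Aleph Prover costs 13.5 to 17 times as much.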

Test-time scaling is the model’s defining behavior. Raising the token budget per attempt lifts PutnamBench Pass@8. The Mistral team reports 44 solved at 50k, 244 at 200k, 493 at 1M, and 587 at 4M. The interactive explorer below lets you scrub across that same curve.
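The same curve can be restated as solve rates out of 672 problems. This short sketch uses only the four data points reported above. The variable names are for illustration.

```python
TOTAL_PROBLEMS = 672

# Token budget per attempt -> PutnamBench problems solved (Pass@8), per Mistral.
solved_by_budget = {
    50_000: 44,
    200_000: 244,
    1_000_000: 493,
    4_000_000: 587,
}

solve_rates = {b: s / TOTAL_PROBLEMS for b, s in solved_by_budget.items()}

previous = 0
gains = []
for budget, solved in solved_by_budget.items():
    gains.append(solved - previous)  # problems added by this budget step
    print(f"{budget:>9,} tokens: {solved:>3} solved "
          f"({solve_rates[budget]:.1%}), +{solved - previous}")
    previous = solved
```

The rates work out to about 6.5%, 36.3%, 73.4%, and 87.4%. The last step quadruples the budget but adds 94 problems, fewer than the 249 gained in the step before it.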