Mistral AI Releases Leanstral 1.5: An Apache-2.0 Lean 4 Code Agent Model Solving 587 of 672 PutnamBench Problems
Today, Mistral AI launched Leanstral 1.5, a code agent model built for Lean 4. The release targets automated theorem proving and proof engineering. Weights are open under Apache 2.0. A free API endpoint, leanstral-1-5, is now live.
Leanstral 1.5 updates the earlier Leanstral-2603 model. It belongs to the Mistral Small 4 family.
What is Leanstral 1.5
Leanstral 1.5 is a code agent model for Lean 4, a proof assistant. A proof assistant checks each logical step mechanically. Lean 4 can express objects like perfectoid spaces and properties of Rust fragments.
The architecture is a mixture-of-experts, or MoE. An MoE routes each token to a few specialized sub-networks. This keeps compute low while total capacity stays large. Leanstral uses 128 experts, with 4 active per token.
Total size is 119B parameters, with 6.5B activated per token. Context length is 256k tokens. Input is multimodal, accepting text and images. Output is text only.
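To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is not Leanstral's actual router: the random scores, the softmax renormalization over the selected experts, and the names `route_token`, `NUM_EXPERTS`, and `TOP_K` are illustrative assumptions. Only the 128-expert, 4-active, 119B, and 6.5B figures come from the article.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 4          # experts active per token, from the article


def route_token(scores, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and softmax-normalize their weights.

    `scores` stands in for router logits; a real router would compute them
    from the token's hidden state (assumption for illustration).
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)

for expert_id, weight in routing:
    print(f"expert {expert_id:3d} weight {weight:.3f}")

# Only 4 of 128 experts run for this token.
print(f"active experts: {len(routing)}/{NUM_EXPERTS}")
# Share of parameters activated per token, using the article's 6.5B and 119B.
print(f"active parameter share: {6.5 / 119:.1%}")
```

The other 124 experts are skipped for this token, which is why per-token compute tracks the 6.5B active parameters rather than the full 119B.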
How Mistral Trained Leanstral 1.5
Training runs in three stages: mid-training, supervised fine-tuning, then reinforcement learning with CISPO. Two reinforcement-learning environments shaped the model’s agentic behavior.
In the multiturn environment, the model receives a theorem statement. It must prove or disprove it. It submits a proof, then reads Lean compiler feedback. It refines across attempts until it succeeds or exhausts its budget.
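The submit, read feedback, refine cycle can be sketched as a small loop. This is a rough illustration, not Mistral's training environment: `prove_with_feedback`, `CheckResult`, and the toy `propose` and `check` callables are hypothetical stand-ins for a model call and a Lean compiler wrapper, and the budget is counted in attempts for simplicity.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    feedback: str


def prove_with_feedback(statement, propose, check, max_attempts=8):
    """Submit proofs until one checks or the attempt budget runs out.

    propose(statement, history) -> candidate proof text (hypothetical model call)
    check(statement, proof) -> CheckResult (hypothetical Lean compiler wrapper)
    """
    history = []  # (proof, compiler feedback) pairs from failed attempts
    for attempt in range(1, max_attempts + 1):
        proof = propose(statement, history)
        result = check(statement, proof)
        if result.ok:
            return proof, attempt
        history.append((proof, result.feedback))
    return None, max_attempts


# Toy stand-ins so the sketch runs without a model or a Lean toolchain.
def toy_propose(statement, history):
    return "by simp" if not history else "by omega"


def toy_check(statement, proof):
    if proof == "by omega":
        return CheckResult(True, "")
    return CheckResult(False, "error: simp made no progress")


proof, attempts = prove_with_feedback(
    "theorem t (n : Nat) : n + 0 = n", toy_propose, toy_check
)
print(proof, attempts)
```

The key design point is that each failed attempt's compiler feedback is carried forward in `history`, so later proposals can react to it.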
In the code agent environment, Leanstral works inside a raw filesystem. It edits files, runs bash commands, and uses the Lean language server. That server exposes goals, errors, and type information in real time.
This lets it complete partial proofs, build auxiliary lemmas, and persist through context compaction. Compaction compresses earlier context so long tasks still fit the window. Correctness is verified by Mistral’s fork of SafeVerify against target theorems.
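The article does not describe how compaction is implemented, so the following is only a toy illustration of the general idea: when a history grows past a budget, the oldest entries are folded into a shorter summary. The word-count tokenizer, the `toy_summarize` function, and the merge-two-oldest policy are all assumptions made for this sketch.

```python
def count_tokens(text):
    """Crude token count: whitespace-separated words (assumption)."""
    return len(text.split())


def toy_summarize(older, newer):
    """Stand-in summarizer: keep the first three words of each message."""
    return " ".join(older.split()[:3] + newer.split()[:3])


def compact(messages, max_tokens, summarize=toy_summarize):
    """Merge the two oldest messages until the history fits max_tokens."""
    messages = list(messages)
    while len(messages) > 2 and sum(count_tokens(m) for m in messages) > max_tokens:
        messages = [summarize(messages[0], messages[1])] + messages[2:]
    return messages


history = [f"step {i} " + "detail " * 8 for i in range(4)]  # 10 words each
compacted = compact(history, max_tokens=25)

print(len(history), "->", len(compacted), "messages")
print(sum(count_tokens(m) for m in compacted), "tokens after compaction")
```

In a real agent the summarizer would typically be the model itself, and the most recent messages are left untouched so the current goal stays in view.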
Benchmarks and Performance
The Mistral team reports that Leanstral 1.5 saturates miniF2F, reaching 100% on both the validation and test sets. It solves 587 of 672 PutnamBench problems.
The model sets a new state-of-the-art on the FATE-H and FATE-X algebra benchmarks. Mistral lists 87% on FATE-H and 34% on FATE-X. On FLTEval, pass@1 rises from 21.9 to 28.9. Pass@8 rises from 31.9 to 43.2.
FLTEval is built from real pull requests to the Fermat’s Last Theorem repository. On it, Leanstral surpasses Opus 4.6’s 39.6 at one-seventh the cost. It also widens its lead over open-source models three to ten times larger. Pass@8 means eight attempts are allowed per problem.
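As a reading aid for the pass@1 and pass@8 figures, here is a minimal sketch of the simplest way to compute pass@k: a problem counts as solved if any of its first k attempts succeeds. The article does not say which estimator Mistral uses, and the outcomes below are made up, not FLTEval data.

```python
def pass_at_k(results, k):
    """Share of problems with at least one success among the first k attempts.

    `results` maps a problem id to a list of booleans, one per attempt.
    """
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Made-up outcomes for three problems, eight attempts each.
toy_results = {
    "p1": [True] + [False] * 7,
    "p2": [False] * 5 + [True] + [False] * 2,
    "p3": [False] * 8,
}

print(f"pass@1 = {pass_at_k(toy_results, 1):.3f}")
print(f"pass@8 = {pass_at_k(toy_results, 8):.3f}")
```

Pass@8 can never be lower than pass@1 under this definition, since extra attempts only add chances to succeed.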
| Benchmark | Leanstral 1.5 | Detail |
|---|---|---|
| miniF2F (val + test) | 100% | Saturated, per Mistral |
| PutnamBench | 587 / 672 | ~$4 per problem |
| FATE-H | 87% | New state-of-the-art |
| FATE-X | 34% | New state-of-the-art |
| FLTEval pass@1 | 28.9 | Up from 21.9 |
| FLTEval pass@8 | 43.2 | Beats Opus 4.6’s 39.6 |
On PutnamBench, Leanstral edges out Seed-Prover 1.5 in its high setting by 7 problems. It does so at about $4 per problem. Mistral estimates Seed-Prover’s high setting at $300 or more per problem.
That setting runs a budget of 10 H20-days per problem. Mistral also compares against Goedel-Architect and AxProverBase. It notes Aleph Prover costs roughly $54 to $68 per problem.
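The cost gap is easier to see as ratios. The short calculation below uses only the per-problem figures quoted above; the dictionary layout and variable names are just for illustration, and the Seed-Prover figure is a lower bound since the article says "$300 or more".

```python
# Per-problem cost ranges as quoted in the article (USD, approximate).
cost_per_problem = {
    "Leanstral 1.5": (4, 4),
    "Aleph Prover": (54, 68),
    "Seed-Prover 1.5 (high)": (300, 300),  # lower bound: "$300 or more"
}

baseline = cost_per_problem["Leanstral 1.5"][0]
ratios = {
    name: (low / baseline, high / baseline)
    for name, (low, high) in cost_per_problem.items()
}

for name, (low, high) in ratios.items():
    if low == high:
        print(f"{name}: {low:.1f}x Leanstral's cost")
    else:
        print(f"{name}: {low:.1f}x to {high:.1f}x Leanstral's cost")
```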
Test-time scaling is the model’s defining behavior. Raising the token budget per attempt lifts PutnamBench Pass@8. The Mistral team reports 44 solved at 50k, 244 at 200k, 493 at 1M, and 587 at 4M.
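Converting those counts into solve rates shows how the curve flattens as the budget grows. This snippet uses only the four reported data points and the 672-problem total; nothing is interpolated between them.

```python
TOTAL_PROBLEMS = 672

# Token budget per attempt -> PutnamBench problems solved, as reported.
solved_by_budget = {50_000: 44, 200_000: 244, 1_000_000: 493, 4_000_000: 587}

rates = {budget: solved / TOTAL_PROBLEMS for budget, solved in solved_by_budget.items()}

previous = 0
for budget, solved in solved_by_budget.items():
    gain = solved - previous
    print(f"{budget:>9,} tokens: {solved:3d}/{TOTAL_PROBLEMS} = {rates[budget]:.1%} (+{gain})")
    previous = solved
```

Each step raises the budget by 4x or 5x, while the number of newly solved problems shrinks from 200 to 249 to 94, which is the usual shape of diminishing returns at the top of a benchmark.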
Case Studies and Use Cases
Leanstral was trained primarily on mathematics, but it also verifies code. The Mistral team documents two case studies that matter for engineers.
- First, Leanstral proved O(log n) time complexity for a real AVL tree implementation. AVL trees are self-balancing binary search trees. The proof used structural induction and monadic time tracking via the TimeM monad. It ran over 2.7 million tokens across 22 compactions. It established a bound near 48 steps per height unit, plus a constant.
- Second, Leanstral found real bugs in open-source code. An automated pipeline used Aeneas to translate Rust into Lean. Leanstral inferred user intent and generated correctness properties. It tried each property in 4 tries, then the negation in 4 more.
Across 57 repositories, it flagged 47 violated properties and 11 genuine bugs. Five were previously unreported on GitHub. One bug sat in the sign function for zigzag decoding in datrs/varinteger. On input Std.U64.MAX, the expression (value + 1) overflowed. That caused crashes in debug mode and silent corruption in release.
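The overflow is easy to reproduce in a small simulation. The code below is a reconstruction from the description above, not the actual datrs/varinteger source: it assumes the negative branch computes `-((value + 1) / 2)` on a 64-bit unsigned value, and it mimics Rust's debug (checked) and release (wrapping) arithmetic with explicit masking. The function names are hypothetical.

```python
U64_MAX = 2**64 - 1


def sign_naive_release(value):
    """Zigzag decode with wrapping u64 addition, as in a Rust release build."""
    if value & 1:
        wrapped = (value + 1) & U64_MAX  # U64_MAX + 1 wraps around to 0
        return -(wrapped // 2)
    return value // 2


def sign_naive_debug(value):
    """Same expression with overflow checks, as in a Rust debug build."""
    if value & 1:
        if value + 1 > U64_MAX:
            raise OverflowError("attempt to add with overflow")
        return -((value + 1) // 2)
    return value // 2


def sign_safe(value):
    """Overflow-free zigzag decode: (value >> 1) XOR -(value & 1)."""
    return (value >> 1) ^ -(value & 1)


# The variants agree on small inputs...
for v in range(6):
    print(v, sign_naive_release(v), sign_safe(v))

# ...but diverge at the top of the u64 range.
print("release:", sign_naive_release(U64_MAX))
print("safe:   ", sign_safe(U64_MAX))
try:
    sign_naive_debug(U64_MAX)
except OverflowError as exc:
    print("debug:  ", exc)
```

This is the kind of property a prover can state and refute mechanically: "decoding agrees with the overflow-free formula for every u64 input" fails at exactly one edge value that random testing is unlikely to hit.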
Practical use cases follow directly from these examples. Dev teams can complete partial proofs inside a repository. They can generate correctness properties for a function automatically. They can stress-test Rust code by proving or disproving inferred invariants.
Getting Started: Code and Deployment
The simplest path is Mistral Vibe, Mistral’s agent CLI. Leanstral runs on Mistral’s free plan. Enable ‘Labs models’ in your account, then create an API key.
Install Vibe, add the Lean agent, then launch it:
# 1. Install Mistral Vibe
uv tool install mistral-vibe
uv tool upgrade mistral-vibe
vibe --setup
# 2. Inside vibe, install Leanstral, then leave vibe
/leanstall
exit
# 3. Launch the Lean agent
vibe --agent lean
For self-hosting, install vLLM 0.24.0 or newer, then serve the weights:
# Installs mistral_common >= 1.11.5 automatically
uv pip install -U vllm --torch-backend=auto
vllm serve mistralai/Leanstral-1.5-119B-A6B \
  --max-model-len 200000 \
  --tensor-parallel-size 4 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral
Call the server through the OpenAI-compatible client. Set reasoning_effort to high for complex prompts, or none for speed:
from openai import OpenAI

# Point the OpenAI client at your vLLM server
client = OpenAI(api_key="EMPTY", base_url="<your-host-url>")

TEMP = 1.0
MAX_TOK = 32000
REASONING = "high"  # change to 'none' for faster answers

# Use the first (and usually only) model the server exposes
model = client.models.list().data[0].id

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Define the transition rules as an inductive proposition in Lean 4."}
    ]},
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    reasoning_effort=REASONING,
)

# Final answer, then the reasoning trace returned by the server
print(response.choices[0].message.content)
print(response.choices[0].message.reasoning)
Leanstral also supports OpenAI-style tool calling. You can expose a function such as lean_run_code to compile snippets. Mistral further recommends the lean-lsp-mcp server for tighter Lean integration.
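A tool definition for something like lean_run_code might look like the sketch below. The schema follows the OpenAI-style function-calling format, but the `code` parameter name, the description strings, and the placeholder handler are assumptions for illustration; the article names the function without specifying its signature. The handler deliberately does not run Lean, so the snippet works without a toolchain.

```python
import json

# OpenAI-style tool schema; the parameter name "code" is an assumption.
tools = [
    {
        "type": "function",
        "function": {
            "name": "lean_run_code",
            "description": "Compile a Lean 4 snippet and return compiler output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Lean 4 source to compile."}
                },
                "required": ["code"],
            },
        },
    }
]


def lean_run_code(code):
    """Placeholder handler: a real version would invoke a Lean toolchain."""
    return {"ok": False, "output": "lean_run_code is not wired to Lean in this sketch"}


HANDLERS = {"lean_run_code": lean_run_code}


def dispatch_tool_call(name, arguments_json):
    """Run the handler named by a tool call; arguments arrive as a JSON string."""
    handler = HANDLERS[name]
    return json.dumps(handler(**json.loads(arguments_json)))


# Simulated tool call, shaped like the name/arguments pair a response carries.
result = dispatch_tool_call(
    "lean_run_code", json.dumps({"code": "example : 1 + 1 = 2 := rfl"})
)
print(result)
```

To use it against a running server, pass `tools=tools` to `client.chat.completions.create`, dispatch each returned tool call, and append the result as a message with role `tool` before requesting the next completion.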
Key Takeaways
- Leanstral 1.5 is a free, Apache-2.0 Lean 4 proof-engineering model.
- It uses a 119B mixture-of-experts with 6.5B active parameters.
- It saturates miniF2F and solves 587 of 672 PutnamBench problems.
- It found 5 previously unreported bugs across open-source repositories.
- Access it via Hugging Face weights, a free API, or local vLLM.
