
Meet OpenJarvis: A Local-First Framework for On-Device Personal AI Agents with Tools, Memory, and Learning

Researchers at Stanford University and Lambda Labs have released the research paper for OpenJarvis, an open-source framework that runs inference, agents, memory, and learning entirely on-device.

The open-weight models configured through OpenJarvis land within 3.2 percentage points of the best cloud model on average, at roughly 800× lower marginal API cost per query and roughly 4× lower latency under the paper’s benchmark protocol. This work builds on the research team’s earlier Intelligence Per Watt study, which reported that local models already handle 88.7% of single-turn chat and reasoning queries at interactive latency, with intelligence efficiency improving 5.3× from 2023 to 2025.

Model Overview & Access

OpenJarvis is not a single model. It is a framework that composes any supported model with a configurable agent stack, evaluated across 11 local models from 4 families.

Property | Value
License | Apache 2.0
Framework release | March 12, 2026
Paper | arXiv:2605.17172 (posted May 16, 2026)
Repository | github.com/open-jarvis/OpenJarvis
Stars / forks | ~5.4k / ~1.2k (June 2026)
Languages | Python (~83%), Rust (~9%), TypeScript (~7%)
Evaluated models | 11 local models across 4 families: Qwen3.5, Gemma4, Nemotron, Granite
Cloud baselines | Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
Supported engines | Ollama, vLLM, SGLang, llama.cpp, Apple Foundation Models, Exo (among others)
Context window | Model-dependent
Installation | Single command; ~3 minutes on broadband
Hardware | Tested on 7 platforms, from Mac Mini M4 to NVIDIA DGX Spark

Architecture: Five Primitives and a Spec

OpenJarvis decomposes a personal AI system into five typed primitives, composed through a single declarative configuration object called a spec.

  • Intelligence: the model, weights, generation parameters, and quantization format.
  • Engine: the inference runtime (Ollama, vLLM, SGLang, etc.), batching, KV-cache settings, and hardware path.
  • Agents: the reasoning loop (ReAct or CodeAct), system prompts, tool-use policy, and turn limits.
  • Tools & Memory: external interfaces, retrieval backends, 25+ data connectors, and 32+ messaging channels, with native MCP support and interchangeable memory backends.
  • Learning: the optimizer that updates the spec from traces. This slot accepts LoRA, DSPy, GEPA, or LLM-guided spec search.

Each primitive is independently swappable, and a spec serializes all five into a TOML file. Two specs can share the same agent and tool configuration and differ only in model and engine, so the same behavior runs on a Mac Mini and a workstation without rewriting prompts.

LLM-guided spec search is the second contribution. It is a local–cloud collaboration: a frontier cloud model acts as a teacher at search time, reading traces, diagnosing failure clusters, and proposing edits across Intelligence, Engine, Agents, and Tools & Memory. An edit is accepted only if it improves the target failure cluster without causing meaningful regressions elsewhere; the research team calls this the gate (default tolerance 1%). The optimized spec then runs entirely on-device at inference time, with zero cloud calls. The teacher is used only at search time; at 100 queries per day, the amortized teacher cost falls below $0.001 per query within six months.
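The gate's accept/reject rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each failure cluster has an accuracy score between 0 and 1, reads the 1% tolerance as an absolute drop of 0.01, and uses made-up cluster names and scores. The function name gate_accepts is likewise hypothetical.

```python
def gate_accepts(before: dict[str, float], after: dict[str, float],
                 target: str, tolerance: float = 0.01) -> bool:
    """Accept an edit only if the target cluster improves and no other
    cluster drops by more than `tolerance` (assumed to be absolute)."""
    if after[target] <= before[target]:
        return False  # the edit did not help the cluster it was aimed at
    return all(
        after[name] >= before[name] - tolerance
        for name in before
        if name != target
    )


# Hypothetical per-cluster accuracies before and after two proposed edits.
before = {"tool_args": 0.60, "long_context": 0.80, "formatting": 0.90}
good_edit = {"tool_args": 0.72, "long_context": 0.795, "formatting": 0.90}
bad_edit = {"tool_args": 0.75, "long_context": 0.70, "formatting": 0.90}

print(gate_accepts(before, good_edit, target="tool_args"))  # prints True
print(gate_accepts(before, bad_edit, target="tool_args"))   # prints False
```

The second edit improves the target cluster more than the first, yet it is rejected because it costs ten points on another cluster, which is the kind of regression the gate is described as blocking.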

Prior work (GEPA, DSPy, LoRA) optimizes one primitive at a time, and prompt optimizers alone recover only about 5 pp of the cloud–local gap. LLM-guided spec search recovers 13–32 pp because it edits across primitives jointly, at 7–11× lower optimization cost than single-primitive baselines. The four-primitive move space contributes 5.5–16.5 pp, and the LLM proposer adds about 10 pp on average over an evolutionary search on the same move space.

https://arxiv.org/pdf/2605.17172v1

Capabilities & Performance

OpenJarvis was evaluated across 8 benchmarks spanning 508 tasks: tool calling (ToolCall-15), agentic workflows (PinchBench), coding (LiveCodeBench), customer service (τ-Bench V2, τ²-Bench Telecom), general assistance (GAIA), and deep research (LiveResearchBench, DeepResearchBench).

The swap test: Replacing the intended cloud model with Qwen3.5-9B in existing frameworks (OpenClaw, Hermes Agent) drops accuracy by 25–39 pp. With the same model under an OpenJarvis spec, the residual drop shrinks to 5.6–16.5 pp, recovering 56–77% of the portability loss.

The accuracy frontier: The best single local model, Qwen3.5-122B, reaches 80.3% average accuracy versus Claude Opus 4.6 at 83.5%, a 3.2 pp gap. Local specs match or exceed cloud on 4 of 8 benchmarks: ToolCall-15, PinchBench, LiveCodeBench, and τ-Bench V2.

Cost and latency: Local configurations form the accuracy–efficiency frontier. Qwen3.5-122B delivers its 80.3% at roughly a thousandth of a cent per query, versus $0.009 per query for Claude Opus 4.6, an approximately 800× marginal API-cost advantage. End-to-end latency drops by roughly 4× on the agentic workloads, though the paper notes single-shot prompts can favor cloud serving.
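The per-query figures above, together with the 100-queries-per-day scenario used for the teacher amortization, lend themselves to a quick back-of-envelope check. The sketch below only rearranges numbers quoted in the article; the 182-day length for "six months" is an assumption, and the resulting teacher budget is an implied upper bound under that assumption, not a figure reported by the authors.

```python
CLOUD_COST_PER_QUERY = 0.009   # USD, Claude Opus 4.6 figure quoted above
COST_RATIO = 800               # approximate marginal-cost advantage quoted above
QUERIES_PER_DAY = 100          # scenario used for the teacher amortization
DAYS_IN_SIX_MONTHS = 182       # assumption: half of a 365-day year, rounded down
AMORTIZED_TARGET = 0.001       # USD per query, the amortized teacher-cost threshold

# Local marginal cost implied by the cloud price and the ~800x ratio
local_cost_per_query = CLOUD_COST_PER_QUERY / COST_RATIO

# Daily spend at the stated query volume
daily_cloud = CLOUD_COST_PER_QUERY * QUERIES_PER_DAY
daily_local = local_cost_per_query * QUERIES_PER_DAY

# Largest one-off teacher bill that still amortizes to under the target
queries_in_six_months = QUERIES_PER_DAY * DAYS_IN_SIX_MONTHS
teacher_budget = AMORTIZED_TARGET * queries_in_six_months

print(f"local cost per query: ${local_cost_per_query:.6f}")
print(f"daily cost: cloud ${daily_cloud:.2f}, local ${daily_local:.4f}")
print(f"implied teacher budget over six months: about ${teacher_budget:.2f}")
```

Dividing $0.009 by 800 gives roughly $0.000011 per query, which is consistent with the "thousandth of a cent" wording, and 18,200 queries at $0.001 each puts the implied one-off search bill at about $18 under these assumptions.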

Search gains: LLM-guided spec search improves the Qwen3.5-9B student to 100% on PinchBench, 83% on LiveCodeBench, and 91% on LiveResearchBench. Across the full eight-benchmark suite, average gains per student model range from 13.1 to 31.5 pp. The authors report that these gains survive their robustness checks (reward-weight variants, search-seed variance, and random restarts).

How to Use It

Installation is one command. On macOS, Linux, or WSL2:

curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash

Windows users run an equivalent PowerShell script (irm … | iex). The installer provisions uv, a Python virtual environment, Ollama, and a starter model in about three minutes on broadband. A desktop GUI ships as a .dmg, .exe, .deb, .rpm, or .AppImage from the releases page.

After install, jarvis starts a chat session. Starter presets cover common workflows:

jarvis init --preset morning-digest-mac    # daily briefing with TTS
jarvis init --preset deep-research         # multi-hop research with citations
jarvis init --preset code-assistant        # agent with code execution and shell access
jarvis init --preset scheduled-monitor     # stateful agent on a schedule

The framework ships with eight built-in agents across three execution modes: on-demand, scheduled, and continuous. It connects to 25+ data sources (Gmail, Calendar, iMessage, Notion, Obsidian, Slack, GitHub, and others) and exposes agents over 32+ messaging channels (WhatsApp, Telegram, Discord, iMessage, Signal, and others).

Skills can be imported from external catalogs, about 150 from Hermes Agent and about 13,700 community skills from OpenClaw, all following the agentskills.io specification. A jarvis optimize skills --policy dspy command refines them from local trace history.

Marktechpost’s Visual Explainer

Stanford · Hazy Research + Scaling Intelligence Lab
OpenJarvis

An open-source, local-first framework for personal AI agents that runs inference, agents, memory, and learning entirely on-device.

Within 3.2 pp of best cloud
~800× lower marginal API cost
~4× lower latency

Apache 2.0  •  arXiv:2605.17172  •  Framework released March 12, 2026

What it is

Personal AI that runs on your hardware

Most “personal” AI still routes every query through a cloud API. OpenJarvis makes local-first the default and calls the cloud only when needed, building on the team’s Intelligence Per Watt finding that local models already handle 88.7% of single-turn queries.

License: Apache 2.0
Repository: github.com/open-jarvis/OpenJarvis
Models: 11 local models · 4 families (Qwen3.5, Gemma4, Nemotron, Granite)
Engines: Ollama, vLLM, SGLang, llama.cpp, Apple FM, Exo

Architecture

Five primitives, one spec

A personal AI system is decomposed into five typed, independently swappable primitives, composed through a single declarative spec serialized to portable TOML.

  • Intelligence: model, weights, generation params, quantization
  • Engine: inference runtime, batching, KV-cache, hardware path
  • Agents: reasoning loop (ReAct or CodeAct), prompts, tool policy
  • Tools & Memory: 25+ connectors, 32+ channels, native MCP
  • Learning: optimizer slot (LoRA, DSPy, GEPA, or spec search)

Key methodology

LLM-guided spec search

A frontier cloud model acts as a teacher at search time: it reads traces, diagnoses failure clusters, and proposes edits across primitives. A gate accepts only non-regressing edits. The optimized spec then runs entirely on-device, with zero cloud calls at inference time.

13–32 pp of the cloud–local gap closed
7–11× lower optimization cost vs single-primitive baselines

The four-primitive move space adds 5.5–16.5 pp; the LLM proposer adds ~10 pp over evolutionary search on the same move space.

Performance

Close to cloud, far cheaper

3.2 pp gap: Qwen3.5-122B 80.3% vs Claude Opus 4.6 83.5%
4 / 8 benchmarks where local matches or beats cloud

  • Matches/exceeds cloud on ToolCall-15, PinchBench, LiveCodeBench, τ-Bench V2
  • ~800× lower marginal API cost; ~4× lower latency (paper’s protocol)
  • Swap test: a 25–39 pp drop shrinks to 5.6–16.5 pp under a spec (56–77% recovered)

Developer experience

From zero to an agent in minutes

One command provisions uv, a Python virtual environment, Ollama, and a starter model (~3 minutes on broadband):

curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash

  • 8 built-in agents across on-demand, scheduled, and continuous modes
  • 25+ data connectors · 32+ messaging channels
  • Skills via agentskills.io: ~150 from Hermes Agent, ~13,700 from OpenClaw

The bottom line

A research platform and a production foundation

OpenJarvis trades roughly 3.2 pp of accuracy, with the gap concentrated in reasoning- and research-heavy tasks, for major cost, latency, and privacy gains. Inference, agent state, and memory stay on-device by construction; the cloud teacher is optional and bounded.

Caveats: results average 5 runs per configuration, use GPT-5-mini as judge, and were run on a single machine. Apache 2.0 and actively maintained; built, in the authors’ words, “in the spirit of PyTorch” for local AI.




Key Takeaways

  • OpenJarvis runs inference, agents, memory, and learning fully on-device, landing within 3.2 pp of the best cloud model at ~800× lower marginal API cost and ~4× lower latency.
  • A typed "spec" decomposes the stack into five swappable primitives (Intelligence, Engine, Agents, Tools & Memory, and Learning), serialized to portable TOML.
  • LLM-guided spec search uses a frontier cloud model as a search-time teacher to recover 13–32 pp of the cloud–local gap at 7–11× lower optimization cost, then runs locally with zero cloud calls.
  • Local specs match or exceed cloud on 4 of 8 benchmarks (ToolCall-15, PinchBench, LiveCodeBench, τ-Bench V2); the remaining gap concentrates on reasoning- and research-heavy tasks.



The post Meet OpenJarvis: A Local-First Framework for On-Device Personal AI Agents with Tools, Memory, and Learning appeared first on MarkTechPost.
