DeepAgent: A Deep Reasoning AI Agent that Performs Autonomous Thinking, Tool Discovery, and Action Execution within a Single Reasoning Process
Most agent frameworks still run a predefined Reason, Act, Observe loop, so the agent can only use the tools that are injected in the prompt. This works for small tasks, but it fails when the toolset is large, when the task is long, and when the agent must change strategy in the middle of reasoning. The team from Renmin University of China and Xiaohongshu proposes DeepAgent as an end-to-end deep reasoning agent that keeps all of this inside one coherent reasoning process.

Unified Reasoning With On Demand Tool Discovery
DeepAgent lets the model output 4 action types directly in text: internal thought, tool search, tool call, and memory fold. When the agent decides to search, it queries a dense index that contains tool descriptions from large registries, for example 16,000 plus RapidAPI tools and 3,912 ToolHop tools, then it receives only the top-ranked tools back in context. This makes tool access dynamic: the model does not depend on a front-loaded tool list, and it stays aligned with real environments where tools change.
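The tool search step described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the registry contents, the `embed` and `search_tools` names, and the bag-of-words "embedding" are all hypothetical stand-ins, and a real system would use a learned dense encoder over thousands of tool descriptions rather than word counts.

```python
import math
import re
from collections import Counter

# Hypothetical toy registry; real registries hold thousands of tool descriptions.
TOOL_REGISTRY = {
    "get_weather": "Return the current weather forecast for a city",
    "search_movies": "Search a movie database by title or actor",
    "play_track": "Play a music track or playlist by name",
}


def embed(text: str) -> Counter:
    """Stand-in embedding (assumption): bag-of-words counts instead of a dense encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Index every tool description once, before any query arrives.
TOOL_INDEX = {name: embed(desc) for name, desc in TOOL_REGISTRY.items()}


def search_tools(query: str, top_k: int = 2) -> list[str]:
    """Return the names of the top_k tools ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(TOOL_INDEX, key=lambda name: cosine(q, TOOL_INDEX[name]), reverse=True)
    return ranked[:top_k]
```

In this sketch, only the names returned by `search_tools` would be placed back into the model's context, which mirrors the idea that the prompt never has to carry the full registry.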
Autonomous Memory Folding for Long Horizon Tasks
Long sequences of tool calls, web results, and code responses will overflow the context. DeepAgent solves this with an autonomous memory folding step. When the model emits the fold token, an auxiliary LLM compresses the full history into three memories: Episodic Memory, which records task events; Working Memory, which records the current sub-goal and recent issues; and Tool Memory, which records tool names, arguments, and outcomes. These memories are fed back as structured text, so the agent continues from a compact but information-rich state.
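A rough sketch of the folding step may help make the three-memory structure concrete. Everything named here is an assumption for illustration: `FoldedMemory`, `fold_memory`, the bracketed section labels, and `fake_summarizer` (a placeholder for the auxiliary LLM) are hypothetical and are not taken from the DeepAgent code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FoldedMemory:
    episodic: str  # task events so far
    working: str   # current sub-goal and recent issues
    tool: str      # tool names, arguments, and outcomes

    def to_context(self) -> str:
        """Render the three memories as structured text (label format is an assumption)."""
        return (
            f"[Episodic Memory]\n{self.episodic}\n"
            f"[Working Memory]\n{self.working}\n"
            f"[Tool Memory]\n{self.tool}"
        )


def fold_memory(history: list[str], summarize: Callable[[str, str], str]) -> list[str]:
    """Replace the full interaction history with one structured memory message."""
    transcript = "\n".join(history)
    memory = FoldedMemory(
        episodic=summarize("episodic", transcript),
        working=summarize("working", transcript),
        tool=summarize("tool", transcript),
    )
    return [memory.to_context()]


def fake_summarizer(kind: str, transcript: str) -> str:
    """Placeholder for the auxiliary LLM: it only reports how many lines it was given."""
    return f"{kind} summary of {len(transcript.splitlines())} lines"
```

Calling `fold_memory(history, fake_summarizer)` on a long history returns a single-element context, so the agent would resume from the compact memory text instead of the raw transcript.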
ToolPO, Reinforcement Learning for Tool Use
Supervised traces do not teach robust tool use, because correct tool calls are only a few tokens inside a long generation. The research team introduces Tool Policy Optimization, ToolPO, to fix this. ToolPO runs rollouts on LLM-simulated APIs, so training is stable and cheap, then it attributes reward to the exact tool call tokens (this is tool call advantage attribution), and it trains with a clipped PPO-style objective. This is how the agent learns not only to call tools, but also to decide when to search and when to fold memory.
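The two ingredients named above, advantage attribution to tool call tokens and a clipped PPO-style objective, can be sketched in a few lines. This is one plausible reading and not the paper's exact formulation: the way the outcome advantage and the tool call advantage are combined, the function names, and the `eps` default are all assumptions made for illustration.

```python
import math


def token_advantages(global_adv: float, tool_adv: float, tool_mask: list[int]) -> list[float]:
    """Assumed scheme: every token gets the outcome advantage, and tokens
    inside a tool call (mask == 1) additionally get the tool call advantage."""
    return [global_adv + tool_adv * m for m in tool_mask]


def clipped_objective(
    logp_new: list[float],
    logp_old: list[float],
    advantages: list[float],
    eps: float = 0.2,
) -> float:
    """Mean clipped surrogate over tokens, in the standard PPO form (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio new / old for this token
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The mask is what concentrates the learning signal: in a long generation where only a handful of tokens form the tool call, those tokens receive a larger advantage than the surrounding text.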

Benchmarks, Labeled Tools vs Open Set Tools
The research team evaluates on 5 general tool use benchmarks, ToolBench, API Bank, TMDB, Spotify, ToolHop, and on 4 downstream tasks, ALFWorld, WebShop, GAIA, HLE. In the labeled tool setting, where every method is given the exact tools it needs, DeepAgent 32B RL with a QwQ 32B backbone reports 69.0 on ToolBench, 75.3 on API Bank, 89.0 on TMDB, 75.4 on Spotify, and 51.3 on ToolHop, which is the strongest 32B level result across all 5 datasets. Workflow baselines such as ReAct and CodeAct can match single datasets, for example ReAct with strong models is high on TMDB and Spotify, but none of them stay high on all 5, so the fair summary is that DeepAgent is more uniform, not that others are always low.
In the open set retrieval setting, which is the realistic one, DeepAgent must first find tools and then call them. Here DeepAgent 32B RL reaches 64.0 on ToolBench and 40.6 on ToolHop, while the strongest workflow baselines reach 55.0 on ToolBench and 36.2 on ToolHop, so the end-to-end agent still holds the lead. The research team also shows that autonomous tool retrieval itself lifts workflow agents, but DeepAgent gains more, which confirms that the architecture and the training are matched to large toolsets.
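To make the open set lead concrete, the margins can be computed directly from the four scores quoted above. The dictionary layout and the `margins` helper are hypothetical; only the numbers come from the text.

```python
# Open set scores as quoted in the text above.
OPEN_SET = {
    "ToolBench": {"DeepAgent 32B RL": 64.0, "best workflow baseline": 55.0},
    "ToolHop": {"DeepAgent 32B RL": 40.6, "best workflow baseline": 36.2},
}


def margins(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Point difference between DeepAgent and the best workflow baseline per benchmark."""
    return {
        bench: round(s["DeepAgent 32B RL"] - s["best workflow baseline"], 1)
        for bench, s in scores.items()
    }


print(margins(OPEN_SET))  # {'ToolBench': 9.0, 'ToolHop': 4.4}
```

The lead is 9.0 points on ToolBench and 4.4 points on ToolHop.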

Downstream Environments
On ALFWorld, WebShop, GAIA, and HLE, all under a 32B reasoning model, DeepAgent reports 91.8 percent success on ALFWorld, 34.4 percent success and a 56.3 score on WebShop, 53.3 on GAIA, and a higher score than workflow agents on HLE. These tasks are longer and noisier, so the combination of memory folding and ToolPO is the likely source of the gap.
Key Takeaways
- DeepAgent keeps the whole agent loop inside one reasoning stream: the model can think, search tools, call them, and continue, so it is not limited to a fixed ReAct-style workflow.
- It uses dense retrieval over large tool registries, 16,000 plus RapidAPI tools and about 3,900 ToolHop tools, so tools do not have to be pre-listed in the prompt; they are discovered on demand.
- The autonomous memory folding module compresses long interaction histories into episodic, working, and tool memories, which prevents context overflow and keeps long horizon reasoning stable.
- Tool Policy Optimization, ToolPO, trains tool use end to end with simulated APIs and token level advantage attribution, so the agent learns to issue correct tool calls, not only to reach the final answer.
- On 5 tool benchmarks and 4 downstream tasks, DeepAgent at 32B scale is more consistent than workflow baselines in both labeled tool and open set settings, especially on ToolBench and ToolHop where tool discovery matters most.

Editorial Comments
DeepAgent is a practical step toward agent architectures that do not depend on fixed tool prompts, because it unifies autonomous thinking, dense tool retrieval over 16,000 plus RapidAPIs and 3,900 plus ToolHop tools, structured tool calling, and memory folding in a single loop. The use of LLM-simulated APIs in ToolPO is an engineering choice, but it solves the latency and instability problem that hurts prior tool agents. The evaluation shows consistent 32B level gains in both labeled tool and open set settings, not isolated peaks. This release makes large toolspaces actually usable for LLM agents. Overall, DeepAgent confirms that end-to-end tool agents with memory and RL are emerging as the default pattern.
The post DeepAgent: A Deep Reasoning AI Agent that Performs Autonomous Thinking, Tool Discovery, and Action Execution within a Single Reasoning Process appeared first on MarkTechPost.
