CMU Researchers Introduce PPP and UserVille To Train Proactive And Personalized LLM Agents
Most LLM agents are tuned to maximize task success. They resolve GitHub issues or answer deep research queries, but they do not reason carefully about when to ask the user questions or how to respect different interaction preferences. How can we design LLM agents that know when to ask better questions and adapt their behavior to each individual user?
A team of researchers from Carnegie Mellon University (CMU) and OpenHands formalizes these missing behaviors as 3 joint objectives, Productivity, Proactivity, and Personalization, and optimizes them with a multi-objective reinforcement learning framework called PPP inside a new environment named UserVille.

From task success to interaction-aware agents
The research team defines:
- Productivity as task completion quality, for example F1 on SWE-Bench Verified function localization or exact match on BrowseComp-Plus.
- Proactivity as asking essential clarifying questions when the initial prompt is vague while avoiding unnecessary queries.
- Personalization as following user-specific interaction preferences such as brevity, format, or language.
UserVille, an interactive environment with preference-aware simulators
UserVille converts existing agent benchmarks into an interaction-centric RL environment populated by LLM-based user simulators.
It has 3 stages:
- Prompt Vaguenization: Precise task prompts are rewritten into vague prompts that keep the same intent but remove details. This creates information asymmetry: the simulator still observes the precise prompt, while the agent only sees the vague version.
- Preference-Aware User Simulation: Each user simulator is parameterized by a preference from a pool of 20 types. Preferences cover brevity, number of questions per turn, answer format, timing, language constraints, or requirements such as JSON-formatted questions. Twelve preferences are used in training and 8 preferences are held out for generalization tests.
- User-Centric Evaluation: After the task, the simulator labels each question as low effort, medium effort, or high effort based on whether it can answer using the precise prompt and how hard it is to answer. The proactivity score is 1 if the overall session is low effort, otherwise 0. The personalization score is 1 if the agent follows the preference, otherwise 0, averaged over sessions where the agent asked at least 1 question (see the sketch after this list).
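As a concrete illustration of these scoring rules, here is a minimal Python sketch under the assumption that each session carries the simulator's per-question effort labels and a preference-adherence flag. The Session class and function names are hypothetical, not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Session:
    # Effort label the simulator assigns to each agent question: "low", "medium", or "high".
    question_efforts: List[str]
    # Whether the agent respected the user's interaction preference in this session.
    followed_preference: bool

def proactivity_score(session: Session) -> int:
    # 1 if the overall session is low effort, i.e. no medium- or high-effort
    # questions were asked, else 0 (a simplified reading of the rule).
    return int(all(e == "low" for e in session.question_efforts))

def personalization_score(sessions: List[Session]) -> float:
    # 1 or 0 per session, averaged only over sessions where the agent
    # asked at least one question.
    asked = [s for s in sessions if s.question_efforts]
    if not asked:
        return 0.0
    return sum(int(s.followed_preference) for s in asked) / len(asked)
```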
UserVille is instantiated in 2 domains: software engineering, with SWE-Gym for training and SWE-Bench Verified and SWE-Bench Full for evaluation, and deep research, with BrowseComp-Plus and a search plus open_page tool scaffold.

PPP, multi-objective RL for productive, proactive, and personalized agents
Agents are implemented as ReAct-style tool-using policies based on Seed-OSS-36B-Instruct. They can call domain tools and an ask_user tool that queries the user simulator.
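The paper does not spell out the exact tool interface, but an ask_user tool in a ReAct-style scaffold could be exposed to the policy roughly as in the sketch below; the schema fields and the stubbed simulator reply are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical JSON-schema style definition of the ask_user tool that sits
# alongside the domain tools (e.g. code search, open_page).
ASK_USER_TOOL = {
    "name": "ask_user",
    "description": "Ask the user a clarifying question about the task.",
    "parameters": {
        "type": "object",
        "properties": {
            "question": {"type": "string", "description": "The clarifying question to ask."}
        },
        "required": ["question"],
    },
}

def ask_user(question: str, precise_prompt: str, preference: str) -> str:
    """Route the agent's question to the preference-aware user simulator.

    The simulator sees the precise prompt (which the agent never sees), answers
    the question, and separately records an effort label and whether the question
    respects the active preference.
    """
    # Placeholder for an LLM call to the simulator (e.g. a small model prompted
    # with the precise task description and the active preference).
    return f"[simulated user reply to: {question!r}]"
```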
PPP defines a trajectory-level reward
R = RProd + RProact + RPers.
- Productivity reward RProd is the task metric, F1 on SWE-Func-Loc or exact match on BrowseComp-Plus.
- Proactivity reward RProact gives a bonus of +0.05 if all questions in the session are low effort and applies penalties of −0.1 for each medium-effort question and −0.5 for each high-effort question.
- Personalization reward RPers adds +0.05 when the agent follows the preference and applies non-positive penalties defined by the preference-specific rule for each violation (a sketch of the combined reward follows this list).
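As a rough illustration of how these terms combine into R = RProd + RProact + RPers, here is a minimal Python sketch. The per-preference violation penalty is not specified in the summary above, so the value used here is a labeled assumption.

```python
def trajectory_reward(
    task_metric: float,           # R_Prod: e.g. F1 on SWE-Func-Loc or exact match on BrowseComp-Plus
    question_efforts: list,       # simulator labels per question: "low", "medium", or "high"
    followed_preference: bool,    # did the agent respect the user's interaction preference
    num_violations: int = 0,      # preference violations in the session
    violation_penalty: float = -0.05,  # ASSUMPTION: the actual penalty is preference-specific
) -> float:
    # Productivity term: the task metric itself.
    r_prod = task_metric

    # Proactivity term: +0.05 bonus when every question is low effort
    # (an empty session counts as all low effort in this sketch),
    # -0.1 per medium-effort question, -0.5 per high-effort question.
    r_proact = 0.05 if all(e == "low" for e in question_efforts) else 0.0
    r_proact += -0.1 * question_efforts.count("medium")
    r_proact += -0.5 * question_efforts.count("high")

    # Personalization term: +0.05 when the preference is followed,
    # plus non-positive penalties for each violation.
    r_pers = 0.05 if followed_preference else 0.0
    r_pers += violation_penalty * num_violations

    return r_prod + r_proact + r_pers
```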
Training uses a GRPO-based RL algorithm with the Clip-Higher technique and token-level policy gradient loss from DAPO, and only optimizes LLM-generated tokens. The training environment is implemented with Verl. Seed-OSS-36B-Instruct is trained for 200 steps with batch size 64 and group size 8. Maximum output lengths are 32k tokens for SWE-Func-Loc, 65k for SWE-Full, and 41k for deep research. GPT-5 Nano is used as the user simulator. SWE scaffolds are based on OpenHands, and deep research uses a search tool and an open_page tool with Qwen3-Embed-8B as the retriever.
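For reference, the stated hyperparameters can be collected into a configuration summary like the sketch below; the key names are illustrative and do not correspond to Verl's actual configuration schema.

```python
# Hypothetical summary of the training setup described above.
# Key names are illustrative; they are not Verl's real config fields.
ppp_training_config = {
    "policy_model": "Seed-OSS-36B-Instruct",
    "algorithm": "GRPO",                  # with Clip-Higher and DAPO token-level loss
    "optimize_only_llm_tokens": True,
    "train_steps": 200,
    "batch_size": 64,
    "group_size": 8,                      # rollouts per prompt for the group baseline
    "max_output_tokens": {
        "swe_func_loc": 32_000,
        "swe_full": 65_000,
        "deep_research": 41_000,
    },
    "user_simulator": "GPT-5 Nano",
    "retriever": "Qwen3-Embed-8B",        # deep research search scaffold
}
```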

Experimental results
Table 2 in the paper evaluates productivity, proactivity, and personalization on SWE-Bench Verified Func-Loc and BrowseComp-Plus, using vague prompts and averaging over the 20 preferences.

For the Seed-OSS-36B-Instruct base model:
- on SWE-Func-Loc, productivity 38.59, proactivity 43.70, personalization 69.07
- on BrowseComp-Plus, productivity 18.20, proactivity 37.60, personalization 64.76.
After PPP RL training, the PPP model reaches:
- on SWE-Func-Loc, productivity 56.26, proactivity 75.55, personalization 89.26
- on BrowseComp-Plus, productivity 26.63, proactivity 47.69, personalization 76.85.
The average gain across all 3 dimensions and both datasets is 16.72 points relative to Seed-OSS-36B-Instruct, and PPP also outperforms GPT-5 and other GPT-series baselines on the combined metric.
Interaction is crucial for vague prompts. On SWE-Func-Loc, F1 with precise prompts and no interaction is 64.50. With vague prompts and no interaction it drops to 44.11. Adding interaction without RL does not recover this gap. With PPP training and interaction, F1 under vague prompts improves by 21.66 points.
PPP also changes interaction behavior. The ask ratio on SWE-Func-Loc rises from 50% to 100% under vague prompts and from 51% to 85% on deep research, while remaining low for precise prompts. The number of questions per session increases early in training, then stabilizes with a high proportion of low-effort questions and very few high-effort questions.
Key Takeaways
- PPP frames agent training as a multi-objective RL problem that jointly optimizes Productivity, Proactivity, and Personalization, instead of focusing solely on task success.
- UserVille builds vague-prompt versions of existing benchmarks and pairs them with preference-aware user simulators, which enforce 20 distinct interaction preferences and label user effort levels.
- The total reward combines the task metric, user effort, and preference adherence, using bonuses for low-effort questions and penalties for medium- and high-effort questions or preference violations, implemented with a GRPO-based RL algorithm.
- On SWE-Bench Func-Loc and BrowseComp-Plus with vague prompts, PPP-trained Seed-OSS-36B significantly improves all 3 metrics over the base model and over GPT-5 baselines, with an average gain of about 16.72 points across dimensions and datasets.
- PPP agents generalize to unseen preferences, alternate simulators, and harder tasks such as SWE-Bench Full, and they learn to ask fewer but more targeted low-effort questions, especially when prompts are vague.
Editorial Comments
PPP and UserVille mark an important step toward interaction-aware LLM agents, since they explicitly encode Productivity, Proactivity, and Personalization in the reward design, use preference-aware user simulators that enforce 20 interaction preferences, and apply GRPO with DAPO-style token-level optimization inside Verl and OpenHands scaffolds. The improvements on SWE-Bench Func-Loc, SWE-Bench Full, and BrowseComp-Plus show that interaction modeling is now a core capability, not an auxiliary feature.
Check out the Paper and Repo.
