How to Build a Fully Functional Computer-Use Agent that Thinks, Plans, and Executes Virtual Actions Using Local AI Models
In this tutorial, we build a fully functional computer-use agent from scratch that can reason, plan, and execute virtual actions using a local open-weight model. We create a miniature simulated desktop, equip it with a tool interface, and design an intelligent agent that can analyze its environment, decide on actions like clicking or typing, and execute them step by step. By the end, we see how the agent interprets goals such as opening emails or taking notes, demonstrating how a local language model can mimic interactive reasoning and task execution. Check out the FULL CODES here.
!pip install -q transformers accelerate sentencepiece nest_asyncio

import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
nest_asyncio.apply()
We set up the environment by installing the essential libraries, such as Transformers, Accelerate, and nest_asyncio, which let us run local models and asynchronous tasks seamlessly in Colab. We prepare the runtime so that the upcoming components of our agent work efficiently without external dependencies. Check out the FULL CODES here.
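Optionally, we can run a quick sanity check to confirm that the libraries imported correctly and whether a GPU is available; this small snippet assumes a standard Colab runtime:

# Optional sanity check: confirm the installed Transformers version and GPU availability.
import transformers
print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())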
class LocalLLM:
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        # Load a small text2text model; use the GPU if one is available, else fall back to CPU.
        self.pipe = pipeline("text2text-generation", model=model_name, device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens
    def generate(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, temperature=0.0)[0]["generated_text"]
        return out.strip()
class VirtualComputer:
    def __init__(self):
        # A tiny simulated desktop: a browser, a notes app, and a read-only mail inbox.
        self.apps = {"browser": "https://example.com", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open at https://example.com\nSearch bar focused."
        self.action_log = []
    def screenshot(self):
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"
    def click(self, target: str):
        if target in self.apps:
            self.focus = target
            if target == "browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target == "notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target == "mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type": "click", "target": target})
    def type(self, text: str):
        if self.focus == "browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus == "notes":
            self.apps["notes"] += ("\n" + text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type": "type", "text": text})
We define the core components: a lightweight local model and a virtual computer. We use Flan-T5 as our reasoning engine and create a simulated desktop that can open apps, display screens, and respond to typing and clicking actions. Check out the FULL CODES here.
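Before wiring these pieces into an agent, we can exercise the simulated desktop directly. The short sketch below (with illustrative variable names) clicks into the notes app, types a line, and prints the resulting screen and action log:

# Minimal sketch: drive the VirtualComputer directly, outside the agent loop.
demo_computer = VirtualComputer()
demo_computer.click("notes")          # switch focus to the notes app
demo_computer.type("Buy milk")        # type into the currently focused app
print(demo_computer.screenshot())     # inspect the rendered "screen" text
print(demo_computer.action_log)       # every click/type is recorded here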
class ComputerTool:
    def __init__(self, computer: VirtualComputer):
        self.computer = computer
    def run(self, command: str, argument: str = ""):
        if command == "click":
            self.computer.click(argument)
            return {"status": "completed", "result": f"clicked {argument}"}
        if command == "type":
            self.computer.type(argument)
            return {"status": "completed", "result": f"typed {argument}"}
        if command == "screenshot":
            snap = self.computer.screenshot()
            return {"status": "completed", "result": snap}
        return {"status": "error", "result": f"unknown command {command}"}
We introduce the ComputerTool interface, which acts as the communication bridge between the agent's reasoning and the virtual desktop. We define high-level operations such as click, type, and screenshot, enabling the agent to interact with the environment in a structured way. Check out the FULL CODES here.
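As a rough usage sketch (variable names here are illustrative), every call to run returns a small status dictionary, including an error entry for commands the tool does not recognize:

# Illustrative calls against the tool interface defined above.
demo_tool = ComputerTool(VirtualComputer())
print(demo_tool.run("click", "mail"))       # {'status': 'completed', 'result': 'clicked mail'}
print(demo_tool.run("screenshot"))          # {'status': 'completed', 'result': '...full screen text...'}
print(demo_tool.run("scroll", "down"))      # {'status': 'error', 'result': 'unknown command scroll'}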
class ComputerAgent:
    def __init__(self, llm: LocalLLM, tool: ComputerTool, max_trajectory_budget: float = 5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget
    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        while steps_remaining > 0:
            # Show the model the current screen and ask for the next action.
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <click/type/screenshot> ARG <target or text> THEN <assistant message>.\n"
            )
            thought = self.llm.generate(prompt)
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            # Parse the ACTION / ARG / THEN convention, with safe defaults.
            action = "screenshot"; arg = ""; assistant_msg = "Working..."
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ", 1)[1]
                    action = after.split()[0].strip()
                if "ARG " in line:
                    part = line.split("ARG ", 1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ", 1)[1].strip()
            output_events.append({"summary": [{"text": assistant_msg, "type": "summary_text"}], "type": "reasoning"})
            call_id = "call_" + uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action": {"type": action, "text": arg}, "call_id": call_id, "status": tool_res["status"], "type": "computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type": "computer_call_output", "call_id": call_id, "output": {"type": "input_image", "image_url": snap}})
            output_events.append({"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": assistant_msg}]})
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens, "completion_tokens": total_completion_tokens, "total_tokens": total_prompt_tokens + total_completion_tokens, "response_cost": 0.0}
        yield {"output": output_events, "usage": usage}
We assemble the ComputerAgent, which serves as the system's intelligent controller. We program it to reason about goals, decide which actions to take, execute them through the tool interface, and record every interaction as a step in its decision-making trajectory. Check out the FULL CODES here.
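To make the control loop concrete, here is a minimal sketch of how the ACTION/ARG/THEN convention gets parsed; the sample reply string is invented for illustration and is not real model output:

# Illustrative parse of the ACTION/ARG/THEN reply format the agent expects.
sample_reply = "ACTION click ARG mail THEN Here is your inbox summary."
action, arg, assistant_msg = "screenshot", "", "Working..."
for line in sample_reply.splitlines():
    if line.strip().startswith("ACTION "):
        action = line.split("ACTION ", 1)[1].split()[0].strip()
    if "ARG " in line:
        part = line.split("ARG ", 1)[1]
        arg = part.split(" THEN ")[0].strip() if " THEN " in part else part.strip()
    if "THEN " in line:
        assistant_msg = line.split("THEN ", 1)[1].strip()
print(action, "|", arg, "|", assistant_msg)   # click | mail | Here is your inbox summary.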
async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages = [{"role": "user", "content": "Open mail, read inbox subjects, and summarize."}]
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"] == "computer_call":
                a = event.get("action", {})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"] == "computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400], "...\n")
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])

loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())
We bring everything together by running the demo, where the agent interprets a user's request and performs tasks on the virtual computer. We watch it generate reasoning, execute commands, update the virtual screen, and reach its goal in a transparent, step-by-step manner.
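If we want the agent to attempt a different task, the same streaming loop works with another goal. The variation below is a minimal sketch under the same setup, reusing the event loop created above; the goal text is just an example:

# Optional: rerun the agent with a different goal, e.g. taking a note.
async def notes_demo():
    agent = ComputerAgent(LocalLLM(), ComputerTool(VirtualComputer()), max_trajectory_budget=3)
    messages = [{"role": "user", "content": "Open notes and type a reminder to send the weekly report."}]
    async for result in agent.run(messages):
        for event in result["output"]:
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"])

loop.run_until_complete(notes_demo())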
In conclusion, we implemented the essence of a computer-use agent capable of autonomous reasoning and interaction. We saw how a local language model like Flan-T5 can simulate desktop-level automation inside a safe, text-based sandbox. This project helps us understand the architecture behind intelligent computer-use agents, bridging natural-language reasoning with virtual tool control. It lays a solid foundation for extending these capabilities toward real-world, multimodal, and secure automation systems.
