This project is a Graphical User Interface (GUI) automation agent for macOS, designed with a two-layer architecture to separate visual perception from strategic decision-making. It can understand high-level user objectives, analyze the screen, and execute sequences of actions (clicks, typing, keyboard shortcuts) to achieve those goals.
The agent relies on two collaborating Language Models (LLMs):
- The Frontend VLM (Perception): A Vision Language Model that acts as the agent's "eyes." It receives a screenshot and a specific instruction from the supervisor. Its sole task is to analyze the image and propose a sequence of micro-actions (e.g., "click the button at position [x, y]," "type 'hello world'") in a strict JSON format.
  - Model used (configurable): `internvl3-8b-instruct`
- The Backend LLM (Strategy): A standard LLM that acts as the agent's "brain." It receives the user's overall goal, analyzes the VLM's output (or failure), evaluates if the plan is relevant, and makes the final decision to:
  - Give a new instruction to the VLM to refine the action.
  - Approve the action sequence proposed by the VLM for execution.
  - Correct or propose its own action sequence if the VLM is stuck or making repeated errors.
  - Determine if the task is complete or has failed.
  - Model used (configurable): `qwen/qwen3-8b`
This separation of concerns delegates the complex visual analysis task to a specialized model, while using a more "generalist" and strategic LLM for logic, error correction, and long-term planning.
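For illustration, a VLM-proposed action sequence might look like the JSON handled in the sketch below. The field names and action types shown are hypothetical; the real schema is defined by the prompts the supervisor sends to the VLM.

```python
import json

# Hypothetical shape of a VLM-proposed action sequence. The field names
# are illustrative only; the actual schema is set by the agent's prompts.
vlm_response = json.loads("""
{
  "actions": [
    {"type": "click", "x": 412, "y": 236},
    {"type": "type_text", "text": "hello world"},
    {"type": "hotkey", "keys": ["command", "enter"]}
  ]
}
""")

# The supervisor would inspect this list before approving it for execution.
for action in vlm_response["actions"]:
    print(action["type"], action)
```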
- GUI Control: Automates clicks, double-clicks, text input, scrolling, and keyboard shortcuts.
- Visual Feedback: Displays overlays on-screen to indicate which action is currently being executed.
- Audio Feedback: Plays sounds to notify of different stages (new task, success, error).
- Detailed Logging: Saves screenshots, model decisions, and executed actions for each step, facilitating debugging.
- Flexible Configuration: Models and API endpoints are configurable via environment variables.
- Robust Error Handling: The supervisor (Qwen) can detect when the VLM fails and attempt to correct course or re-issue instructions.
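To give a rough idea of what the GUI-control layer involves, here is a minimal, hypothetical executor sketch built on `pyautogui`. It is not the agent's actual code; per the features above, the real agent also draws overlays and logs each action.

```python
# Minimal sketch of executing micro-actions with pyautogui (illustrative only).
import pyautogui

def execute_action(action: dict) -> None:
    kind = action["type"]
    if kind == "click":
        pyautogui.click(action["x"], action["y"])
    elif kind == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif kind == "type_text":
        pyautogui.write(action["text"], interval=0.02)
    elif kind == "scroll":
        pyautogui.scroll(action["amount"])
    elif kind == "hotkey":
        pyautogui.hotkey(*action["keys"])
    else:
        raise ValueError(f"Unknown action type: {kind}")
```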
Prerequisites
- Python 3.8+
- A local model server compatible with the OpenAI API (e.g., LM Studio, Ollama). You will need to load the required VLM and LLM models onto it.
- macOS (as `pyautogui` and `pynput` behaviors can vary by OS).
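Before running the agent, it can help to confirm that the local server is reachable and the models are loaded. Here is a quick check using the `openai` Python client; the base URL and dummy API key are assumptions matching LM Studio's defaults.

```python
# Sanity check that the local OpenAI-compatible server is up and which
# models it exposes. Base URL and API key are assumptions (LM Studio defaults).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)
```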
Steps
- Clone the repository:

  ```bash
  git clone https://github.com/eauchs/gui-agent.git
  cd gui-agent
  ```
- Create a virtual environment and activate it:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install the Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The agent is configured using environment variables. You can set them in your terminal before running the script, or use a `.env` file (with `pip install python-dotenv`); a short sketch of reading these variables in Python follows the list below.
- API Endpoint: Ensure your local server is running. The default URL is `http://localhost:1234/v1`.

  ```bash
  export OPENAI_API_BASE_URL="http://localhost:1234/v1"
  ```
- Model Names: These names must exactly match those loaded in your local server.

  ```bash
  # Model for visual analysis (VLM)
  export VLM_MODEL_NAME_FOR_API="internvl3-8b-instruct"
  # Model for strategy (LLM)
  export QWEN_MODEL_NAME_FOR_API="qwen/qwen3-8b"
  ```
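As a minimal sketch of how this configuration can be read in Python: the variable names match the exports above, while the default values and the optional use of `python-dotenv` are assumptions, not necessarily what `autonomous_gui_agent.py` does.

```python
# Minimal sketch of reading the agent's configuration from the environment.
# Variable names match the exports above; defaults and the use of python-dotenv
# are assumptions for illustration.
import os

from dotenv import load_dotenv  # optional: pip install python-dotenv

load_dotenv()  # picks up a .env file in the working directory, if present

API_BASE_URL = os.environ.get("OPENAI_API_BASE_URL", "http://localhost:1234/v1")
VLM_MODEL = os.environ.get("VLM_MODEL_NAME_FOR_API", "internvl3-8b-instruct")
QWEN_MODEL = os.environ.get("QWEN_MODEL_NAME_FOR_API", "qwen/qwen3-8b")

print(API_BASE_URL, VLM_MODEL, QWEN_MODEL)
```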
Once dependencies are installed and environment variables are set, run the main script from your terminal:
```bash
python autonomous_gui_agent.py
```

The agent will prompt you to enter a global objective.
Example Objectives:
- "Open Chrome, go to
https://www.google.com/search?q=google.comand search for images of cute cats." - "Open the terminal, list the files in the current directory, then create a new folder called 'test_agent'."
- "Check if there are any system updates available in System Preferences."
To stop the agent, you can type `exit` or `quit` when prompted for an objective, or use Ctrl+C in the terminal.
During execution, the agent automatically creates:
- `agent_gui_screenshots_api/`: A folder containing a screenshot for each step of the task.
- `agent_gui_screenshots_api/detailed_interaction_log.txt`: A highly detailed log file recording prompts, raw model responses, and executed actions. Useful for debugging.
- `audio_feedback/`: Contains the sound files generated for audio feedback.