This project is a Graphical User Interface (GUI) automation agent for macOS, designed with a two-layer architecture to separate visual perception from strategic decision-making. It can understand high-level user objectives, analyze the screen, and execute sequences of actions (clicks, typing, keyboard shortcuts) to achieve those goals.
The agent relies on two collaborating Language Models (LLMs):
- The Frontend VLM (Perception): A Vision Language Model that acts as the agent's "eyes." It receives a screenshot and a specific instruction from the supervisor. Its sole task is to analyze the image and propose a sequence of micro-actions (e.g., "click the button at position [x, y]," "type 'hello world'") in a strict JSON format.
  - Model used (configurable): `internvl3-8b-instruct`
- The Backend LLM (Strategy): A standard LLM that acts as the agent's "brain." It receives the user's overall goal, analyzes the VLM's output (or failure), evaluates if the plan is relevant, and makes the final decision to:
  - Give a new instruction to the VLM to refine the action.
  - Approve the action sequence proposed by the VLM for execution.
  - Correct or propose its own action sequence if the VLM is stuck or making repeated errors.
  - Determine if the task is complete or has failed.
  - Model used (configurable): `qwen/qwen3-8b`
This separation of concerns delegates the complex visual analysis task to a specialized model, while using a more "generalist" and strategic LLM for logic, error correction, and long-term planning.
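For illustration, a VLM-proposed action sequence might look like the JSON handled in the sketch below. The field names and action types shown are hypothetical; the real schema is defined by the prompts the supervisor sends to the VLM.

```python
import json

# Hypothetical shape of a VLM-proposed action sequence. The field names
# are illustrative only; the actual schema is set by the agent's prompts.
vlm_response = json.loads("""
{
  "actions": [
    {"type": "click", "x": 412, "y": 236},
    {"type": "type_text", "text": "hello world"},
    {"type": "hotkey", "keys": ["command", "enter"]}
  ]
}
""")

# The supervisor would inspect this list before approving it for execution.
for action in vlm_response["actions"]:
    print(action["type"], action)
```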
- GUI Control: Automates clicks, double-clicks, text input, scrolling, and keyboard shortcuts.
- Visual Feedback: Displays overlays on-screen to indicate which action is currently being executed.
- Audio Feedback: Plays sounds to notify of different stages (new task, success, error).
- Detailed Logging: Saves screenshots, model decisions, and executed actions for each step, facilitating debugging.
- Flexible Configuration: Models and API endpoints are configurable via environment variables.
- Robust Error Handling: The supervisor (Qwen) can detect when the VLM fails and attempt to correct course or re-issue instructions.
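To give a rough idea of what the GUI-control layer involves, here is a minimal, hypothetical executor sketch built on `pyautogui`. It is not the agent's actual code; per the features above, the real agent also draws overlays and logs each action.

```python
# Minimal sketch of executing micro-actions with pyautogui (illustrative only).
import pyautogui

def execute_action(action: dict) -> None:
    kind = action["type"]
    if kind == "click":
        pyautogui.click(action["x"], action["y"])
    elif kind == "double_click":
        pyautogui.doubleClick(action["x"], action["y"])
    elif kind == "type_text":
        pyautogui.write(action["text"], interval=0.02)
    elif kind == "scroll":
        pyautogui.scroll(action["amount"])
    elif kind == "hotkey":
        pyautogui.hotkey(*action["keys"])
    else:
        raise ValueError(f"Unknown action type: {kind}")
```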
Prerequisites
- Python 3.8+
- A local model server compatible with the OpenAI API (e.g., LM Studio, Ollama). You will need to load the required VLM and LLM models onto it.
- macOS (as `pyautogui` and `pynput` behaviors can vary by OS).
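Before running the agent, it can help to confirm that the local server is reachable and the models are loaded. Here is a quick check using the `openai` Python client; the base URL and dummy API key are assumptions matching LM Studio's defaults.

```python
# Sanity check that the local OpenAI-compatible server is up and which
# models it exposes. Base URL and API key are assumptions (LM Studio defaults).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)
```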
Steps
- Clone the repository:

  ```bash
  git clone https://github.com/eauchs/gui-agent.git
  cd gui-agent
  ```
- Create a virtual environment and activate it:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install the Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The agent is configured using environment variables. You can set them in your terminal before running the script, or use a `.env` file (with `pip install python-dotenv`); a short sketch of reading these variables in Python follows the list below.
- API Endpoint: Ensure your local server is running. The default URL is `http://localhost:1234/v1`.

  ```bash
  export OPENAI_API_BASE_URL="http://localhost:1234/v1"
  ```
- Model Names: These names must exactly match those loaded in your local server.

  ```bash
  # Model for visual analysis (VLM)
  export VLM_MODEL_NAME_FOR_API="internvl3-8b-instruct"
  # Model for strategy (LLM)
  export QWEN_MODEL_NAME_FOR_API="qwen/qwen3-8b"
  ```
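As a minimal sketch of how this configuration can be read in Python: the variable names match the exports above, while the default values and the optional use of `python-dotenv` are assumptions, not necessarily what `autonomous_gui_agent.py` does.

```python
# Minimal sketch of reading the agent's configuration from the environment.
# Variable names match the exports above; defaults and the use of python-dotenv
# are assumptions for illustration.
import os

from dotenv import load_dotenv  # optional: pip install python-dotenv

load_dotenv()  # picks up a .env file in the working directory, if present

API_BASE_URL = os.environ.get("OPENAI_API_BASE_URL", "http://localhost:1234/v1")
VLM_MODEL = os.environ.get("VLM_MODEL_NAME_FOR_API", "internvl3-8b-instruct")
QWEN_MODEL = os.environ.get("QWEN_MODEL_NAME_FOR_API", "qwen/qwen3-8b")

print(API_BASE_URL, VLM_MODEL, QWEN_MODEL)
```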
Once dependencies are installed and environment variables are set, run the main script from your terminal:
```bash
python autonomous_gui_agent.py
```

The agent will prompt you to enter a global objective.
Example Objectives:
- "Open Chrome, go to
https://www.google.com/search?q=google.comand search for images of cute cats." - "Open the terminal, list the files in the current directory, then create a new folder called 'test_agent'."
- "Check if there are any system updates available in System Preferences."
To stop the agent, you can type `exit` or `quit` when prompted for an objective, or use Ctrl+C in the terminal.
During execution, the agent automatically creates:
- `agent_gui_screenshots_api/`: A folder containing a screenshot for each step of the task.
- `agent_gui_screenshots_api/detailed_interaction_log.txt`: A highly detailed log file recording prompts, raw model responses, and executed actions. Useful for debugging.
- `audio_feedback/`: Contains the sound files generated for audio feedback.