🧠 MUSE: A Memory-Utilizing and Self-Evolving Agent

Learning on the Job: An Experience-Driven, Self-Evolving Agent for Long-Horizon Tasks
📄 Paper on arXiv (2510.08002)

✨ Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are static at test time and cannot learn from experience, so they cannot accumulate knowledge or improve on the job. To address this challenge, we propose MUSE, a novel agent framework built around an experience-driven, self-evolving system with a hierarchical Memory Module at its center. MUSE organizes experience at several levels and uses it to plan and execute long-horizon tasks across multiple applications. After each sub-task, the agent autonomously reflects on its trajectory, converts the raw trajectory into structured experience, and integrates it back into the Memory Module. This mechanism lets the agent evolve beyond its static pretrained parameters and supports continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC (The Agent Company), where it achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Extensive experiments demonstrate that as the agent autonomously accumulates experience, its task completion capabilities steadily improve, and it shows robust continuous learning and self-evolution. Moreover, the experience accumulated by MUSE generalizes well, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.
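The execute, reflect, and store loop described in the abstract can be illustrated with a small sketch. This is not the MUSE implementation: every name here (`Experience`, `MemoryModule`, `reflect`, `run_subtask`), the level label, and the keyword-overlap retrieval are assumptions made purely for illustration, and the reflection step is a plain function standing in for an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    """A structured lesson distilled from one sub-task (hypothetical schema)."""
    level: str    # assumed label for the memory level, e.g. "procedural"
    subtask: str  # the sub-task this lesson came from
    lesson: str   # short summary of what happened


@dataclass
class MemoryModule:
    """Toy hierarchical store: experiences grouped by level."""
    levels: dict[str, list[Experience]] = field(default_factory=dict)

    def add(self, exp: Experience) -> None:
        self.levels.setdefault(exp.level, []).append(exp)

    def retrieve(self, subtask: str) -> list[Experience]:
        # Naive keyword overlap; a real system would likely use something richer.
        words = set(subtask.lower().split())
        return [
            exp
            for exps in self.levels.values()
            for exp in exps
            if words & set(exp.subtask.lower().split())
        ]


def reflect(subtask: str, trajectory: list[str], success: bool) -> Experience:
    """Stand-in for LLM reflection: turn a raw trajectory into a lesson."""
    outcome = "worked" if success else "failed"
    lesson = f"{len(trajectory)} steps {outcome}: " + " -> ".join(trajectory)
    return Experience(level="procedural", subtask=subtask, lesson=lesson)


def run_subtask(memory: MemoryModule, subtask: str, execute) -> bool:
    """Retrieve relevant experience, execute, then write the reflection back."""
    hints = memory.retrieve(subtask)
    trajectory, success = execute(subtask, hints)
    memory.add(reflect(subtask, trajectory, success))
    return success
```

In this sketch, `execute` is any callable that takes the sub-task and the retrieved hints and returns a `(trajectory, success)` pair. Each call to `run_subtask` leaves one more entry in memory, so later sub-tasks with overlapping wording see the earlier lessons.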

🏆 Benchmark Performance

MUSE ranks #1 on The Agent Company Benchmark Leaderboard.

🚀 Quick Start

Step 1: Environment Setup

conda create -n MUSE python=3.12
conda activate MUSE
pip install -r requirements.txt
playwright install chromium
playwright install-deps chromium
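After running the setup commands, a quick sanity check can catch a wrong interpreter or a missing package before the demo is launched. The helper below is a hypothetical sketch that is not part of the repository; it only uses the standard library, takes the Python 3.12 requirement from the `conda create` command above, and assumes that `playwright` is the module name to look for. It does not verify that the Chromium browser binary itself was downloaded.

```python
import importlib.util
import sys


def check_environment(modules, min_python=(3, 12)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ expected, found {found}"
        )
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems


if __name__ == "__main__":
    # "playwright" is an assumed module name based on the setup commands.
    for line in check_environment(["playwright"]) or ["environment looks OK"]:
        print(line)
```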

Step 2: Run Local Demo

python demo.py

🧪 Run TAC Benchmark

To evaluate MUSE on The Agent Company Benchmark, please follow the detailed setup in: 👉 TheAgentCompanyForMuse Repository

🎥 Demo Showcase

Task 1: HR - Internal Tooling Slides

Task 2: PM - Updates Plane Issue from GitLab Status

Folders and files

memory
misc
prompt
toolbox
.gitattributes
.gitignore
LICENSE
README.md
agent.py
browser.py
config.yaml
demo.py
log.py
memory_manager.py
model.py
monitor.py
report.py
requirements.txt
run.py
tool.py
utils.py


KnowledgeXLab/MUSE
